Hi Anders,

The main problem is that IMMND doesn't handle an MDS failure when discarding a
client:
   immnd_proc_imma_discard_connection()

IMMD can be on a different node from the IMMND that sends the discard messages,
so an MDS failure is possible.
In that case an implementer can stay marked as dying forever.
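
To make the idea concrete, here is a minimal sketch of the retry logic. It is
not the actual patch: pending_discard, retry_pending_discards() and
failing_send() are hypothetical names used only for illustration, and the
6-second interval is taken from the DEFAULT_TIMEOUT_SEC value mentioned in the
patch description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Assumed retry interval; the patch description mentions
 * DEFAULT_TIMEOUT_SEC (6 sec). */
#define RETRY_INTERVAL_SEC 6

/* Hypothetical record of a discard that IMMD has not yet confirmed. */
struct pending_discard {
	unsigned int impl_id; /* implementer currently marked as dying */
	time_t last_attempt;  /* time of the most recent send attempt */
	bool done;            /* set once the discard has come back from IMMD */
};

/* Hypothetical send hook: returns true if the transport accepted the message. */
typedef bool (*send_fn)(unsigned int impl_id);

/* Stand-in for a send that always fails, for demonstration only. */
static bool failing_send(unsigned int impl_id)
{
	(void)impl_id;
	return false;
}

/* Re-send every discard that is still outstanding and whose retry interval
 * has expired. Returns the number of send attempts made in this pass. */
static size_t retry_pending_discards(struct pending_discard *p, size_t n,
				     time_t now, send_fn send)
{
	size_t attempts = 0;

	assert(p != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (p[i].done)
			continue; /* already discarded everywhere */
		if (now - p[i].last_attempt < RETRY_INTERVAL_SEC)
			continue; /* too early to retry */
		p[i].last_attempt = now;
		/* The result is deliberately ignored: even an accepted message
		 * can be lost before it reaches IMMD, so the entry stays
		 * pending until 'done' is set. */
		(void)send(p[i].impl_id);
		attempts++;
	}
	return attempts;
}
```

The point of the design is that an entry is cleared only when the discard is
seen to have taken effect, never just because the send call returned success.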

Also, please see my answers inline.

BR,

Hung Nguyen - DEK Technologies


--------------------------------------------------------------------------------
From: Anders Bjornerstedt anders.bjornerst...@telia.com
Sent: Friday, June 03, 2016 5:32PM
To: Hung Nguyen, Zoran Milinkovic, Neelakanta Reddy
     hung.d.ngu...@dektech.com.au, zoran.milinko...@ericsson.com, 
reddy.neelaka...@oracle.com
Cc: Opensaf-devel
     opensaf-devel@lists.sourceforge.net
Subject: Re: [devel] [PATCH 0 of 1] Review Request for imm: Retry discarding 
client if it is not completely discarded [#1855]


Hi

There is something wrong with this ticket.
The discard node message has always worked.
So if you see a problem with this mechanism then it must be some recently 
introduced problem.
Of course in the headless state all bets are off. But I assume the headless 
state is not involved here.

[Hung] Neel and I were talking about the case where IMMD going down can lead
to a node reboot. The node reboot (discard node message) can cure an
implementer that is marked as dying. But IMMD is not always on the same
node as the implementer, so the node discard message doesn't always
discard the dying implementer. I didn't mean that the node discard
message fails to discard the node. The node discard message still works
perfectly.

If I remember correctly, the discard node message is generated either by the
active IMMD or by the IMMND coord (colocated with the active IMMD). I have
never seen a case where node-local MDS failed. The IMMD sends the message to
all IMMNDs over fevs.
This means the message cannot be lost. Actually it can, but that would result
in a fevs count mismatch (a gap in the fevs sequence).
Such gaps are caught and escalated.

A fevs count mismatch is visible in the syslog.
If you don't see it, then there is no lost message.

There may of course be the possibility that you have introduced a bug 
relatively recently.




/AndersBj


> ----Original message----
> From : hung.d.ngu...@dektech.com.au
> Date : 2016-06-03 - 09:23 (WEST)
> To : reddy.neelaka...@oracle.com, zoran.milinko...@ericsson.com
> Cc : opensaf-devel@lists.sourceforge.net
> Subject : Re: [devel] [PATCH 0 of 1] Review Request for imm: Retry discarding 
> client if it is not completely discarded [#1855]
>
> Hi Neel,
>
> The IMMD is not always on the same node as the IMMND that sends the discard
> message.
> IMMND can also be on a PL.
>
> If IMMD is on a different node, the DISCARD_NODE message doesn't work either.
>
> BR,
>
>
> Hung Nguyen - DEK Technologies
>
>
> --------------------------------------------------------------------------------
> From: Neelakanta Reddy reddy.neelaka...@oracle.com
> Sent: Friday, June 03, 2016 3:19PM
> To: Hung Nguyen, Zoran Milinkovic
>      hung.d.ngu...@dektech.com.au, zoran.milinko...@ericsson.com
> Cc: Opensaf-devel
>      opensaf-devel@lists.sourceforge.net
> Subject: Re: [PATCH 0 of 1] Review Request for imm: Retry discarding client 
> if it is not completely discarded [#1855]
>
>
> Hi Hung,
>
> Ticket #1855 mainly discusses two problems:
>
> 1. IMMND not sending the message to IMMD.
>
> The message from IMMND to IMMD is internal to the node, and it must
> be delivered to IMMD.
> If the internal message is not delivered, then there is a serious problem
> in the transport,
> which means the cluster is in a bad state, and this type of case may not be
> supported.
>
> 2. IMMD crash:
>
> If an IMMD crash happens, then the node goes for reboot. There will be a
> DISCARD_NODE message and eventually the implementer will be
> discarded on the remaining nodes (if the implementer originated on the
> discarded node).
>
> /Neel.
>
> On 2016/06/03 01:27 PM, Hung Nguyen wrote:
>> Hi Neel,
>>
>> There's a TR saying that the messages can't reach the IMMD.
>> They admit that there are some TIPC problems going on.
>>
>> But they want IMM to be consistent in that case.
>> The implementer should not be marked as dying forever just because some
>> messages were lost.
>>
>> That has a high impact on their application, as the OI can't set the
>> implementer name and fails to provide the service.
>>
>> Both mds.log and mds.log.old were rotated.
>> We don't have any information about that, except that they confirmed it was
>> a TIPC problem.
>>
>>
>> BR,
>> Hung Nguyen - DEK Technologies
>>
>> --------------------------------------------------------------------------------
>> From: Neelakanta Reddy reddy.neelaka...@oracle.com
>> Sent: Friday, June 03, 2016 2:21PM
>> To: Hung Nguyen, Zoran Milinkovic
>>       hung.d.ngu...@dektech.com.au, zoran.milinko...@ericsson.com
>> Cc:
>>       
>> Subject: Re: [PATCH 0 of 1] Review Request for imm: Retry discarding client 
>> if it is not completely discarded [#1855]
>>
>>
>> Hi Hung,
>>
>> [off the list]
>>
>> Why is IMMD_EVT_ND2D_DISCARD_IMPL not sent? Do you have MDS (transport)
>> problems?
>> If you have mds.log, can you please share it along with the syslog?
>>
>>
>> Regards,
>> Neel.
>>
>> On 2016/05/31 04:53 PM, Hung Nguyen wrote:
>>> Summary: imm: Retry discarding client if it is not completely discarded [#1855]
>>> Review request for Trac Ticket(s): 1855
>>> Peer Reviewer(s): Zoran, Neel
>>> Pull request to:
>>> Affected branch(es): 4.7, 5.0, 5.1
>>> Development branch: 5.1
>>>
>>> --------------------------------
>>> Impacted area       Impact y/n
>>> --------------------------------
>>>    Docs                    n
>>>    Build system            n
>>>    RPM/packaging           n
>>>    Configuration files     n
>>>    Startup scripts         n
>>>    SAF services            n
>>>    OpenSAF services        y
>>>    Core libraries          n
>>>    Samples                 n
>>>    Tests                   n
>>>    Other                   n
>>>
>>>
>>> Comments (indicate scope for each "y" above):
>>> ---------------------------------------------
>>>
>>>
>>> changeset ce30dc25dcb608e9a4562b9afdca91457620a0dd
>>> Author:    Hung Nguyen <hung.d.ngu...@dektech.com.au>
>>> Date:    Tue, 31 May 2016 18:19:11 +0700
>>>
>>>      imm: Retry discarding client if it is not completely discarded [#1855]
>>>
>>>      When discarding a client, there are chances that IMMND fails to send
>>>      impl/admo/ccb discard messages or IMMD can't receive those messages.
>>>      In that case, we have to make sure that those messages are re-sent.
>>>      If not, we will have admo/ccb resource leak or implementers marked as
>>>      dying forever.
>>>      DEFAULT_TIMEOUT_SEC (6sec) is used as retry interval.
>>>
>>>
>>> Complete diffstat:
>>> ------------------
>>>    osaf/services/saf/immsv/immnd/ImmModel.cc  |  86
>>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>>    osaf/services/saf/immsv/immnd/ImmModel.hh  |   2 ++
>>>    osaf/services/saf/immsv/immnd/immnd_init.h |   4 ++++
>>>    osaf/services/saf/immsv/immnd/immnd_proc.c |  43
>>> ++++++++++++++++++++++++++++++++-----------
>>>    4 files changed, 123 insertions(+), 12 deletions(-)
>>>
>>>
>>> Testing Commands:
>>> -----------------
>>>
>>>
>>>
>>> Testing, Expected Results:
>>> --------------------------
>>>
>>>
>>>
>>> Conditions of Submission:
>>> -------------------------
>>> Ack from reviewers.
>>>
>>>
>>> Arch      Built     Started    Linux distro
>>> -------------------------------------------
>>> mips        n          n
>>> mips64      n          n
>>> x86         n          n
>>> x86_64      n          n
>>> powerpc     n          n
>>> powerpc64   n          n
>>>
>>>
>>> Reviewer Checklist:
>>> -------------------
>>> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>>>
>>>
>>> Your checkin has not passed review because (see checked entries):
>>>
>>> ___ Your RR template is generally incomplete; it has too many blank entries
>>>       that need proper data filled in.
>>>
>>> ___ You have failed to nominate the proper persons for review and push.
>>>
>>> ___ Your patches do not have proper short+long header
>>>
>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>
>>> ___ You have exceeded a sensible line length in your headers/comments/text.
>>>
>>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>>
>>> ___ You have incorrectly put/left internal data in your comments/files
>>>       (i.e. internal bug tracking tool IDs, product names etc)
>>>
>>> ___ You have not given any evidence of testing beyond basic build tests.
>>>       Demonstrate some level of runtime or other sanity testing.
>>>
>>> ___ You have ^M present in some of your files. These have to be removed.
>>>
>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>>       like trailing spaces, or spaces before tabs.
>>>
>>> ___ You have mixed real technical changes with whitespace and other
>>>       cosmetic code cleanup changes. These have to be separate commits.
>>>
>>> ___ You need to refactor your submission into logical chunks; there is
>>>       too much content into a single commit.
>>>
>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>
>>> ___ You have giant attachments which should never have been sent;
>>>       Instead you should place your content in a public tree to be pulled.
>>>
>>> ___ You have too many commits attached to an e-mail; resend as threaded
>>>       commits, or place in a public tree for a pull.
>>>
>>> ___ You have resent this content multiple times without a clear indication
>>>       of what has changed between each re-send.
>>>
>>> ___ You have failed to adequately and individually address all of the
>>>       comments and change requests that were proposed in the initial review.
>>>
>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>>
>>> ___ Your computer has a badly configured date and time, confusing
>>>       the threaded patch review.
>>>
>>> ___ Your changes affect IPC mechanism, and you don't present any results
>>>       for in-service upgradability test.
>>>
>>> ___ Your changes affect user manual and documentation, but your patch series
>>>       does not contain the patch that updates the Doxygen manual.
>>>
> _______________________________________________
> Opensaf-devel mailing list
> Opensaf-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>
