[jira] [Commented] (GEODE-6244) Healthy member kicked out by Sick member when final-check fails

ASF subversion and git services (JIRA) Fri, 18 Jan 2019 10:48:57 -0800


    [ https://issues.apache.org/jira/browse/GEODE-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746568#comment-16746568 ]


ASF subversion and git services commented on GEODE-6244:
--------------------------------------------------------

Commit ffd6b38e78ccb7f4a1b451cbcf59c16e7696393e in geode's branch 
refs/heads/develop from Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=ffd6b38 ]

GEODE-6244 Healthy member kicked out by sick member

- do not allow membership manager suspect initiation to kick out a
member on the first failed check
- perform a self-health check before sending SuspectRequest messages
- consider members who have sent shutdown messages as gone when
performing "should I become coordinator" checks in GMSHealthMonitor
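
The second and third items in the commit message can be illustrated with a
small sketch. It is not the code from the commit: the class, method, and
parameter names below are hypothetical, and the coordinator rule (take over
only when every member ahead of you in the view is unavailable) is an
assumption made for illustration.

```java
import java.util.List;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: these names are illustrative and are not Geode APIs.
final class CoordinatorCheckSketch {

  /** Report a peer as suspect only if this process itself looks healthy. */
  static boolean shouldSendSuspectRequest(boolean peerCheckFailed,
      BooleanSupplier selfHealthy) {
    return peerCheckFailed && selfHealthy.getAsBoolean();
  }

  /**
   * Assumed rule: a member takes over as coordinator when every member ahead
   * of it in the ordered view is unavailable. Members that have announced a
   * shutdown count as unavailable, the same as suspected members.
   */
  static boolean shouldBecomeCoordinator(List<String> viewInOrder, String self,
      Set<String> suspected, Set<String> sentShutdown) {
    for (String member : viewInOrder) {
      if (member.equals(self)) {
        return true; // nobody ahead of us is still available
      }
      boolean gone = suspected.contains(member) || sentShutdown.contains(member);
      if (!gone) {
        return false; // an available member is ahead of us in the view
      }
    }
    return false; // self is not in the view at all
  }

  public static void main(String[] args) {
    List<String> view = List.of("locator1", "serverA", "serverB");

    // locator1 has sent a shutdown message, so serverA may take over.
    System.out.println(shouldBecomeCoordinator(
        view, "serverA", Set.of(), Set.of("locator1")));

    // A member whose own health check fails does not report others.
    System.out.println(shouldSendSuspectRequest(true, () -> false));
  }
}
```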


> Healthy member kicked out by Sick member when final-check fails
> ---------------------------------------------------------------
>
>                 Key: GEODE-6244
>                 URL: https://issues.apache.org/jira/browse/GEODE-6244
>             Project: Geode
>          Issue Type: New Feature
>          Components: membership
>    Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.2.1, 1.4.0, 1.5.0, 1.6.0, 
> 1.7.0, 1.8.0
>            Reporter: Bruce Schuchardt
>            Priority: Major
>             Fix For: 1.9.0
>
>
> I observed this in a user's logs & can't include artifacts:  Clients were 
> herding to one server when another server was being slow to return results.  
> The clients caused the server to run out of file descriptors because the 
> descriptor limit was set pretty low.  When that happened the server had 
> trouble forming an outgoing tcp/ip connection to another server.  It tried 
> using MembershipManager.verifyMember() which also failed to connect to the 
> other server.  When that happened it sent a RemoveMessage to the locators and 
> several of the other servers, including the one it couldn't connect to.  That 
> server immediately shut itself down.
>
> MembershipManager.verifyMember() is documented to only initiate suspect 
> processing on the target, not initiate immediate removal.  This is supposed 
> to be done so that some other process (i.e., the membership coordinator) will 
> do additional checking on the suspect in case the initiator is itself sick.  
> That was the case in this situation.
>
> - serverA unable to connect to serverB
> - serverA performs tcp/ip check in verifyMember
> - serverA's tcp/ip check fails (it's out of file descriptors, duh)
> - serverA sends RemoveMember message to locators and serverB
> - serverB shuts itself down (ForcedDisconnect)
>
> The behavior should instead be
>
> - serverA unable to connect to serverB
> - serverA performs tcp/ip check in verifyMember
> - serverA's tcp/ip check fails (it's out of file descriptors, duh)
> - serverA sends SuspectMember message to locators & other servers
> - coordinator performs tcp/ip and heartbeat check on the suspect
> - coordinator determines suspect is available
>
> This is all due to the checkMember call in GMSMembershipManager passing 
> _true_ for the _initiateRemoval_ parameter.  It should be passing _false_.
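
The difference between the two sequences in the description can be sketched
as a small decision function. This is a minimal illustration, not Geode code:
the class, enum, and method names are hypothetical, and only the
initiateRemoval flag is taken from the issue text.

```java
// Hypothetical sketch: none of these types are Geode classes.
final class VerifyMemberSketch {

  enum Action { NONE, SUSPECT, REMOVE }

  /** What the initiator does after its own connection check on a peer. */
  static Action afterLocalCheck(boolean checkPassed, boolean initiateRemoval) {
    if (checkPassed) {
      return Action.NONE;
    }
    // With initiateRemoval=true a sick initiator can remove a healthy peer.
    return initiateRemoval ? Action.REMOVE : Action.SUSPECT;
  }

  /** The coordinator removes a suspect only if its own check also fails. */
  static boolean coordinatorRemoves(boolean coordinatorCheckPassed) {
    return !coordinatorCheckPassed;
  }

  public static void main(String[] args) {
    // serverA is out of file descriptors, so its check of serverB fails.
    boolean serverACheckPassed = false;
    // The coordinator is healthy and can still reach serverB.
    boolean coordinatorCheckPassed = true;

    System.out.println("initiateRemoval=true:  "
        + afterLocalCheck(serverACheckPassed, true));
    System.out.println("initiateRemoval=false: "
        + afterLocalCheck(serverACheckPassed, false)
        + ", coordinator removes serverB: "
        + coordinatorRemoves(coordinatorCheckPassed));
  }
}
```

With the flag set to false, the final decision moves to a second process, so
a failure that is local to the initiator no longer removes a reachable member.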



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
