[ https://issues.apache.org/jira/browse/GEODE-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746568#comment-16746568 ]
ASF subversion and git services commented on GEODE-6244:
--------------------------------------------------------

Commit ffd6b38e78ccb7f4a1b451cbcf59c16e7696393e in geode's branch refs/heads/develop from Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=ffd6b38 ]

GEODE-6244 Healthy member kicked out by sick member

- do not allow membership manager suspect initiation to kick out a member
  on the first failed check
- perform a self-health check before sending SuspectRequest messages
- consider members who have sent shutdown messages as gone when performing
  "should I become coordinator" checks in GMSHealthMonitor


> Healthy member kicked out by Sick member when final-check fails
> ---------------------------------------------------------------
>
>                 Key: GEODE-6244
>                 URL: https://issues.apache.org/jira/browse/GEODE-6244
>             Project: Geode
>          Issue Type: New Feature
>          Components: membership
>    Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.2.1, 1.4.0, 1.5.0, 1.6.0,
> 1.7.0, 1.8.0
>            Reporter: Bruce Schuchardt
>            Priority: Major
>             Fix For: 1.9.0
>
>
> I observed this in a user's logs & can't include artifacts: Clients were
> herding to one server when another server was being slow to return results.
> The clients caused the server to run out of file descriptors because the
> descriptor limit was set pretty low. When that happened the server had
> trouble forming an outgoing tcp/ip connection to another server. It tried
> using MembershipManager.verifyMember() which also failed to connect to the
> other server. When that happened it sent a RemoveMessage to the locators and
> several of the other servers, including the one it couldn't connect to. That
> server immediately shut itself down.
>
> MembershipManager.verifyMember() is documented to only initiate suspect
> processing on the target, not initiate immediate removal. This is supposed
> to be done so that some other process (i.e., the membership coordinator) will
> do additional checking on the suspect in case the initiator is itself sick.
> That was the case in this situation.
>
>   serverA unable to connect to serverB
>   serverA performs tcp/ip check in verifyMember
>   serverA's tcp/ip check fails (it's out of file descriptors, duh)
>   serverA sends RemoveMember message to locators and serverB
>   serverB shuts itself down (ForcedDisconnect)
>
> The behavior should instead be
>
>   serverA unable to connect to serverB
>   serverA performs tcp/ip check in verifyMember
>   serverA's tcp/ip check fails (it's out of file descriptors, duh)
>   serverA sends SuspectMember message to locators & other servers
>   coordinator performs tcp/ip and heartbeat check on the suspect
>   coordinator determines suspect is available
>
> This is all due to the checkMember call in GMSMembershipManager passing
> _true_ for the _initiateRemoval_ parameter. It should be passing _false_.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
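
For readers following the two sequences in the issue description, here is a minimal, self-contained sketch of the difference between initiating removal and initiating suspect processing. It is not Geode code: the class SuspectFlowSketch, the Outcome enum, the onFailedCheck method, and the boolean initiatorSelfCheckPasses are all hypothetical names made up for illustration, and the coordinator's tcp/ip and heartbeat check is reduced to a simple predicate. Only the idea of an initiateRemoval flag and the self-health check before sending a suspect message come from the message above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative model only; none of these names exist in Apache Geode. */
public class SuspectFlowSketch {

    public enum Outcome { NO_ACTION, SUSPECT_CLEARED, MEMBER_REMOVED }

    /**
     * Models what happens after an initiator fails to reach a peer.
     *
     * @param suspect                  the member the initiator could not reach
     * @param initiateRemoval          true = old behavior (remove immediately),
     *                                 false = only start suspect processing
     * @param initiatorSelfCheckPasses hypothetical result of the initiator's self-health check
     * @param coordinatorCanReach      stand-in for the coordinator's final check on the suspect
     * @param events                   collects the messages that would be sent
     */
    public static Outcome onFailedCheck(String suspect,
                                        boolean initiateRemoval,
                                        boolean initiatorSelfCheckPasses,
                                        Predicate<String> coordinatorCanReach,
                                        List<String> events) {
        if (initiateRemoval) {
            // A single failed check from one (possibly sick) member removes the peer.
            events.add("initiator: remove " + suspect);
            return Outcome.MEMBER_REMOVED;
        }
        if (!initiatorSelfCheckPasses) {
            // The initiator itself looks unhealthy, so it does not accuse anyone.
            events.add("initiator: self-check failed, no suspect message sent");
            return Outcome.NO_ACTION;
        }
        events.add("initiator: suspect " + suspect);
        if (coordinatorCanReach.test(suspect)) {
            events.add("coordinator: " + suspect + " is available");
            return Outcome.SUSPECT_CLEARED;
        }
        events.add("coordinator: remove " + suspect);
        return Outcome.MEMBER_REMOVED;
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        // serverB is healthy, so the coordinator's check succeeds.
        Predicate<String> healthySuspect = member -> true;

        System.out.println(onFailedCheck("serverB", true, false, healthySuspect, events));
        System.out.println(onFailedCheck("serverB", false, true, healthySuspect, events));
        System.out.println(onFailedCheck("serverB", false, false, healthySuspect, events));
        events.forEach(System.out::println);
    }
}
```

With initiateRemoval set to true, the healthy serverB is removed on the word of a single member. With it set to false, the decision passes to the coordinator, which can still reach serverB and clears the suspicion. In the third call the hypothetical self-check stops a sick initiator before it sends anything.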