[
https://issues.apache.org/jira/browse/GEODE-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221777#comment-17221777
]
Dan Smith commented on GEODE-8659:
----------------------------------
This appears to be a race condition during the join process. In the logs, there
are two locators, one with port 41002 (locator A) and one with port 41003
(locator B). Locator A is configured with a locator-wait-time and the address
of locator B. Locator B has no wait time and no knowledge of other locators.
What I see in the test is this:
1. Locator A starts up. It is spinning and checking for Locator B to start
{noformat}
info 2020/10/26 17:42:42.970 GMT <pool-3-thread-1> tid=93] Unable to contact
locator HostAndPort[5199b02bbb6f:32857]: java.io.EOFException: Locator at
HostAndPort[5199b02bbb6f:32857] did not respond. This is normal if the locator
was shutdown. If it wasn't check its log for exceptions.
[info 2020/10/26 17:42:43.975 GMT <pool-3-thread-1> tid=93] Discovery state
after looking for membership coordinator is locatorsContacted=1;
findInViewResponses=0; alreadyTried=[];
registrants=[172.17.0.34(1:locator)<ec>:41002];
possibleCoordinator=172.17.0.34(1:locator)<ec>:41002; viewId=-1;
hasContactedAJoinedLocator=false; view=null; responses=[]
{noformat}
2. Locator B starts. It creates a distributed system and becomes the coordinator
{noformat}
[info 2020/10/26 17:42:44.818 GMT <Test worker> tid=27] Starting membership
services
[info 2020/10/26 17:42:45.003 GMT <Test worker> tid=27] This member is becoming
the membership coordinator with address 172.17.0.34(1:locator)<ec>:41003
[info 2020/10/26 17:42:45.010 GMT <Test worker> tid=27] Finished joining (took
60ms).
{noformat}
3. Locator A manages to contact Locator B after it has created the system but
*before* it has installed a view. This means it gets an interesting response.
It is not a failure, but it also does not have any joined members.
{noformat}
[info 2020/10/26 17:42:45.013 GMT <pool-3-thread-1> tid=93] received
FindCoordinatorResponse(coordinator=172.17.0.34(1:locator)<ec><v0>:41003,
fromView=false, viewId=null, registrants=[],
senderId=172.17.0.34(1:locator)<ec><v0>:41003, network partition detection
enabled=true, locators preferred as coordinators=true, view=null) from locator
HostAndPort[5199b02bbb6f:32857]
[info 2020/10/26 17:42:45.013 GMT <pool-3-thread-1> tid=93] Locator's address
indicates it is part of a distributed system so I will not become membership
coordinator on this attempt to join
[info 2020/10/26 17:42:45.022 GMT <pool-3-thread-1> tid=93] received
FindCoordinatorResponse(coordinator=172.17.0.34(1:locator)<ec>:41002,
fromView=false, viewId=null, registrants=[172.17.0.34(1:locator)<ec>:41002],
senderId=172.17.0.34(1:locator)<ec>:41002, network partition detection
enabled=true, locators preferred as coordinators=true, view=null) from locator
HostAndPort[5199b02bbb6f:46597]
[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] findCoordinator
chose 172.17.0.34(1:locator)<ec>:41002 out of these possible coordinators:
[172.17.0.34(1:locator)<ec><v0>:41003, 172.17.0.34(1:locator)<ec>:41002]
[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] Discovery state
after looking for membership coordinator is locatorsContacted=2;
findInViewResponses=0; alreadyTried=[];
registrants=[172.17.0.34(1:locator)<ec>:41002];
possibleCoordinator=172.17.0.34(1:locator)<ec>:41002; viewId=-1;
hasContactedAJoinedLocator=true; view=null; responses=[]
{noformat}
4. It appears because Locator A sees that it has contacted all of the locators,
and there is not actually a view in that response, it decides it should become
the coordinator as well, creating a split brain.
{noformat}
[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] found possible
coordinator 172.17.0.34(1:locator)<ec>:41002
[info 2020/10/26 17:42:45.023 GMT <pool-3-thread-1> tid=93] This member is
becoming the membership coordinator with address
172.17.0.34(1:locator)<ec>:41002
{noformat}
The problem seems to be the above response. Locator B is sending a response to
Locator A, but that is not informing Locator A that it should not become the
coordinator. It believe it's passing through this check because
*joinedMembersContacted* is still zero:
{code}
if (state.joinedMembersContacted <= 0 && ((now >= locatorGiveUpTime
&&
tries >= minimumRetriesBeforeBecomingCoordinator) ||
state.locatorsContacted >= locators.size())) {
synchronized (viewInstallationLock) {
becomeCoordinator();
}
{code}
There is *another* flag on SearchState that we are not checking in the above
block, which is *hasContactedAJoinedLocator*. So maybe we could check that as
well? However, it's not clear to me that it will definitely always be true in
the above case. Ideally, a locator will either (A) not respond, or (B) respond
and affirmatively indicate that it's becoming the coordinator in this case.
> MembershipIntegrationTest.secondMembershipPausesForLocatorWaitTime fails with
> ConditionTimeoutException
> -------------------------------------------------------------------------------------------------------
>
> Key: GEODE-8659
> URL: https://issues.apache.org/jira/browse/GEODE-8659
> Project: Geode
> Issue Type: Bug
> Components: membership, tests
> Reporter: Kirk Lund
> Priority: Major
>
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-5667/test-results/integrationTest/1603735181/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-5667/test-artifacts/1603735181/integrationtestfiles-geode-pr-5667.tgz
> {noformat}
> org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest
> > secondMembershipPausesForLocatorWaitTime FAILED
> org.awaitility.core.ConditionTimeoutException: Assertion condition
> defined as a lambda expression in
> org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest
> that uses org.apache.geode.distributed.internal.membership.api.Membership
> Expected size:<2> but was:<1> in:
> <[172.17.0.34(1:locator)<ec><v0>:41003]> within 5 minutes.
> at
> org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:165)
> at
> org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
> at
> org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
> at
> org.awaitility.core.ConditionFactory.until(ConditionFactory.java:895)
> at
> org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:679)
> at
> org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest.secondMembershipPausesForLocatorWaitTime(MembershipIntegrationTest.java:270)
> Caused by:
> java.lang.AssertionError:
> Expected size:<2> but was:<1> in:
> <[172.17.0.34(1:locator)<ec><v0>:41003]>
> at
> org.apache.geode.distributed.internal.membership.gms.MembershipIntegrationTest.lambda$secondMembershipPausesForLocatorWaitTime$8(MembershipIntegrationTest.java:271)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)