[
https://issues.apache.org/jira/browse/GEODE-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816695#comment-16816695
]
Shelley Lynn Hughes-Godfrey edited comment on GEODE-6646 at 4/12/19 10:35 PM:
------------------------------------------------------------------------------
In this test, we start a locator and two servers (server-1 and server-2).
We then forcefully disconnect server-2 and the locator, wait for the
locator to reconnect, and start server-3.
server-2 is also expected to reconnect. However, it looks like the locator,
server-2 and server-3 form a new DS (without server-1).
{noformat}
@Test
public void serverRestartsAfterLocatorReconnects() throws Exception {
  // Suppress log messages that are expected during a forced disconnect.
  IgnoredException.addIgnoredException("org.apache.geode.ForcedDisconnectException: for testing");
  IgnoredException.addIgnoredException("cluster configuration service not available");
  IgnoredException.addIgnoredException("This thread has been stalled");
  IgnoredException
      .addIgnoredException("member unexpectedly shut down shared, unordered connection");
  IgnoredException.addIgnoredException("Connection refused");

  // Start the locator and two servers.
  MemberVM locator0 = rule.startLocatorVM(0);
  rule.startServerVM(1, locator0.getPort());
  MemberVM server2 = rule.startServerVM(2, locator0.getPort());

  // Force-disconnect server-2 and then the locator, and wait for the locator to reconnect.
  addDisconnectListener(locator0);
  server2.forceDisconnect();
  locator0.forceDisconnect();
  waitForLocatorToReconnect(locator0);

  // Start server-3, then expect all four members to be listed.
  rule.startServerVM(3, locator0.getPort());
  gfsh.connectAndVerify(locator0);
  await()
      .untilAsserted(() -> gfsh.executeAndAssertThat("list members").statusIsSuccess()
          .tableHasColumnOnlyWithValues("Name", "locator-0", "server-1",
              "server-2", "server-3"));
}
{noformat}
The locator and server-2 are forcefully disconnected at 19:30:45. It looks like
server-1 tried to become the coordinator, but in the end it did not get any
responses from the others, and they seem to have created their own DS.
{noformat}
[vm2] [info 2019/04/12 19:30:45.491 UTC <RMI TCP
Connection(1)-172.17.0.2> tid=0x20] GroupMembershipService.beSick invoked
for 172.17.0.2(server-2:249)<v2>:41003 - simulating sickness
[vm2] [info 2019/04/12 19:30:45.491 UTC <RMI TCP
Connection(1)-172.17.0.2> tid=0x20] GroupMembershipService.playDead invoked
for 172.17.0.2(server-2:249)<v2>:41003
[vm0] [info 2019/04/12 19:30:45.716 UTC <RMI TCP
Connection(1)-172.17.0.2> tid=0x20] GroupMembershipService.beSick invoked
for 172.17.0.2(locator-0:1011:locator)<ec><v0>:41001 - simulating
sickness
[vm0] [info 2019/04/12 19:30:45.716 UTC <RMI TCP
Connection(1)-172.17.0.2> tid=0x20] GroupMembershipService.playDead invoked
for 172.17.0.2(locator-0:1011:locator)<ec><v0>:41001
{noformat}
vm1 (server-1) reports the locator and server-2 as suspect and becomes the
membership coordinator:
{noformat}
[vm1] [info 2019/04/12 19:30:50.772 UTC <Geode Failure Detection thread
3> tid=0xca] Availability check failed for member
172.17.0.2(server-2:249)<v2>:41003
[vm1] [info 2019/04/12 19:30:50.773 UTC <Geode Failure Detection thread
3> tid=0xca] Requesting removal of suspect member
172.17.0.2(server-2:249)<v2>:41003
[vm1] [info 2019/04/12 19:30:50.772 UTC <Geode Failure Detection thread
2> tid=0xc9] Availability check failed for member
172.17.0.2(locator-0:1011:locator)<ec><v0>:41001
[vm1] [info 2019/04/12 19:30:50.776 UTC <Geode Failure Detection thread
2> tid=0xc9] Requesting removal of suspect member
172.17.0.2(locator-0:1011:locator)<ec><v0>:41001
[vm1] [info 2019/04/12 19:30:50.776 UTC <Geode Failure Detection thread
2> tid=0xc9] This member is becoming the membership coordinator with address
172.17.0.2(server-1:245)<v1>:41002
[vm1] [info 2019/04/12 19:30:50.777 UTC <Geode Failure Detection thread
2> tid=0xc9] ViewCreator starting on:172.17.0.2(server-1:245)<v1>:41002
[vm1] [info 2019/04/12 19:30:50.777 UTC <Geode Membership View Creator>
tid=0xcb] View Creator thread is starting
[vm1] [info 2019/04/12 19:30:50.779 UTC <Geode Membership View Creator>
tid=0xcb] preparing new view View[172.17.0.2(server-1:245)<v1>:41002|9]
members: [172.17.0.2(server-1:245)<v1>:41002{lead},
172.17.0.2(server-2:249)<v2>:41003] crashed:
[172.17.0.2(locator-0:1011:locator)<ec><v0>:41001]
...
[vm1] [info 2019/04/12 19:31:41.970 UTC <Geode Membership View Creator>
tid=0xcb] sending new view View[172.17.0.2(server-1:245)<v1>:41002|12]
members: [172.17.0.2(server-1:245)<v1>:41002{lead}] crashed:
[172.17.0.2(locator-0:1011:locator)<ec><v11>:41001,
172.17.0.2(server-2:249)<v11>:41003]
[vm2] [info 2019/04/12 19:31:41.970 UTC <unicast
receiver,bba57c926507-60306> tid=0x8a] Ignoring the view
View[172.17.0.2(server-1:245)<v1>:41002|12] members:
[172.17.0.2(server-1:245)<v1>:41002{lead}] crashed:
[172.17.0.2(server-2:249)<v11>:41003,
172.17.0.2(locator-0:1011:locator)<ec><v11>:41001] from member
172.17.0.2<v1>:41002, which is not in my current view
View[172.17.0.2(locator-0:1011:locator)<ec><v0>:41001|1] members:
[172.17.0.2(locator-0:1011:locator)<ec><v0>:41001,
172.17.0.2(server-2:249)<v1>:41003{lead},
172.17.0.2(server-3:255)<v1>:41004]
[vm0] [info 2019/04/12 19:31:41.970 UTC <unicast
receiver,bba57c926507-9474> tid=0x31] Ignoring the view
View[172.17.0.2(server-1:245)<v1>:41002|12] members:
[172.17.0.2(server-1:245)<v1>:41002{lead}] crashed:
[172.17.0.2(server-2:249)<v11>:41003,
172.17.0.2(locator-0:1011:locator)<ec><v11>:41001] from member
172.17.0.2<v1>:41002, which is not in my current view
View[172.17.0.2(locator-0:1011:locator)<ec><v0>:41001|1] members:
[172.17.0.2(locator-0:1011:locator)<ec><v0>:41001,
172.17.0.2(server-2:249)<v1>:41003{lead},
172.17.0.2(server-3:255)<v1>:41004]
[vm0] [info 2019/04/12 19:31:42.024 UTC <RMI TCP
Connection(7)-172.17.0.2> tid=0x20] Executing command: list members
Command result for <list members>:
Name      | Id
--------- | --------------------------------------------------------------
locator-0 | 172.17.0.2(locator-0:1011:locator)<ec><v0>:41001 [Coordinator]
server-2  | 172.17.0.2(server-2:249)<v1>:41003
server-3  | 172.17.0.2(server-3:255)<v1>:41004
{noformat}
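The "Ignoring the view ... from member ..., which is not in my current view"
messages above suggest a simple rule: a member drops a view whose sender is not
in its own current view. Below is a minimal Java sketch of that rule and of the
resulting "missing member" check. It is not Geode's implementation: the class
ViewSplitSketch and its methods view, acceptsViewFrom and missing are
hypothetical names, and a view is reduced to a plain set of member names.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class ViewSplitSketch {

    /** Hypothetical, simplified stand-in for a membership view: a set of member names. */
    static Set<String> view(String... members) {
        return new LinkedHashSet<>(Arrays.asList(members));
    }

    /** Assumed rule: a receiver accepts a view only if the sender is in its current view. */
    static boolean acceptsViewFrom(Set<String> currentView, String sender) {
        return currentView.contains(sender);
    }

    /** Expected members that do not appear in the observed view. */
    static Set<String> missing(Set<String> expected, Set<String> observed) {
        Set<String> result = new LinkedHashSet<>(expected);
        result.removeAll(observed);
        return result;
    }

    public static void main(String[] args) {
        // Current view on vm0 and vm2 at 19:31:41, taken from the log above.
        Set<String> locatorSideView = view("locator-0", "server-2", "server-3");
        // Members the test expects "list members" to report.
        Set<String> expected = view("locator-0", "server-1", "server-2", "server-3");

        // server-1 is not in the locator-side view, so its view |12 is ignored.
        System.out.println(acceptsViewFrom(locatorSideView, "server-1")); // prints false
        // The same gap is what the test assertion reports.
        System.out.println(missing(expected, locatorSideView)); // prints [server-1]
    }
}
```

With the views from the log, the locator side rejects anything sent by
server-1, which is consistent with server-1 never showing up in the
"list members" output.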
> CI:
> org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest
> > serverRestartsAfterLocatorReconnects FAILED
> -----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: GEODE-6646
> URL: https://issues.apache.org/jira/browse/GEODE-6646
> Project: Geode
> Issue Type: Bug
> Components: gfsh, membership
> Affects Versions: 1.10.0
> Reporter: Shelley Lynn Hughes-Godfrey
> Priority: Major
> Labels: CI
>
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/DistributedTestOpenJDK8/builds/617
> {noformat}
> org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest
> > serverRestartsAfterLocatorReconnects FAILED
> org.awaitility.core.ConditionTimeoutException: Assertion condition
> defined as a lambda expression in
> org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest
>
> Expecting:
> <["locator-0", "server-2", "server-3"]>
> to contain only:
> <["locator-0", "server-1", "server-2", "server-3"]>
> but could not find the following elements:
> <["server-1"]>
> within 300 seconds.
> at
> org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:145)
> at
> org.awaitility.core.AssertionCondition.await(AssertionCondition.java:122)
> at
> org.awaitility.core.AssertionCondition.await(AssertionCondition.java:32)
> at
> org.awaitility.core.ConditionFactory.until(ConditionFactory.java:902)
> at
> org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:723)
> at
> org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest.serverRestartsAfterLocatorReconnects(ClusterConfigLocatorRestartDUnitTest.java:81)
> Caused by:
> java.lang.AssertionError:
> Expecting:
> <["locator-0", "server-2", "server-3"]>
> to contain only:
> <["locator-0", "server-1", "server-2", "server-3"]>
> but could not find the following elements:
> <["server-1"]>
> at
> org.apache.geode.test.junit.assertions.CommandResultAssert.tableHasColumnOnlyWithValues(CommandResultAssert.java:308)
> at
> org.apache.geode.management.internal.configuration.ClusterConfigLocatorRestartDUnitTest.lambda$serverRestartsAfterLocatorReconnects$0(ClusterConfigLocatorRestartDUnitTest.java:82)
> {noformat}
> Artifacts available here:
> {noformat}
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-= Test Results URI
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> http://files.apachegeode-ci.info/builds/apache-develop-main/1.10.0-SNAPSHOT.0177/test-results/distributedTest/1555101232/
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Test report artifacts from this job are available at:
> http://files.apachegeode-ci.info/builds/apache-develop-main/1.10.0-SNAPSHOT.0177/test-artifacts/1555101232/distributedtestfiles-OpenJDK8-1.10.0-SNAPSHOT.0177.tgz
> {noformat}