[ https://issues.apache.org/jira/browse/GEODE-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632629#comment-16632629 ]
Brian Rowe commented on GEODE-5703:
-----------------------------------

So, I've spent far too long on this for how rare the failure is and am going to put this down to focus on more pressing issues. Some observations from my debugging so far:

This failure seems to occur when this test (or similar tests, such as when run as part of a full AlterRuntimeCommandDUnitTest run) is run multiple times in a particular set of VMs. I was able to reproduce this by repeatedly running the test in IntelliJ (it took several hundred test runs), but was unable to get it to fail in an isolated run of a single instance (with over 20,000 attempts).

The failure is due to being unable to connect to the locator, which can be seen either when starting a server or in the locator startup itself (when joining the distributed system immediately after starting the locator).

In the specific case where I was able to add some additional logging and drill down into this (which was a locator startup failure), what I saw was that GMSMembershipManager.join got back false when invoking GMSJoinLeave.join. This in turn was caused by GMSJoinLeave.findCoordinator returning false, which was caused by tcpClientWrapper.sendCoordinatorFindRequest throwing an IOException containing "Unknown header byte: 0". That exception string only appears in InternalDataSerializer.basicReadObject, which indicates that the exception was thrown when reading the request object, or an object nested within the request. Either way, the server had successfully read a gossip version and an object header byte off the stream before encountering an invalid 0 byte (so the entire stream couldn't have been corrupted).

Looking through the serialization code, there were no obvious places where we could have written this incorrectly. Because the server-side socket is visible to only a single thread (we believe), it seems unlikely that the stream is being corrupted on the server side (e.g. by two server-side threads reading from the same socket concurrently). Perhaps the client is actually writing a 0 in the wrong place? But the client, too, creates a single-use socket, visible only to one thread (we believe), so contention there would be unlikely. That doesn't rule out simply broken single-threaded logic.

Of course, the fact that this only shows up in repeated runs on a particular VM suggests there's some deeper issue contributing to this as well.

> CI Failure: AlterRuntimeCommandDUnitTest > alterStatArchiveFileWithMember_updatesSelectedServerConfigs(true)
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-5703
>                 URL: https://issues.apache.org/jira/browse/GEODE-5703
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Helena Bales
>            Assignee: Brian Rowe
>            Priority: Major
>              Labels: swat
>
> CI Failure can be found here:
> https://concourse.apachegeode-ci.info/teams/main/pipelines/develop/jobs/DistributedTest/builds/363
> Failed with Stack Trace:
> {code:java}
> org.apache.geode.management.internal.cli.commands.AlterRuntimeCommandDUnitTest > alterStatArchiveFileWithMember_updatesSelectedServerConfigs(true) [0] FAILED
>     org.apache.geode.test.dunit.RMIException: While invoking org.apache.geode.test.dunit.rules.ClusterStartupRule$$Lambda$41/1318057984.call in VM 0 running on Host fca019e1fb13 with 4 VMs
>         at org.apache.geode.test.dunit.VM.invoke(VM.java:450)
>         at org.apache.geode.test.dunit.VM.invoke(VM.java:419)
>         at org.apache.geode.test.dunit.VM.invoke(VM.java:385)
>         at org.apache.geode.test.dunit.rules.ClusterStartupRule.startLocatorVM(ClusterStartupRule.java:198)
>         at org.apache.geode.test.dunit.rules.ClusterStartupRule.startLocatorVM(ClusterStartupRule.java:191)
>         at org.apache.geode.management.internal.cli.commands.AlterRuntimeCommandDUnitTest.alterStatArchiveFileWithMember_updatesSelectedServerConfigs(AlterRuntimeCommandDUnitTest.java:466)
>
>         Caused by:
>         org.apache.geode.GemFireConfigException: Unable to join the distributed system. Operation either timed out, was stopped or Locator does not exist.
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)