[ 
https://issues.apache.org/jira/browse/GEODE-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632629#comment-16632629
 ] 

Brian Rowe commented on GEODE-5703:
-----------------------------------

So, I've spent far too long on this for how rare the failure is and am going to 
put this down to focus on more pressing issues.  Some observations from my 
debugging so far:

This failure seems to occur when this test (or a similar test, such as one run 
as part of a full AlterRuntimeCommandDUnitTest run) is run multiple times in the 
same set of VMs.  I was able to reproduce it by repeatedly running the test in 
IntelliJ (it took several hundred test runs), but was unable to get it to fail 
when running a single instance in isolation (over 20,000 attempts).

The failure is due to being unable to connect to the locator, which can be seen 
either when starting a server, or in the locator startup itself (when joining 
the distributed system immediately after starting the locator).  In the 
specific case where I was able to add some additional logging and drill down 
into this (which was a locator startup failure), what I specifically saw was 
that the GMSMembershipManager.join got back a false when invoking 
GMSJoinLeave.join.  This in turn was caused by the GMSJoinLeave.findCoordinator 
returning false, which was caused by 
tcpClientWrapper.sendCoordinatorFindRequest throwing an IOException containing 
"Unknown header byte: 0".  That exception string appears only in 
InternalDataSerializer.basicReadObject, which indicates that the exception was 
thrown while reading the request object or an object nested within it.  Either 
way, the server had successfully read a gossip version and an object 
header byte off the stream before encountering an invalid 0 byte (so the entire 
stream couldn't have been corrupted).  Looking through the serialization code, 
there were no obvious places we could have written this incorrectly.
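
To illustrate the shape of that failure, here is a minimal sketch of a reader 
that dispatches on a header byte. This is not Geode's code: the class name, the 
type codes and the wire layout are all made up for illustration. It only shows 
how a stream can yield a valid version and a valid outer header byte and still 
fail on a 0 where a nested object's header is expected.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical sketch; names and type codes are NOT Geode's actual ones.
public class HeaderByteSketch {
    public static final byte REQUEST = 1; // assumed code for a request object
    public static final byte STRING = 2;  // assumed code for a UTF string

    // Reads one object: a header byte followed by a type-specific payload.
    public static Object readObject(DataInputStream in) throws IOException {
        byte header = in.readByte();
        switch (header) {
            case REQUEST:
                // A request nests another object, which is read recursively.
                return "Request(" + readObject(in) + ")";
            case STRING:
                return in.readUTF();
            default:
                throw new IOException("Unknown header byte: " + header);
        }
    }

    // Reads a version short, then one object. Returns the decoded value,
    // or the exception message if decoding fails.
    public static String readMessage(byte[] wire) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(wire))) {
            in.readShort(); // version is consumed successfully
            return String.valueOf(readObject(in));
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // version, REQUEST header, then a stray 0 instead of a nested header
        byte[] bad = {0, 1, REQUEST, 0};
        // version, REQUEST header, nested STRING with length 2 and "ok"
        byte[] good = {0, 1, REQUEST, STRING, 0, 2, 'o', 'k'};
        System.out.println(readMessage(bad));
        System.out.println(readMessage(good));
    }
}
```

In this sketch the first two reads succeed on the bad input, so the error only 
tells us that the byte at the nested position was 0, not who put it there.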

Because the server-side socket is visible to only a single thread (we believe), 
it seems unlikely that the stream is being corrupted on the server-side (e.g. 
by two server-side threads reading from the same socket concurrently). Perhaps 
the client is actually writing a 0 in the wrong place? But the client, too, 
creates a single-use socket, visible only to one thread (we believe), so 
contention there would be unlikely as well. That doesn't rule out a plain bug 
in single-threaded logic, though.

Of course, the fact that this only shows up in repeated runs on a particular VM 
suggests there's some deeper issue contributing to this as well.

> CI Failure: AlterRuntimeCommandDUnitTest > 
> alterStatArchiveFileWithMember_updatesSelectedServerConfigs(true)
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-5703
>                 URL: https://issues.apache.org/jira/browse/GEODE-5703
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Helena Bales
>            Assignee: Brian Rowe
>            Priority: Major
>              Labels: swat
>
> CI Failure can be found here:
> https://concourse.apachegeode-ci.info/teams/main/pipelines/develop/jobs/DistributedTest/builds/363
> Failed with Stack Trace:
> {code:java}
> org.apache.geode.management.internal.cli.commands.AlterRuntimeCommandDUnitTest
>  > alterStatArchiveFileWithMember_updatesSelectedServerConfigs(true) [0] 
> FAILED
>       
>     org.apache.geode.test.dunit.RMIException: While invoking 
> org.apache.geode.test.dunit.rules.ClusterStartupRule$$Lambda$41/1318057984.call
>  in VM 0 running on Host fca019e1fb13 with 4 VMs
>       
>         at org.apache.geode.test.dunit.VM.invoke(VM.java:450)
>       
>         at org.apache.geode.test.dunit.VM.invoke(VM.java:419)
>       
>         at org.apache.geode.test.dunit.VM.invoke(VM.java:385)
>       
>         at 
> org.apache.geode.test.dunit.rules.ClusterStartupRule.startLocatorVM(ClusterStartupRule.java:198)
>       
>         at 
> org.apache.geode.test.dunit.rules.ClusterStartupRule.startLocatorVM(ClusterStartupRule.java:191)
>       
>         at 
> org.apache.geode.management.internal.cli.commands.AlterRuntimeCommandDUnitTest.alterStatArchiveFileWithMember_updatesSelectedServerConfigs(AlterRuntimeCommandDUnitTest.java:466)
>       
>       
>         Caused by:
>       
>         org.apache.geode.GemFireConfigException: Unable to join the 
> distributed system.  Operation either timed out, was stopped or Locator does 
> not exist.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
