[jira] [Commented] (CASSANDRA-10205) decommissioned_wiped_node_can_join_test fails on Jenkins

Stefania (JIRA) Fri, 28 Aug 2015 00:10:47 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14718135#comment-14718135 ]


Stefania commented on CASSANDRA-10205:
--------------------------------------

Thanks for this. I had to swap the populate and set_log_level lines; I 
committed that change directly to master so that I could run the test again 
and retrieve the trace-level log files of all nodes (attached). 

After node4 restarts, it sends the GOSSIP shadow round request, and the other 
nodes receive it and reply. However, node4 never sees the reply on Jenkins, 
whereas it does locally. My _best guess_ is that the socket to the previous 
process was not closed and the messages were sent to the wrong socket, although 
I don't understand why this would happen only on Jenkins boxes and not locally.

{{markDead}} is not called after a decommission because the status is LEFT. I 
think we should mark the node as dead if its state is still alive, regardless 
of whether it has left the ring; see [this C* 
patch|https://github.com/stef1927/cassandra/commits/10205-3.0].
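
To make the proposed rule concrete, here is a toy Python model of the decision. It is not the actual Gossiper code (which is Java) and it is not taken from the linked patch. The {{EndpointState}} class, its fields, and both function names are hypothetical; only the rule itself comes from the paragraph above (currently no {{markDead}} when the status is LEFT; proposed: mark dead whenever the state is still alive).

```python
from dataclasses import dataclass


@dataclass
class EndpointState:
    """Hypothetical stand-in for what a node knows about a peer."""
    status: str   # e.g. "NORMAL" or "LEFT"
    alive: bool   # whether this node still considers the peer alive


def marks_dead_current(state: EndpointState) -> bool:
    # Behaviour as described in the comment: a peer with status LEFT is skipped.
    return state.alive and state.status != "LEFT"


def marks_dead_proposed(state: EndpointState) -> bool:
    # Proposed rule: mark dead whenever the peer is still flagged alive,
    # regardless of whether it has left the ring.
    return state.alive


# A decommissioned node that the other nodes still believe to be alive.
decommissioned = EndpointState(status="LEFT", alive=True)
```

Under this model, {{marks_dead_current(decommissioned)}} is false while {{marks_dead_proposed(decommissioned)}} is true, which is the difference the patch is meant to introduce.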

With that patch applied, we can then add {{wait_other_notice=True}} when 
stopping the node (see [this dtest 
patch|https://github.com/stef1927/cassandra-dtest/commits/10205]), which might 
fix the test. Without the C* patch, the stop will hang since "is now DOWN" 
will be missing from the log files.
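
The hang can be illustrated with a minimal sketch of "wait until the other nodes log that the peer is down". This is an assumption about the general mechanism, not the test tooling's actual implementation: {{wait_for_log_line}}, its parameters, and the sample log lines are hypothetical, and a timeout is added here so that the sketch fails fast instead of hanging.

```python
import re
import time


def wait_for_log_line(read_lines, pattern, timeout=5.0, poll_interval=0.1):
    """Poll read_lines() until some line matches pattern, or time out.

    read_lines is a hypothetical callable returning the log lines seen so far.
    Returns the first matching line; raises TimeoutError otherwise.
    """
    regex = re.compile(pattern)
    deadline = time.monotonic() + timeout
    while True:
        for line in read_lines():
            if regex.search(line):
                return line
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{pattern!r} not seen within {timeout}s")
        time.sleep(poll_interval)


# Made-up log contents for illustration only.
log_with_patch = ["INFO  [GossipStage:1] /127.0.0.4 is now DOWN"]
log_without_patch = ["INFO  [GossipStage:1] /127.0.0.4 state LEFT"]
```

With {{log_with_patch}} the wait returns immediately. With {{log_without_patch}} the pattern never appears, so the wait only ends when the timeout fires; without a timeout it would block indefinitely, which matches the hang described above.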

Would you be able to set up a Jenkins job to run my dtest patch on the patched 
C* branch so that we know for sure? Alternatively, if it is easier and you 
don't mind polluting the master branch, we can just add {{@require("10205")}} 
to the test and merge my dtest patch to master. 


> decommissioned_wiped_node_can_join_test fails on Jenkins
> --------------------------------------------------------
>
>                 Key: CASSANDRA-10205
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10205
>             Project: Cassandra
>          Issue Type: Test
>            Reporter: Stefania
>            Assignee: Stefania
>         Attachments: decommissioned_wiped_node_can_join_test.tar.gz
>
>
> This test passes locally but reliably fails on Jenkins. It seems that after 
> we restart node4, it is unable to gossip with the other nodes:
> {code}
> INFO  [HANDSHAKE-/127.0.0.2] 2015-08-27 06:50:42,778 OutboundTcpConnection.java:494 - Handshaking version with /127.0.0.2
> INFO  [HANDSHAKE-/127.0.0.1] 2015-08-27 06:50:42,778 OutboundTcpConnection.java:494 - Handshaking version with /127.0.0.1
> INFO  [HANDSHAKE-/127.0.0.3] 2015-08-27 06:50:42,778 OutboundTcpConnection.java:494 - Handshaking version with /127.0.0.3
> ERROR [main] 2015-08-27 06:51:13,785 CassandraDaemon.java:635 - Exception encountered during startup
> java.lang.RuntimeException: Unable to gossip with any seeds
>         at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1342) ~[main/:na]
>         at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:518) ~[main/:na]
>         at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:763) ~[main/:na]
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:687) ~[main/:na]
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:570) ~[main/:na]
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:320) [main/:na]
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516) [main/:na]
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:622) [main/:na]
> WARN  [StorageServiceShutdownHook] 2015-08-27 06:51:13,799 Gossiper.java:1453 - No local state or state is in silent shutdown, not announcing shutdown
> {code}
> Both the addresses and the port numbers of the seeds seem to be correct, so 
> I don't think the problem is the Amazon private addresses, but I might be 
> wrong. It's also worth noting that the node starts up without problems the 
> first time; the problem occurs only during a restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
