[jira] [Commented] (CASSANDRA-10844) failed_bootstrap_wiped_node_can_join_test is failing

Joel Knighton (JIRA) Wed, 30 Dec 2015 14:10:40 -0800

    [ https://issues.apache.org/jira/browse/CASSANDRA-10844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075465#comment-15075465 ]


Joel Knighton commented on CASSANDRA-10844:
-------------------------------------------

In [CASSANDRA-7069], we added consistent range movement to prevent concurrent 
bootstraps/decommissions.

We do this in the {{checkForEndpointCollision}} method.

In [CASSANDRA-7939], we observed that this prevented immediately retrying a 
failed bootstrap. To avoid this, we switched to checking whether a node is a 
fat client; in that situation we skip {{isSafeForBootstrap}} and instead 
iterate over all endpoint states, checking whether any endpoint is in 
STATUS_LEAVING/STATUS_MOVING/STATUS_BOOTSTRAPPING.

However, this didn't solve the problem if a node had reached the point of 
setting this status before failing its bootstrap.

In [CASSANDRA-8494], this deficiency was noticed while adding resumable 
bootstrapping, and a line was added in 
[this commit|https://github.com/yukim/cassandra/commit/5f7fd497ae83f813078d56ba1b61f7ea322e5d5a] 
to ignore this gossip state for a fat client with the same broadcastAddress as 
the bootstrapping node. Since resumable bootstrapping went into 2.2+ only, 
this explains why this test is failing only on 2.1 (since we aren't ignoring 
the fat client gossip entry for our previous failed bootstrap).
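
To make the 2.1 vs. 2.2 difference concrete, here is a minimal, self-contained sketch of the check as described above. It is not the actual {{StorageService}} code: the class name, method name, the {{ignoreOwnEntry}} flag and the status strings are all assumptions made for illustration, and gossip state is modelled as a plain map from broadcast address to status.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Simplified model of the range-movement check; NOT the real StorageService code. */
public class EndpointCollisionSketch {

    // Statuses that indicate a range movement in progress.
    // The string values are placeholders chosen for this sketch.
    static final Set<String> RANGE_MOVEMENT_STATES =
            new HashSet<>(Arrays.asList("BOOT", "LEAVING", "MOVING"));

    /**
     * Returns true if any endpoint in gossip is bootstrapping, leaving or moving.
     *
     * @param ignoreOwnEntry hypothetical switch: true models the 2.2+ behaviour
     *                       (skip the entry whose address equals our own
     *                       broadcast address), false models 2.1.
     */
    static boolean otherRangeMovementDetected(Map<String, String> statusByEndpoint,
                                              String ownBroadcastAddress,
                                              boolean ignoreOwnEntry) {
        for (Map.Entry<String, String> entry : statusByEndpoint.entrySet()) {
            // 2.2+ only: a leftover entry from our own failed bootstrap is skipped.
            if (ignoreOwnEntry && entry.getKey().equals(ownBroadcastAddress))
                continue;
            if (RANGE_MOVEMENT_STATES.contains(entry.getValue()))
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // node2 (127.0.0.2) was killed after announcing that it was bootstrapping,
        // so its stale entry is still present in node1's gossip state.
        Map<String, String> gossip = new LinkedHashMap<>();
        gossip.put("127.0.0.1", "NORMAL");
        gossip.put("127.0.0.2", "BOOT");

        // 2.1-style check: the stale entry blocks the retry.
        System.out.println(otherRangeMovementDetected(gossip, "127.0.0.2", false));
        // 2.2-style check: our own stale entry is ignored, so the retry may proceed.
        System.out.println(otherRangeMovementDetected(gossip, "127.0.0.2", true));
    }
}
```

Under these assumptions the first call reports a conflicting range movement (the situation the failing test hits on 2.1), while the second does not. A bootstrapping entry for any other address would still block the join in both cases.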

This failing test was added in [CASSANDRA-9765], which addressed deficiencies 
in {{checkForEndpointCollision}}. 

The consensus on 9765 was that bootstrapping is a safe state when checking for 
endpoint collisions (deferring to 7939).

I think the best fix here is to backport the bootstrapping broadcastAddress 
check from 2.2 - what do you think [~Stefania]? Do you recall seeing a 
different behavior for this test on 2.1?



> failed_bootstrap_wiped_node_can_join_test is failing
> ----------------------------------------------------
>
>                 Key: CASSANDRA-10844
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10844
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Streaming and Messaging, Testing
>            Reporter: Philip Thompson
>             Fix For: 2.1.x
>
>         Attachments: node1.log, node2.log
>
>
> {{bootstrap_test.TestBootstrap.failed_bootstap_wiped_node_can_join_test}} is 
> failing on 2.1-head. The second node fails to join the cluster. I see a lot 
> of exceptions in node1's log, such as 
> {code}
> ERROR [STREAM-OUT-/127.0.0.2] 2015-12-11 12:06:13,778 StreamSession.java:505 - [Stream #7b5ec5a0-a029-11e5-bad9-ffd0922f40e6] Streaming error occurred
> java.io.IOException: Broken pipe
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_51]
>         at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_51]
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_51]
>         at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_51]
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_51]
>         at org.apache.cassandra.io.util.DataOutputStreamAndChannel.write(DataOutputStreamAndChannel.java:48) ~[main/:na]
>         at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:44) ~[main/:na]
>         at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:351) [main/:na]
>         at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:331) [main/:na]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]
> {code}
> These seem consistent with node2 being killed, which makes the bootstrap 
> fail. But then, when node2 is restarted, it does not join. It *looks* like 
> it fails to rejoin because of a false positive in checking the 2 minute rule.
> {code}
> ERROR [main] 2015-12-11 12:06:17,954 CassandraDaemon.java:579 - Exception encountered during startup
> java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
>         at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:559) ~[main/:na]
>         at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:789) ~[main/:na]
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:721) ~[main/:na]
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612) ~[main/:na]
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:387) [main/:na]
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:562) [main/:na]
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:651) [main/:na]
> {code}
> This fails consistently locally and on cassci. Logs attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
