[ 
https://issues.apache.org/jira/browse/CASSANDRA-12008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406203#comment-15406203
 ] 

Kaide Mu commented on CASSANDRA-12008:
--------------------------------------

New dtests patch: 
https://github.com/riptano/cassandra-dtest/compare/master...kdmu:cassandra-12008?expand=1
New implementation patch: 
https://github.com/apache/cassandra/compare/trunk...kdmu:trunk-12008?expand=1

bq. Error while decommissioning node is never printed because the 
ExecutionException is being wrapped in a RuntimeException on unbootstrap, so 
perhaps you can modify unbootstrap to throw ExecutionException | 
InterruptedException and catch that on decomission to wrap in RuntimeException.
Now ExecutionException | InterruptedException is handled directly by 
unbootstrap.

bq. When verifying if the retrieved data is correct on 
resumable_decommission_test, you need to stop either node1 or node3 when 
querying the other otherwise the data may be in only one of these nodes (while 
it must be in both nodes, since RF=2 and N=2).
Instead of stopping and starting nodes, I changed stress read with a CL=TWO, in 
this way I guess we can ensure that node1 and node3 is "replying" to the 
request. Also if we do stop and restart node, it seems the restarted node will 
raise an error due to it is looking on node2 log that restarting node is alive 
which such operation is not possible since node2 is decommissioned.

bq. Instead of counting for decommission_error you can add a self.fail("second 
rebuild should fail") after node2.nodetool('decommission') and on the except 
part perhaps check that the following message is being print on logs Error 
while decommissioning node
I guess I'll use insetead  assertRaises which seems more suitable to ensure 
NodetoolError is raised. WDYT [~pauloricardomg] [~yukim]?

> Make decommission operations resumable
> --------------------------------------
>
>                 Key: CASSANDRA-12008
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12008
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Streaming and Messaging
>            Reporter: Tom van der Woerdt
>            Assignee: Kaide Mu
>            Priority: Minor
>
> We're dealing with large data sets (multiple terabytes per node) and 
> sometimes we need to add or remove nodes. These operations are very dependent 
> on the entire cluster being up, so while we're joining a new node (which 
> sometimes takes 6 hours or longer) a lot can go wrong and in a lot of cases 
> something does.
> It would be great if the ability to retry streams was implemented.
> Example to illustrate the problem :
> {code}
> 03:18 PM   ~ $ nodetool decommission
> error: Stream failed
> -- StackTrace --
> org.apache.cassandra.streaming.StreamException: Stream failed
>         at 
> org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85)
>         at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
>         at 
> com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
>         at 
> com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
>         at 
> com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
>         at 
> com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
>         at 
> org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:210)
>         at 
> org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:186)
>         at 
> org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:430)
>         at 
> org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:622)
>         at 
> org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:486)
>         at 
> org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:274)
>         at java.lang.Thread.run(Thread.java:745)
> 08:04 PM   ~ $ nodetool decommission
> nodetool: Unsupported operation: Node in LEAVING state; wait for status to 
> become normal or restart
> See 'nodetool help' or 'nodetool help <command>'.
> {code}
> Streaming failed, probably due to load :
> {code}
> ERROR [STREAM-IN-/<ipaddr>] 2016-06-14 18:05:47,275 StreamSession.java:520 - 
> [Stream #<streamid>] Streaming error occurred
> java.net.SocketTimeoutException: null
>         at 
> sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:211) 
> ~[na:1.8.0_77]
>         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) 
> ~[na:1.8.0_77]
>         at 
> java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) 
> ~[na:1.8.0_77]
>         at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:54)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>         at 
> org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:268)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
> {code}
> If implementing retries is not possible, can we have a 'nodetool decommission 
> resume'?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to