[ https://issues.apache.org/jira/browse/CASSANDRA-12008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408662#comment-15408662 ]

Paulo Motta commented on CASSANDRA-12008:
-----------------------------------------

Thanks for the update, [~kdmu]. Great job! I think the patch is ready to be 
committed; can you double check, [~yukim]?

In order to prepare for commit, squash and rebase to latest trunk, add an entry 
at the top of {{CHANGES.txt}} (example below) and set the commit message to the 
following format:
{noformat}
Make decommission operations resumable

patch by Kaide Mu; reviewed by Paulo Motta for CASSANDRA-12008
{noformat}
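
The {{CHANGES.txt}} entry is just a one-line summary at the top of the current 
trunk section, something along these lines (exact wording up to you):
{noformat}
 * Make decommission operations resumable (CASSANDRA-12008)
{noformat}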

minor nit: typo in {{transfereedRangePerKeyspace}} (should be 
{{transferredRangesPerKeyspace}}).

I submitted a CI and multiplexer run with the current tests:
||trunk||dtest||
|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-12008]|[branch|https://github.com/riptano/cassandra-dtest/compare/master...pauloricardomg:12008]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-12008-testall/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-12008-dtest/lastCompletedBuild/testReport/]|
|[multiplexer 100x|https://cassci.datastax.com/view/Parameterized/job/parameterized_dtest_multiplexer/215/]| |

The dtest is nearly done but needs a bit more work (a rough sketch of the 
suggestions below follows the list):

* Changing the CL to TWO is not sufficient since the data will still be present 
on one of the remaining nodes; after the decommission you need to stop one of 
the nodes and then run the query (see {{bootstrap_test.py}} for reference).
* Set {{stream_throughput_outbound_megabits_per_sec=1}} on 
{{simple_decommission_test}} and call {{node2.watch_log_for('DECOMMISSIONING')}} 
before starting the second decommission to avoid the race described in 
CASSANDRA-11687.
* Make the skipped-range check more specific; you can probably use wildcards in 
{{grep_log_for}}, something like {{"Skipping transferred range .* of keyspace 
keyspace1, endpoint /127.0.0.3"}}.
* Check that {{Error while decommissioning node}} is being printed on node2 in 
{{resumable_decommission_test}}.
* Move the tests to {{topology_test.py}} since decommission tests are present 
there (there is already a {{simple_decommission_test}}, so give yours a 
different name).
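
To make the above more concrete, here is a rough sketch of how those pieces 
could fit together in the resumed-decommission dtest. It assumes the usual 
cassandra-dtest/ccm APIs ({{Tester}}, {{patient_cql_connection}}, 
{{watch_log_for}}, {{grep_log}}); the node numbering, the 
{{keyspace1.standard1}} table and the way the first decommission gets 
interrupted are illustrative only, not a drop-in test:
{code}
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

from dtest import Tester


class TestResumableDecommission(Tester):

    def resumable_decommission_test(self):
        cluster = self.cluster
        # Throttle streaming so the first decommission is slow enough to be
        # interrupted reliably (avoids the race from CASSANDRA-11687).
        cluster.set_configuration_options(
            values={'stream_throughput_outbound_megabits_per_sec': 1})
        cluster.populate(3).start(wait_for_binary_proto=True)
        node1, node2, node3 = cluster.nodelist()

        # ... load data (e.g. with stress into keyspace1.standard1, RF=2),
        # start decommissioning node2 and interrupt the stream mid-way ...

        # The first attempt must have started leaving and then failed.
        node2.watch_log_for('DECOMMISSIONING')
        self.assertTrue(node2.grep_log('Error while decommissioning node'))

        # The second (resumed) decommission should skip the ranges that were
        # already transferred during the first attempt.
        node2.nodetool('decommission')
        self.assertTrue(node2.grep_log(
            'Skipping transferred range .* of keyspace keyspace1, '
            'endpoint /127.0.0.3'))

        # CL=TWO alone does not prove the data was moved: stop one of the
        # remaining replicas and query the other, as in bootstrap_test.py.
        node3.stop(wait_other_notice=True)
        session = self.patient_cql_connection(node1)
        stmt = SimpleStatement('SELECT * FROM keyspace1.standard1',
                               consistency_level=ConsistencyLevel.ONE)
        self.assertTrue(len(list(session.execute(stmt))) > 0)
{code}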

After those are addressed you can go ahead and submit the pull request for the 
dtests and post the link here. Thanks!

> Make decommission operations resumable
> --------------------------------------
>
>                 Key: CASSANDRA-12008
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12008
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Streaming and Messaging
>            Reporter: Tom van der Woerdt
>            Assignee: Kaide Mu
>            Priority: Minor
>
> We're dealing with large data sets (multiple terabytes per node) and 
> sometimes we need to add or remove nodes. These operations are very dependent 
> on the entire cluster being up, so while we're joining a new node (which 
> sometimes takes 6 hours or longer) a lot can go wrong and in a lot of cases 
> something does.
> It would be great if the ability to retry streams was implemented.
> Example to illustrate the problem :
> {code}
> 03:18 PM   ~ $ nodetool decommission
> error: Stream failed
> -- StackTrace --
> org.apache.cassandra.streaming.StreamException: Stream failed
>         at 
> org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85)
>         at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
>         at 
> com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
>         at 
> com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
>         at 
> com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
>         at 
> com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
>         at 
> org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:210)
>         at 
> org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:186)
>         at 
> org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:430)
>         at 
> org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:622)
>         at 
> org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:486)
>         at 
> org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:274)
>         at java.lang.Thread.run(Thread.java:745)
> 08:04 PM   ~ $ nodetool decommission
> nodetool: Unsupported operation: Node in LEAVING state; wait for status to 
> become normal or restart
> See 'nodetool help' or 'nodetool help <command>'.
> {code}
> Streaming failed, probably due to load :
> {code}
> ERROR [STREAM-IN-/<ipaddr>] 2016-06-14 18:05:47,275 StreamSession.java:520 - 
> [Stream #<streamid>] Streaming error occurred
> java.net.SocketTimeoutException: null
>         at 
> sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:211) 
> ~[na:1.8.0_77]
>         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) 
> ~[na:1.8.0_77]
>         at 
> java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) 
> ~[na:1.8.0_77]
>         at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:54)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>         at 
> org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:268)
>  ~[apache-cassandra-3.0.6.jar:3.0.6]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
> {code}
> If implementing retries is not possible, can we have a 'nodetool decommission 
> resume'?


