[
https://issues.apache.org/jira/browse/CASSANDRA-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738975#comment-14738975
]
Yuki Morishita edited comment on CASSANDRA-10288 at 9/10/15 3:56 PM:
---------------------------------------------------------------------
Patch here: [2.1|https://github.com/yukim/cassandra/tree/10288-2.1]
[2.2|https://github.com/yukim/cassandra/tree/10288-2.2]
[3.0|https://github.com/yukim/cassandra/tree/10288-3.0]
[testall|http://cassci.datastax.com/job/yukim-10288-2.1-testall/] and
[dtest|http://cassci.datastax.com/job/yukim-10288-2.1-dtest/].
I added a live node check before sending the prepare and anticompaction messages.
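The idea can be sketched as follows (a hypothetical illustration, assuming a simplified failure-detector lookup; {{LiveNodeCheck}}, {{isAlive}}, and {{firstDeadNeighbor}} are made-up names, not Cassandra's actual API): fail fast when any neighbor replica is reported dead, instead of sending prepare and hanging on a node that will never reply.

{code}
import java.util.*;

public class LiveNodeCheck {
    // Simulated failure-detector state: endpoint -> liveness.
    static final Map<String, Boolean> liveness = new HashMap<>();

    static boolean isAlive(String endpoint) {
        return liveness.getOrDefault(endpoint, false);
    }

    // Returns the first dead neighbor, or null if all are up.
    static String firstDeadNeighbor(Collection<String> neighbors) {
        for (String n : neighbors) {
            if (!isAlive(n)) {
                return n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        liveness.put("/127.0.0.1", true);
        liveness.put("/127.0.0.2", false); // the node that is down
        liveness.put("/127.0.0.3", true);

        String dead = firstDeadNeighbor(new TreeSet<>(liveness.keySet()));
        if (dead != null) {
            // Mirrors the 2.1 behaviour: abort the session up front.
            System.out.println("Cannot proceed on repair because a neighbor ("
                    + dead + ") is dead");
        } else {
            System.out.println("All neighbors up; sending prepare");
        }
    }
}
{code}

With the check in place, the 2.2 case would fail immediately like 2.1 does, rather than hang.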
bq. There should probably be a timeout-and-abort path as well.
There are timeouts, actually, but they are set long (1 hour / 1 day).
I feel we need to periodically ping the repair status of all replicas, since
a fixed timeout is not suitable for every environment, and messages can still be lost.
I will create a ticket for this.
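As a minimal sketch of what a timeout-and-abort path looks like (assuming the coordinator waits on a future for the remote reply; {{PrepareTimeout}} and {{awaitPrepare}} are illustrative names, not Cassandra's actual API), the wait is simply bounded so that a lost message aborts the session instead of hanging forever:

{code}
import java.util.concurrent.*;

public class PrepareTimeout {
    // Wait a bounded time for the remote prepare response; abort on timeout
    // rather than blocking indefinitely when the reply never arrives.
    static String awaitPrepare(CompletableFuture<String> response,
                               long timeout, TimeUnit unit) {
        try {
            return response.get(timeout, unit);
        } catch (TimeoutException e) {
            return "ABORTED: prepare timed out";
        } catch (InterruptedException | ExecutionException e) {
            return "FAILED: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Simulate a reply that never comes (e.g. the replica died
        // after the session started).
        CompletableFuture<String> never = new CompletableFuture<>();
        System.out.println(awaitPrepare(never, 100, TimeUnit.MILLISECONDS));
    }
}
{code}

The periodic status ping mentioned above would complement this: even with a generous timeout, a coordinator that polls each replica's session state can notice a dead or forgetful replica long before the timeout fires.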
> Incremental repair can hang if replicas aren't all up (was: Inconsistent
> behaviours on repair when a node in RF is missing)
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-10288
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10288
> Project: Cassandra
> Issue Type: Bug
> Reporter: Alan Boudreault
> Assignee: Yuki Morishita
> Fix For: 2.1.x
>
> Attachments: repait_test.sh
>
>
> So with a cluster of 3 nodes and RF=3 for my keyspace, I tried to repair my
> data with a single node down. I got 3 different behaviours depending on the C*
> version:
> cassandra-2.1: it fails saying a node is down. (acceptable)
> cassandra-2.2: it hangs forever (???)
> cassandra-3.0: it completes successfully
> What is the correct behaviour for this repair use case? Obviously,
> cassandra-2.2 has to be fixed, too.
> Here are the result logs when testing:
> cassandra-2.1
> {code}
> ccmlib.node.NodetoolError: Nodetool command
> '/home/aboudreault/git/cstar/cassandra/bin/nodetool -h localhost -p 7100
> repair test test' failed; exit status: 2; stdout: [2015-09-08 16:32:24,488]
> Starting repair command #3, repairing 3 ranges for keyspace test
> (parallelism=SEQUENTIAL, full=true)
> [2015-09-08 16:32:24,492] Repair session b69b5990-5668-11e5-b4ae-b3ffbc47f04c
> for range (3074457345618258602,-9223372036854775808] failed with error
> java.io.IOException: Cannot proceed on repair because a neighbor (/127.0.0.2)
> is dead: session failed
> [2015-09-08 16:32:24,493] Repair session b69b80a0-5668-11e5-b4ae-b3ffbc47f04c
> for range (-9223372036854775808,-3074457345618258603] failed with error
> java.io.IOException: Cannot proceed on repair because a neighbor (/127.0.0.2)
> is dead: session failed
> [2015-09-08 16:32:24,494] Repair session b69ba7b0-5668-11e5-b4ae-b3ffbc47f04c
> for range (-3074457345618258603,3074457345618258602] failed with error
> java.io.IOException: Cannot proceed on repair because a neighbor (/127.0.0.2)
> is dead: session failed
> [2015-09-08 16:32:24,494] Repair command #3 finished
> ; stderr: error: nodetool failed, check server logs
> -- StackTrace --
> java.lang.RuntimeException: nodetool failed, check server logs
> at
> org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:291)
> at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:203)
> {code}
> cassandra-2.2:
> {code}
> just hangs .... waited more than 10 minutes.
> {code}
> cassandra-3.0:
> {code}
> $ ccm node1 nodetool repair test test
> [2015-09-08 16:39:40,139] Starting repair command #1, repairing keyspace test
> with repair options (parallelism: parallel, primary range: false,
> incremental: true, job threads: 1, ColumnFamilies: [test], dataCenters: [],
> hosts: [], # of ranges: 2)
> [2015-09-08 16:39:40,241] Repair session ba4a1440-5669-11e5-bc8e-b3ffbc47f04c
> for range [(3074457345618258602,-9223372036854775808],
> (-9223372036854775808,3074457345618258602]] finished (progress: 80%)
> [2015-09-08 16:39:40,267] Repair completed successfully
> [2015-09-08 16:39:40,270] Repair command #1 finished in 0 seconds
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)