[jira] [Commented] (CASSANDRA-5426) Redesign repair messages

Yuki Morishita (JIRA) Fri, 31 May 2013 10:03:23 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-5426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13671647#comment-13671647
 ]


Yuki Morishita commented on CASSANDRA-5426:
-------------------------------------------

Pushed update to: https://github.com/yukim/cassandra/commits/5426-3

Removed all classes that were kept for backward compatibility.

bq. One thing I'm not sure of is that it seems that when we get an error, we 
log it but we don't error out the repair session itself. Maybe we should, 
otherwise I fear most people won't notice something went wrong.
bq. Also, when we fail, maybe we could send an error message (typically the 
exception message) for easier debugging/reporting.

The latest version notifies the user by throwing an exception, which is 
recorded in RepairSession#exception when the error occurs. Sending the exception 
back to the coordinator could be useful, but I'd rather take a different 
approach that uses the tracing CF (CASSANDRA-5483).

bq. I also wonder if maybe we should have more of a fail-fast policy when there 
are errors. For instance, if one node fails its validation phase, maybe it might 
be worth failing right away and let the user re-trigger a repair once he has 
fixed whatever was the source of the error, rather than still 
differencing/syncing the other nodes (but I admit that both solutions are 
possible).

I changed the repair session to fail when an error occurs, but I think it 
would be better to have a repair option (something like -k, --keep-going) that 
keeps the repair running and reports the failed sessions/jobs at the end. If you 
+1, I will do that in a separate ticket.
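
To make the two policies concrete, here is a minimal sketch. It is not code 
from the branch: RepairRunSketch, its run method and the keepGoing flag are 
hypothetical names, and each repair job is reduced to a plain Runnable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; none of these names exist in the 5426-3 branch.
final class RepairRunSketch {
    /**
     * Runs the jobs in the map's iteration order.
     * keepGoing == false: fail fast, rethrowing the first error.
     * keepGoing == true: run every job and return the names of those that failed.
     */
    static List<String> run(Map<String, Runnable> jobs, boolean keepGoing) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Runnable> job : jobs.entrySet()) {
            try {
                job.getValue().run();
            } catch (RuntimeException e) {
                if (!keepGoing) {
                    throw e; // fail fast: the remaining jobs never start
                }
                failed.add(job.getKey()); // remember the failure, report it at the end
            }
        }
        return failed;
    }
}
```

With keepGoing set, the caller gets the list of failed jobs once everything has 
run; without it, the first error ends the session, which is the behaviour the 
current patch implements.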

bq. Going a bit further, I think we should add 2 messages to interrupt the 
validation and sync phase. If only because that could be useful to users if 
they need to stop a repair for some reason, but also, if we get an error during 
validation from one node, we could use that to interrupt the other nodes and 
thus fail fast while minimizing the amount of work done uselessly. But anyway, 
I guess that part can be done in a follow up ticket.

+1 on doing this in a separate ticket. We also need to add a way to abort 
streaming in order to interrupt syncing.

bq. In RepairMessageType, if gossip is any proof, then it could be wise to add 
more "FUTURE" types, say 4 or 5 "just in case".
bq. Do we really need RepairMessageHeader? What about making RepairMessage a 
RepairJobDesc, a RepairMessageType and a body, rather than creating yet another 
class?

For messages, I mimicked the way o.a.c.transport.messages does it.

bq. For the hashCode methods (Differencer, NodePair, RepairJobDesc,...), I'd 
prefer using guava's Objects.hashCode() (and Objects.equal() for equals() when 
there is null).

Done, if I didn't miss anything.
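
For illustration, the pattern looks roughly like the sketch below. It is not 
the code from the branch: the field names are assumptions, and it uses 
java.util.Objects from the JDK as a stand-in for Guava's Objects.hashCode() and 
Objects.equal() so that it compiles without extra dependencies.

```java
import java.util.Objects;

// Hypothetical stand-in for a repair value class; field names are assumptions.
final class NodePair {
    final String endpoint1; // either endpoint may be null in this sketch
    final String endpoint2;

    NodePair(String endpoint1, String endpoint2) {
        this.endpoint1 = endpoint1;
        this.endpoint2 = endpoint2;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof NodePair)) return false;
        NodePair that = (NodePair) o;
        // Objects.equals is null-safe, playing the role of Guava's Objects.equal
        return Objects.equals(endpoint1, that.endpoint1)
            && Objects.equals(endpoint2, that.endpoint2);
    }

    @Override
    public int hashCode() {
        // Objects.hash plays the role of Guava's Objects.hashCode(Object...)
        return Objects.hash(endpoint1, endpoint2);
    }
}
```

Two pairs with the same endpoints compare equal and hash identically, even 
when one of the fields is null.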

bq. I would move the gossiper/failure registration in ARS.addToActiveSessions.

Done.

bq. I'd remove Validator.rangeToValidate and just inline desc.range.

Done.

bq. Out of curiosity, what do you mean by the TODO in the comment of 
Validator.add()?

That comment was from an ancient version. Removed, since it is no longer applicable.

bq. For MerkleTree.fullRange, maybe it's time to add it to the MT serializer 
rather than restoring it manually, which is ugly and error prone. Also, for the 
partitioner, let's maybe have MT use DatabaseDescriptor.getPartitioner() 
directly rather than restoring them manually in Differencer.run().

Yup, this is a good time to finally clean up MerkleTree serialization. Done.

                
> Redesign repair messages
> ------------------------
>
>                 Key: CASSANDRA-5426
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5426
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Yuki Morishita
>            Assignee: Yuki Morishita
>            Priority: Minor
>              Labels: repair
>             Fix For: 2.0
>
>
> Many people have been reporting 'repair hang' when something goes wrong.
> Two major causes of hangs are 1) validation failure and 2) streaming failure.
> Currently, when those failures happen, the failed node does not respond back 
> to the repair initiator.
> The goal of this ticket is to redesign message flows around repair so that 
> repair never hangs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
