[ https://issues.apache.org/jira/browse/CASSANDRA-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154215#comment-15154215 ]

Paulo Motta commented on CASSANDRA-10070:
-----------------------------------------

bq. Another thing we should probably consider is whether or not multiple types 
of maintenance work should run simultaneously. If we need to add this 
constraint, should they use the same lock resources?

We could probably replace the single resource lock ('RepairResource-{dc}-{i}') 
with a global resource ('Global-{dc}-{i}') or mutually exclusive resources 
('CleanupAndRepairResource-{dc}-{i}') later if necessary. We'll probably only 
need some special care during upgrades when we introduce the new locks, but 
other than that I don't see any problem with renaming the resources later. Do 
you see any issue with this approach?
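
To make the intent concrete, here is a minimal sketch of how a repair job could try to grab one of the per-datacenter lock resources. This is illustration only: {{LockFactory}} and {{DistributedLock}} are hypothetical abstractions (e.g. backed by CAS on a lock table), not existing Cassandra classes. The point is that mutual exclusion with other maintenance work would only require changing the resource string (e.g. to 'Global-{dc}-{i}').

{code:java}
import java.util.Optional;

public final class RepairLockExample
{
    // Hypothetical handle for a distributed lock; releasing it frees the resource.
    interface DistributedLock extends AutoCloseable
    {
        @Override
        void close();
    }

    // Hypothetical factory; tryLock() returns empty if another node already holds the resource.
    interface LockFactory
    {
        Optional<DistributedLock> tryLock(String resource);
    }

    /**
     * Attempts to acquire one of the per-datacenter repair resources.
     * The number of resources bounds how many repairs can run in parallel in the DC.
     */
    static Optional<DistributedLock> lockRepairResource(LockFactory factory, String dc, int parallelism)
    {
        for (int i = 1; i <= parallelism; i++)
        {
            Optional<DistributedLock> lock = factory.tryLock(String.format("RepairResource-%s-%d", dc, i));
            if (lock.isPresent())
                return lock;
        }
        return Optional.empty();
    }
}
{code}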

bq. Sounds good, let's start with the lockResource field in the repair session 
and move to scheduled repairs all together later on (maybe optionally scheduled 
via JMX at first?).

+1

bq. But as you said, it should be done in a separate ticket.

Created CASSANDRA-11190 for failing repairs fast and linked it as a requirement 
of this ticket.

bq. Would it be possible for a node to "drop" a validation/streaming without 
notifying the repair coordinator? 

Not unless there is a bug. Repair messages are undroppable, and nodes report 
failures back to the coordinator.

bq. Do we have any timeout scenarios that we could foresee before they occur? 
If we could detect that, it would be good to abort the repair as early as 
possible, assuming that the timeout would be set rather high.

We could probably handle explicit failures in CASSANDRA-11190, making sure all 
nodes are properly informed and abort their operations when any node fails. A 
timeout in this context would mainly help with hangs in streaming or 
validation. But as the protocol becomes more mature/correct, and with fail-fast 
in place, those hangs should become rare, so I'm not sure timeouts are required 
if we assume there are no hangs. I guess we can leave them out of the initial 
version for simplicity and add them later if necessary.
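
If we do add timeouts later, a rough sketch of a per-session watchdog could look like the following. All names here are hypothetical, not existing Cassandra code: the coordinator would reset the clock on every validation/streaming progress notification and fail the session if nothing is heard within the configured window.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative placeholder: fails a repair session if no progress is reported in time.
final class RepairSessionTimeoutGuard
{
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Runnable failSession;   // e.g. notify participants and abort the session
    private final long timeout;
    private final TimeUnit unit;
    private ScheduledFuture<?> pendingTimeout;

    RepairSessionTimeoutGuard(Runnable failSession, long timeout, TimeUnit unit)
    {
        this.failSession = failSession;
        this.timeout = timeout;
        this.unit = unit;
    }

    /** Call on every validation/streaming progress notification to reset the clock. */
    synchronized void onProgress()
    {
        if (pendingTimeout != null)
            pendingTimeout.cancel(false);
        pendingTimeout = scheduler.schedule(failSession, timeout, unit);
    }

    /** Call when the session finishes, whether successfully or with an explicit failure. */
    synchronized void onCompleted()
    {
        if (pendingTimeout != null)
            pendingTimeout.cancel(false);
        scheduler.shutdownNow();
    }
}
{code}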

> Automatic repair scheduling
> ---------------------------
>
>                 Key: CASSANDRA-10070
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10070
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: Distributed Repair Scheduling.doc
>
>
> Scheduling and running repairs in a Cassandra cluster is usually a required 
> task, but it can be hard for new users and also requires a bit of manual 
> configuration. There are good tools out there that simplify things, but 
> wouldn't this be a good feature to have inside Cassandra? It would 
> automatically schedule and run repairs, so that when you start up your 
> cluster it basically maintains itself in terms of normal anti-entropy, with 
> the possibility of manual configuration.


