[ 
https://issues.apache.org/jira/browse/CASSANDRA-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148842#comment-15148842
 ] 

Marcus Olsson commented on CASSANDRA-10070:
-------------------------------------------

bq. Do we intend to reuse the lock table for other maintenance tasks as well? 
If so, we must add a generic "holder" column to the lock table so we can reuse 
to identify resources other than the parent repair session in the future. We 
could also add an "attributes" map in the lock table to store additional 
attributes such as status, or have a separate table to maintain status to keep 
the lock table simple.

I think it could be reused, so it's probably better to make it generic from the 
start. As long as we don't put too much data in the attributes map, it could be 
stored in the lock table. The attributes are also tightly bound to the lock 
itself, since we will use them to clean up repairs that no longer hold a lock, 
so keeping everything in a single table is probably the easiest solution.
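
As a rough sketch of what a generic lock entry could look like on the Java 
side (LockRow and all of its field names are hypothetical, nothing here exists 
in Cassandra today), assuming the holder is identified by a type plus an id and 
the attributes map only carries small values such as status:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical in-memory view of one lock table row; not an existing Cassandra class.
final class LockRow
{
    final String resource;                // name of the locked resource
    final String holderType;              // generic holder kind, e.g. "repair"
    final UUID holderId;                  // e.g. the parent repair session id
    final Map<String, String> attributes; // small extras such as status

    LockRow(String resource, String holderType, UUID holderId, Map<String, String> attributes)
    {
        this.resource = resource;
        this.holderType = holderType;
        this.holderId = holderId;
        // Defensive copy so the row cannot change after it has been read.
        this.attributes = Collections.unmodifiableMap(new HashMap<>(attributes));
    }

    // True if this lock is held by the given holder, whatever kind of task it is.
    boolean isHeldBy(String type, UUID id)
    {
        return holderType.equals(type) && holderId.equals(id);
    }
}
```

Keeping the holder as a (type, id) pair is only one option; it would let other 
maintenance tasks reuse the same row layout without repair-specific columns.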

Another thing we should probably consider is whether multiple types of 
maintenance work should be allowed to run simultaneously. If we need to prevent 
that, should they use the same lock resources?

bq. Ideally all repairs would go through this interface, but this would 
probably add complexity at this stage. So we should probably just add a 
"lockResource" attribute to each repair session object, and each node would go 
through all repairs currently running checking if it still holds the lock in 
case the "lockResource" field is set.

Sounds good, let's start with the lockResource field in the repair session and 
move all repairs over to scheduled repairs later on (maybe optionally scheduled 
via JMX at first?).
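
The periodic check could be sketched roughly like this. RepairLockChecker, 
Session and the currentHolders map are all hypothetical stand-ins (in practice 
the holder information would be read from the lock table), and the only 
assumption is that a session with no lockResource set is skipped:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch; none of these types exist in Cassandra.
final class RepairLockChecker
{
    // Stand-in for a repair session carrying the optional lockResource field.
    static final class Session
    {
        final UUID id;
        final String lockResource; // null for repairs started without a lock

        Session(UUID id, String lockResource)
        {
            this.id = id;
            this.lockResource = lockResource;
        }
    }

    /**
     * Returns the running sessions that have lockResource set but are no
     * longer the holder of that lock. currentHolders maps each lock resource
     * to the id of the session currently holding it.
     */
    static List<Session> sessionsWithoutLock(List<Session> running, Map<String, UUID> currentHolders)
    {
        List<Session> lost = new ArrayList<>();
        for (Session session : running)
        {
            if (session.lockResource == null)
                continue; // not started through the lock, nothing to verify

            UUID holder = currentHolders.get(session.lockResource);
            if (!session.id.equals(holder))
                lost.add(session); // lock expired or was taken by someone else
        }
        return lost;
    }
}
```

What to do with the returned sessions (abort, clean up state) is left open 
here, since that depends on how the abort step ends up being implemented.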

{quote}
It would probably be safe to abort ongoing validation and stream background 
tasks and cleanup repair state on all involved nodes before starting a new 
repair session in the same ranges. This doesn't seem to be done currently. As 
far as I understood, if there are nodes A, B, C running repair, A is the 
coordinator. If validation or streaming fails on node B, the coordinator (A) is 
notified and fails the repair session, but node C will remain doing validation 
and/or streaming, what could cause problems (or increased load) if we start 
another repair session on the same range. 

We will probably need to extend the repair protocol to perform this 
cleanup/abort step on failure. We already have a legacy cleanup message that 
doesn't seem to be used in the current protocol that we could maybe reuse to 
cleanup repair state after a failure. This repair abortion will probably have 
intersection with CASSANDRA-3486. In any case, this is a separate (but related) 
issue and we should address it in an independent ticket, and make this ticket 
dependent on that.
{quote}

Right now it seems that the cleanup message is only used to remove the parent 
repair session from the ActiveRepairService's map. I guess that if we were to 
use it, we would have to rewrite it to stop validation and streaming as well. 
But as you said, that should be done in a separate ticket.

bq. Another unrelated option that we should probably include in the future is a 
timeout, and abort repair sessions running longer than that.

Agreed. Are there any timeout scenarios that we could foresee before they 
occur? Would it be possible for a node to "drop" a validation/streaming task 
without notifying the repair coordinator? If we could detect that, it would be 
good to abort the repair as early as possible, assuming that the timeout would 
be set rather high.

> Automatic repair scheduling
> ---------------------------
>
>                 Key: CASSANDRA-10070
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10070
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: Distributed Repair Scheduling.doc
>
>
> Scheduling and running repairs in a Cassandra cluster is most often a 
> required task, but it can be hard for new users and it also requires a bit 
> of manual configuration. There are good tools out there that can be used to 
> simplify things, but wouldn't this be a good feature to have inside of 
> Cassandra? It would automatically schedule and run repairs, so that when you 
> start up your cluster it basically maintains itself in terms of normal 
> anti-entropy, with the possibility for manual configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
