[
https://issues.apache.org/jira/browse/CASSANDRA-3486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259008#comment-15259008
]
Paulo Motta commented on CASSANDRA-3486:
----------------------------------------
Thanks for the feedback [~nickmbailey]. See follow-up below:
bq. If the abort is initiated on the coordinator can we return the
success/failure of the attempt to abort on the participants as well? And vice
versa? Similarly for the list of results when aborting all jobs.
We could, but in this initial implementation I opted to take an optimistic
approach to keep the protocol simple and non-blocking. If for some reason there
is a network partition and "orphaned" sessions keep running, you can always
abort them individually later. Do you think a blocking + timeout approach would
be preferable?
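
To make the trade-off concrete, here is a minimal sketch of the two options. It is not code from the patch: `AbortSketch`, `Participant` and `sendAbort` are hypothetical names, and the transport is assumed to hand back a future that completes with the participant's acknowledgement.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class AbortSketch {
    /** Hypothetical transport: sends an abort and completes with true once the participant acknowledges it. */
    interface Participant {
        CompletableFuture<Boolean> sendAbort(UUID sessionId);
    }

    /** Optimistic variant: send the abort to everyone and return immediately, ignoring replies. */
    static void abortOptimistic(UUID sessionId, Map<String, Participant> participants) {
        participants.values().forEach(p -> p.sendAbort(sessionId));
    }

    /** Blocking variant: wait up to timeoutMillis overall and report success or failure per participant. */
    static Map<String, Boolean> abortBlocking(UUID sessionId,
                                              Map<String, Participant> participants,
                                              long timeoutMillis) {
        // Send all aborts first so the participants work in parallel.
        Map<String, CompletableFuture<Boolean>> pending = new LinkedHashMap<>();
        participants.forEach((name, p) -> pending.put(name, p.sendAbort(sessionId)));

        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (Map.Entry<String, CompletableFuture<Boolean>> e : pending.entrySet()) {
            long remaining = Math.max(0L, deadline - System.nanoTime());
            boolean ok;
            try {
                ok = Boolean.TRUE.equals(e.getValue().get(remaining, TimeUnit.NANOSECONDS));
            } catch (TimeoutException | ExecutionException ex) {
                ok = false; // no reply in time, or the abort failed remotely
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                ok = false;
            }
            results.put(e.getKey(), ok);
        }
        return results;
    }
}
```

With the blocking variant, a participant that never replies shows up as `false` in the result map once the timeout elapses, which is the kind of per-participant result asked about above. The cost is that the caller is held for up to the full timeout.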
bq. Can we make sure we are testing the case where for whatever reason a
coordinator or participant receives an abort for a repair it doesn't know about?
Sure. One change in this patch that I forgot to mention is that all messages
are validated against the repair session UUID, so if a node receives a message
for a repair it doesn't know about, it logs the message and ignores it.
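
As a rough illustration of that check (a sketch only; `RepairMessageFilter` and its methods are made-up names, not the classes touched by the patch), the validation could look like this:

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

final class RepairMessageFilter {
    private static final Logger logger = Logger.getLogger(RepairMessageFilter.class.getName());

    // Session UUIDs this node currently knows about (hypothetical bookkeeping).
    private final Set<UUID> activeSessions = ConcurrentHashMap.newKeySet();

    void register(UUID sessionId) {
        activeSessions.add(sessionId);
    }

    void finish(UUID sessionId) {
        activeSessions.remove(sessionId);
    }

    /** Returns true if the message was accepted, false if it was logged and ignored. */
    boolean handle(UUID sessionId, String messageType) {
        if (!activeSessions.contains(sessionId)) {
            logger.warning("Ignoring " + messageType + " for unknown repair session " + sessionId);
            return false;
        }
        // Known session: a real handler would dispatch on the message type here.
        return true;
    }
}
```

An abort for a session that was never registered, or that has already finished, falls into the first branch, so it is logged and dropped rather than raising an error.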
bq. Since we are now tracking repairs by uuid like this, can we expose a
progress API outside of the jmx notification process? An mbean for retrieving
the progress/status of a repair job by uuid?
We could, but we currently don't keep state or progress information in the
repair session. Furthermore, we clear repair session information as soon as the
session finishes, so the list repairs stub only lists currently active repairs.
We would therefore need to maintain progress status and provide some way to
clear repair information after some time.
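
One possible shape for that, sketched under my own assumptions (the `RepairStatusRegistry` class, its `State` values and the retention parameter are all hypothetical, and nothing like this exists in the patch), is a registry keyed by session UUID that keeps finished entries for a retention period before purging them:

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class RepairStatusRegistry {
    enum State { RUNNING, COMPLETED, FAILED, ABORTED }

    static final class Status {
        final State state;
        final int progressPercent;
        final long finishedAtMillis; // -1 while the repair is still running

        Status(State state, int progressPercent, long finishedAtMillis) {
            this.state = state;
            this.progressPercent = progressPercent;
            this.finishedAtMillis = finishedAtMillis;
        }
    }

    private final Map<UUID, Status> statuses = new ConcurrentHashMap<>();
    private final long retentionMillis;

    RepairStatusRegistry(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    /** Records the latest state; any state other than RUNNING stamps the finish time. */
    void update(UUID sessionId, State state, int progressPercent, long nowMillis) {
        long finishedAt = (state == State.RUNNING) ? -1L : nowMillis;
        statuses.put(sessionId, new Status(state, progressPercent, finishedAt));
    }

    /** What an mbean lookup by UUID could return. */
    Optional<Status> get(UUID sessionId) {
        return Optional.ofNullable(statuses.get(sessionId));
    }

    /** Drops finished entries older than the retention period; running repairs are never purged. */
    void purgeExpired(long nowMillis) {
        statuses.values().removeIf(s ->
            s.finishedAtMillis >= 0 && nowMillis - s.finishedAtMillis >= retentionMillis);
    }
}
```

The clock is passed in explicitly to keep the sketch easy to test; a real implementation would more likely call `purgeExpired` from a scheduled task.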
I personally think we should go this route of making repair more stateful, which
will not only improve monitoring but also allow us to break up a repair job
into more decoupled subtasks, simplifying the single chain of futures we have
today, which is quite complex to understand and error-prone.
> Node Tool command to stop repair
> --------------------------------
>
> Key: CASSANDRA-3486
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3486
> Project: Cassandra
> Issue Type: Improvement
> Components: Tools
> Environment: JVM
> Reporter: Vijay
> Assignee: Paulo Motta
> Priority: Minor
> Labels: repair
> Fix For: 2.1.x
>
> Attachments: 0001-stop-repair-3583.patch
>
>
> After CASSANDRA-1740, if the validation compaction is stopped then the repair
> will hang. This ticket will allow users to kill the original repair.