[
https://issues.apache.org/jira/browse/CASSANDRA-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142861#comment-15142861
]
Paulo Motta commented on CASSANDRA-10070:
-----------------------------------------
Sorry for the delay, I will try to be faster on the next iterations. Below are
some comments on your previous reply:
bq. A problem with this table is that if we have a setup with two data centers
and three replicas in each data center, then we have a total of six replicas
and QUORUM would require four replicas to succeed. This would require that both
data centers are available to be able to run repair.
All data centers involved in a repair must be available for the repair to
start/succeed, so if we make the lock resource dc-aware and try to create the
lock by contacting a node in each involved data center with LOCAL_SERIAL
consistency, that should be sufficient to ensure correctness without the need
for a global lock. This will also play along well with both the dc_parallelism
global option and the {{\-\-local}} or {{\-\-dcs}} table repair options.
I thought of something along these lines:
{noformat}
dc_locks = {}
# depends on both keyspace settings and table repair settings (--local or --dcs)
dcs = repair_dcs(keyspace, table)
for dc in dcs:
    for i in 0..dc_parallelism(dc):
        if ((lock = get_node(dc).execute("INSERT INTO lock (resource) VALUES ('RepairResource-{dc}-{i}') IF NOT EXISTS USING TTL 30;", LOCAL_SERIAL)) != nil):
            dc_locks[dc] = lock
            break
if len(dc_locks) != len(dcs):
    release_locks(dc_locks)
else:
    start_repair(table)
{noformat}
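To make the LOCAL_SERIAL part a bit more concrete, here is roughly what a single per-DC slot acquisition could look like from the driver side. This is only a sketch: the {{system_distributed.lock}} keyspace/table name is an assumption on my side, the {{RepairResource-{dc}-{i}}} naming comes from the pseudocode above, and I am hand-waving the per-DC routing by assuming a session whose load balancing policy targets that data center.
{noformat}
# Sketch only: assumes a lock table roughly like
#   CREATE TABLE system_distributed.lock (resource text PRIMARY KEY);
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

LOCK_INSERT = SimpleStatement(
    "INSERT INTO system_distributed.lock (resource) VALUES (%s) "
    "IF NOT EXISTS USING TTL 30",
    serial_consistency_level=ConsistencyLevel.LOCAL_SERIAL)

def try_lock_dc(dc_session, dc, parallelism):
    """Try to grab one of the 'parallelism' repair slots of one data center.

    dc_session is assumed to route to coordinators in that DC, so the
    LOCAL_SERIAL paxos round runs against the right replicas.
    """
    for i in range(parallelism):
        resource = 'RepairResource-{}-{}'.format(dc, i)
        # LWT insert; was_applied reflects the [applied] column of the result
        if dc_session.execute(LOCK_INSERT, (resource,)).was_applied:
            return resource
    return None  # every slot in this DC is currently taken
{noformat}
The per-DC routing itself could probably reuse the existing dc-aware load balancing policies in the drivers, but that is an implementation detail.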
bq. Just a question regarding your suggestion with the
node_repair_parallelism. Should it be used to specify the number of repairs a
node can initiate or how many repairs the node can be an active part of in
parallel? I guess the second alternative would be harder to implement, but it
is probably what one would expect.
The second alternative is probably the most desirable. Actually, dc_parallelism
by itself might cause problems, since we can end up in a situation where all
repairs run on a single node or range, overloading those nodes. If we are to
support concurrent repairs in the first pass, I think we need both the
dc_parallelism and node_parallelism options together.
I thought we could extend the previous lock-acquiring algorithm with:
{noformat}
dc_locks = previous algorithm
if len(dc_locks) != len(dcs):
    release_locks(dc_locks)
    return
node_locks = {}
nodes = repair_nodes(table, range)
for node in nodes:
    for i in 0..node_parallelism(node):
        if ((lock = node.execute("INSERT INTO lock (resource) VALUES ('RepairResource-{node}-{i}') IF NOT EXISTS USING TTL 30;", LOCAL_SERIAL)) != nil):
            node_locks[node] = lock
            break
if len(node_locks) != len(nodes):
    release_locks(dc_locks)
    release_locks(node_locks)
else:
    start_repair(table)
{noformat}
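release_locks would then just be the symmetric LWT delete, something like the following (same caveats and assumed table name as the sketch above):
{noformat}
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

LOCK_DELETE = SimpleStatement(
    "DELETE FROM system_distributed.lock WHERE resource = %s IF EXISTS",
    serial_consistency_level=ConsistencyLevel.LOCAL_SERIAL)

def release_locks(session, locks):
    """Best-effort release of whatever slots we managed to acquire."""
    for resource in locks.values():
        session.execute(LOCK_DELETE, (resource,))
    # Even if a delete is lost, the 30s TTL on the insert bounds how long a
    # stale slot can block other nodes.
{noformat}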
This is becoming a bit complex and there are probably some edge cases and/or
starvation scenarios, so we should think carefully about them before jumping
into the implementation. What do you think about this approach? Should we stick
to a simpler non-parallel version in the first pass, or think this through and
already support parallelism in the first version?
bq. It should be possible to extend the repair scheduler with subrange repairs
I like the token_division approach for supporting subrange repairs in addition
to {{-pr}}, but we can think about this later.
bq. Agreed, are there any other scenarios that we might have to take into
account?
I can only think of upgrades and range movements (bootstrap, move, removenode,
etc.) right now.
We should also think more carefully about possible failure scenarios and
network partitions. What happens if the node cannot renew locks in a remote DC
due to a temporary network partition but the repair is still running? We should
probably cancel a repair when it is not able to renew its locks, and also have
some kind of garbage collector that kills ongoing repair sessions without
associated locks, so we don't exceed the configured {{dc_parallelism}} and
{{node_parallelism}}.
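To illustrate the renewal/cancellation idea, something along these lines could run on the repairing node while the repair session is in progress. Again only a sketch: it assumes the lock table gains an extra owner column (also written by the initial INSERT) so the holder can both refresh the TTL and detect losing the slot, and repair_in_progress/abort_repair are placeholder callbacks into the (hypothetical) repair scheduler.
{noformat}
import time

from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Assumes an extra 'owner' column on the lock table, written on acquisition.
LOCK_RENEW = SimpleStatement(
    "UPDATE system_distributed.lock USING TTL 30 "
    "SET owner = %s WHERE resource = %s IF owner = %s",
    serial_consistency_level=ConsistencyLevel.LOCAL_SERIAL)

def renew_locks(session, locks, my_id, repair_in_progress, abort_repair,
                interval=10):
    """Refresh the TTL on every held slot; cancel the repair if one is lost."""
    while repair_in_progress():
        for resource in locks.values():
            applied = session.execute(
                LOCK_RENEW, (my_id, resource, my_id)).was_applied
            if not applied:
                # The row expired or was grabbed by another node (e.g. we were
                # partitioned away) -- stop instead of exceeding the configured
                # dc_parallelism / node_parallelism.
                abort_repair()
                return
        time.sleep(interval)
{noformat}
The garbage collection part would be the reverse check: any repair session running on a node that no longer holds a matching lock row gets killed.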
> Automatic repair scheduling
> ---------------------------
>
> Key: CASSANDRA-10070
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10070
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Marcus Olsson
> Assignee: Marcus Olsson
> Priority: Minor
> Fix For: 3.x
>
> Attachments: Distributed Repair Scheduling.doc
>
>
> Scheduling and running repairs in a Cassandra cluster is most often a
> required task, but this can both be hard for new users and it also requires a
> bit of manual configuration. There are good tools out there that can be used
> to simplify things, but wouldn't this be a good feature to have inside of
> Cassandra? To automatically schedule and run repairs, so that when you start
> up your cluster it basically maintains itself in terms of normal
> anti-entropy, with the possibility for manual configuration.