[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-10-05 Thread Anubhav Kale (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549228#comment-15549228
 ] 

Anubhav Kale commented on CASSANDRA-8911:
-

Have we tested this at large scale yet? Just curious about the future of this 
ticket. Thanks!

> Consider Mutation-based Repairs
> ---
>
> Key: CASSANDRA-8911
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8911
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Tyler Hobbs
>Assignee: Marcus Eriksson
> Fix For: 3.x
>
>
> We should consider a mutation-based repair to replace the existing streaming 
> repair.  While we're at it, we could do away with a lot of the complexity 
> around merkle trees.
> I have not planned this out in detail, but here's roughly what I'm thinking:
>  * Instead of building an entire merkle tree up front, just send the "leaves" 
> one-by-one.  Instead of dealing with token ranges, make the leaves primary 
> key ranges.  The PK ranges would need to be contiguous, so that the start of 
> each range would match the end of the previous range. (The first and last 
> leaves would need to be open-ended on one end of the PK range.) This would be 
> similar to doing a read with paging.
>  * Once one page of data is read, compute a hash of it and send it to the 
> other replicas along with the PK range that it covers and a row count.
>  * When the replicas receive the hash, they perform a read over the same PK 
> range (using a LIMIT of the row count + 1) and compare hashes (unless the row 
> counts don't match, in which case this can be skipped).
>  * If there is a mismatch, the replica will send a mutation covering that 
> page's worth of data (ignoring the row count this time) to the source node.
> Here are the advantages that I can think of:
>  * With the current repair behavior of streaming, vnode-enabled clusters may 
> need to stream hundreds of small SSTables.  This results in increased 
> compaction load on the receiving node.  With the mutation-based approach, memtables 
> would naturally merge these.
>  * It's simple to throttle.  For example, you could give a number of rows/sec 
> that should be repaired.
>  * It's easy to see what PK range has been repaired so far.  This could make 
> it simpler to resume a repair that fails midway.
>  * Inconsistencies start to be repaired almost right away.
>  * Less special code (?)
>  * Wide partitions are no longer a problem.
> There are a few problems I can think of:
>  * Counters.  I don't know if this can be made safe, or if they need to be 
> skipped.
>  * To support incremental repair, we need to be able to read from only 
> repaired sstables.  Probably not too difficult to do.
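
To make the page/hash/compare flow described above concrete, here is a minimal,
self-contained sketch of the per-page digest comparison. The {{PageDigest}} class
and the byte[]-per-row model are illustrative only, not actual Cassandra code:

{code:java}
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

// Hedged sketch of the per-page comparison: the coordinator hashes one page of
// rows, the replica re-reads the same PK range (LIMIT rowCount + 1) and compares.
// A row is modelled as a plain byte[] here; real code would hash serialized partitions.
final class PageDigest
{
    final byte[] hash;
    final int rowCount;

    PageDigest(List<byte[]> pageRows) throws Exception
    {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (byte[] row : pageRows)
            md5.update(row);
        this.hash = md5.digest();
        this.rowCount = pageRows.size();
    }

    /** Replica side: true if the local read over the same PK range matches this digest. */
    boolean matches(List<byte[]> localRows) throws Exception
    {
        if (localRows.size() != rowCount)   // row counts differ: no need to hash, it's a mismatch
            return false;
        return Arrays.equals(hash, new PageDigest(localRows).hash);
    }
}
{code}

On a mismatch the replica would then read the same range again without the row
limit and send the result as a mutation to the source node, as the description
says.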





[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-08-08 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411511#comment-15411511
 ] 

Marcus Eriksson commented on CASSANDRA-8911:


Thanks for the early review

bq. I don't entirely get why we need a separate HUGE response, can't we just 
page back the data from the mismatching replica in the MBRResponse similar to 
ordinary responses?
There are two reasons: first, we save one round trip during the repair of a 
page; second, it is easier to diff the remote data with the local data if we are 
not paging (i.e., we know we are comparing the entire result, not just a page of 
it).
bq. While it's handy to have a rows/s metric for repaired rows, I think it 
might be more useful for operators to have a repair throttle knob in MB/S (in 
line with compaction/stream throughput) rather than rows/s, since the load 
imposed by a repairing row can vary between different tables, so it might be a 
bit trickier to do capacity planning based on rows/s.
I have not focused on metrics at all yet, and yes we should add more.
bq. Currently we're throttling only at the coordinator, but it would probably 
be interesting to use the same rate limiter to also throttle participant 
reads/writes.
hmm, maybe makes sense to only throttle on the replicas
bq. Can we make MBROnHeapUnfilteredPartitions a lazy iterator that caches rows 
while it's traversed?
good point, will fix
bq. Also, can we leverage MBROnHeapUnfilteredPartitions (or similar) to cache 
iterated rows on MBRVerbHandler?
yep
bq. Try to generify unfiltered and filtered methods/classes into common 
classes/methods when possible (UnfilteredPager, UnfilteredPagersIterator, 
getUnfilteredRangeSlice, etc) (this is probably on your TODO but just a 
friendly reminder)
yeah, a bunch of todos in the code around this
bq. Right now the service runs a full repair for a table which probably makes 
sense for triggered repairs, but when we make that continuous we will probably 
need to interleave subrange repair of different tables to ensure fairness (so 
big tables don't starve small table repairs).
Yes, splitting the local ranges into smaller parts based on the size of the 
table, then sequentially repairing each part, probably makes sense
bq. While repairing other replicas right away will probably get nodes 
consistent faster, I wonder if we can make any simplification by repairing only 
the coordinator node but since the diffs are already there I don't see any 
reason not to do it.
Yes, we should repair the replicas as well, since we know what they need to get 
in sync - also, for incremental repair a range is only considered repaired if we 
know that the replicas are in sync



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-08-05 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410230#comment-15410230
 ] 

Paulo Motta commented on CASSANDRA-8911:


Had an initial look at the WIP branch and overall I think the approach and 
implementation look good and are not very far from being ready to move into 
testing/benchmarks, great job! See some preliminary comments below:
* I don't entirely get why we need a separate HUGE response, can't we just page 
back the data from the mismatching replica in the MBRResponse similar to 
ordinary responses?
* While it's handy to have a rows/s metric for repaired rows, I think it might 
be more useful for operators to have a repair throttle knob in MB/s (in line 
with compaction/stream throughput) rather than rows/s, since the load imposed 
by a repairing row can vary between different tables, so it might be a bit 
trickier to do capacity planning based on rows/s (see the rate-limiter sketch 
after this list).
* Currently we're throttling only at the coordinator, but it would probably be 
interesting to use the same rate limiter to also throttle participant 
reads/writes.
* Can we make MBROnHeapUnfilteredPartitions a lazy iterator that caches rows 
while it's traversed?
** Also, can we leverage MBROnHeapUnfilteredPartitions (or similar)  to cache 
iterated rows on MBRVerbHandler?
* Try to generify unfiltered and filtered methods/classes into common 
classes/methods when possible (UnfilteredPager, UnfilteredPagersIterator, 
getUnfilteredRangeSlice, etc) (this is probably on your TODO but just a 
friendly reminder)
* Right now the service runs a full repair for a table, which probably makes 
sense for triggered repairs, but when we make that continuous we will probably 
need to interleave subrange repair of different tables to ensure fairness (so 
big tables don't starve small table repairs).
* While repairing other replicas right away will probably get nodes consistent 
faster, I wonder if we can make any simplification by repairing only the 
coordinator node; but since the diffs are already there, I don't see any reason 
not to do it.
* We should probably batch read-repair mutations to other replicas, but we can 
maybe do that in a separate ticket.
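
To make the MB/s throttle suggestion in the list above concrete, here is a
minimal sketch using Guava's RateLimiter with bytes as permits, similar in
spirit to how compaction/stream throughput throttling works. The {{MbrThrottle}}
class, its constructor argument and the serializedPageSize parameter are made up
for illustration:

{code:java}
import com.google.common.util.concurrent.RateLimiter;

// Hedged sketch: throttle repair traffic in bytes/s rather than rows/s.
// The constructor argument and serializedPageSize are illustrative placeholders,
// not existing Cassandra configuration or APIs.
public class MbrThrottle
{
    private final RateLimiter limiter;

    public MbrThrottle(int repairThroughputMbPerSec)
    {
        // permits are bytes; a non-positive setting effectively disables throttling
        double bytesPerSec = repairThroughputMbPerSec <= 0
                           ? Double.MAX_VALUE
                           : repairThroughputMbPerSec * 1024.0 * 1024.0;
        this.limiter = RateLimiter.create(bytesPerSec);
    }

    /** Call before sending/applying a page; blocks until enough byte permits are available. */
    public void acquire(long serializedPageSize)
    {
        if (serializedPageSize > 0)
            limiter.acquire((int) Math.min(serializedPageSize, Integer.MAX_VALUE));
    }
}
{code}

With a knob like this, operators can reason about repair load the same way they
already do for compaction and streaming throughput.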

Minor nits:
* It would probably be good to bound the number of retries and throw an 
exception if that bound is exceeded
* On MBRDataRangeTest.wrappingTest, I didn't get why keys (1,0)..(1,9) are not 
returned?

I think the initial dtests look good; it would be nice to perhaps extend them 
with tests exercising collection and/or range tombstones. It would maybe be good 
to add unit tests exercising some edge cases, like repair pages ending before a 
partition key is finished (probably with the help of the mocking classes 
introduced by CASSANDRA-12016).

For the performance tests, it would probably be good to compare execution time 
and repair efficiency (stored rows/mismatched rows) of full repair with little 
and many mismatching rows, with scenarios with and without traffic in the 
background. We can probably decrease the write timeout and disable hinted 
handoff to cause dropped mutations in order to introduce some noise.

Nice work, please let me know if I can help in any of the remaining tasks or 
tests.


[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-05 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272986#comment-15272986
 ] 

Paulo Motta commented on CASSANDRA-8911:


Thanks for the input! IMO the number of threads for running repair should 
probably be configurable. My initial concern was whether this was going to be a 
shared pool for all tables, or a continuously-running, globally-throttled 
pool-per-table, because if we need to prioritize between different tables/ranges 
on a shared pool, that is already going to be provided by CASSANDRA-10070, so we 
can probably reuse it to automate MBRs. But as you said, this is probably a side 
concern that can be further discussed on CASSANDRA-10070 or another ticket.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-05 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272946#comment-15272946
 ] 

Jeremiah Jordan commented on CASSANDRA-8911:


If we are going to discuss it here, then for the single vs multi-thread thing, 
I know that DataStax OpsCenter does multi-threaded repairs for its current 
repair service, because depending on your cluster size and data per node, 
single-threaded can't finish in 7 days.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-05 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272917#comment-15272917
 ] 

Paulo Motta commented on CASSANDRA-8911:


bq. not sure here, would be nice to be able to prioritise a given range for a 
table, say we lose an sstable for example, automatically repairing the range 
that sstable covered immediately.

bq. It should probably be just a single thread, one table after the other. And 
then maybe having a priority queue or something with ranges to repair 
immediately, wdyt Paulo Motta? This prio queue thing might be a bit of gold 
plating that we could do later.

Sounds good! While we will remove the need for coordination between inter-node 
repairs (if we replace traditional repairs with this in the future), we will 
still need some level of management/scheduling for MBRs, which we should keep in 
mind when designing/implementing the repair scheduling/management framework of 
CASSANDRA-10070 so that it can also be useful to MBR in the future. That said, 
we should probably focus on core MBR functionality here and take any auto-repair 
considerations to CASSANDRA-10070 so we can have a single interface for managing 
repairs (MBRs or not).

As for releasing this incrementally, I agree that we should have a minimum 
level of confidence that it is promising (at least for some use cases) before 
releasing it, even if it's only experimental, but throwing it into the wild 
early will certainly give us some important feedback that we can use to validate 
and improve it. Furthermore, if it proves worthwhile it will probably live 
alongside traditional repairs for a long time (maybe forever for 
counters/DTCS?).

Some early bikeshedding here (since it's not ready for review yet): when 
possible I think we should add new options to the existing repair interfaces 
rather than creating specific interfaces for MBR (for instance, {{nodetool 
repair --mutation-based}} instead of {{nodetool mutationbasedrepair}}, 
{{StorageService.repairAsync}} instead of {{CFS.enableMutationBasedRepair}}), 
since this is just a different implementation of the same functionality (unless 
there's something that does not fit well or is not covered by existing 
interfaces); this will also allow existing tools to try out MBR with little 
change.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-05 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272748#comment-15272748
 ] 

Joshua McKenzie commented on CASSANDRA-8911:


bq. you don't think we should release this incrementally?
I'm all for us releasing something incrementally; I'm just concerned about the 
idea of us committing code without a clear idea of the benefits it adds and the 
strong possibility of having to remove it in the future.

I'd expect we should be able to show the value of each step in the chain you've 
mentioned. I dislike the idea of us adding this complexity and another knob 
into the code-base if we're not clear that it has value either over or 
alongside the traditional repairs.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-05 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272689#comment-15272689
 ] 

Jeremiah Jordan commented on CASSANDRA-8911:


bq. not sure here, would be nice to be able to prioritise a given range for a 
table, say we lose an sstable for example, automatically repairing the range 
that sstable covered immediately.
bq. It should probably be just a single thread, one table after the other. And 
then maybe having a priority queue or something with ranges to repair 
immediately, wdyt Paulo Motta? This prio queue thing might be a bit of gold 
plating that we could do later.

This discussion should probably be on the other ticket?



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-05 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272675#comment-15272675
 ] 

Marcus Eriksson commented on CASSANDRA-8911:


bq. is the idea to have one thread per table running continuously, or a 
fixed-size pool for all tables? If the second option we may still need to 
prioritize which table should run MBR next.

Not sure here; it would be nice to be able to prioritise a given range for a 
table - say we lose an sstable, for example - and automatically repair the range 
that sstable covered immediately.

It should probably be just a single thread, one table after the other. And then 
maybe having a priority queue or something with ranges to repair immediately, 
wdyt [~pauloricardomg]? This prio queue thing might be a bit of gold plating 
that we could do later.
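
A minimal sketch of what that could look like, assuming a single repair thread
draining a priority queue so that urgent ranges (e.g. after losing an sstable)
jump ahead of the continuous background pass; all class and method names here
are hypothetical, not actual Cassandra code:

{code:java}
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical single-threaded MBR scheduler: urgent ranges use a lower priority
// value and are repaired before the continuous background pass over each table.
public class MbrScheduler implements Runnable
{
    public static final int URGENT = 0, BACKGROUND = 10;

    public static final class RepairTask implements Comparable<RepairTask>
    {
        final int priority;
        final String table;    // "keyspace.table"
        final String pkRange;  // placeholder for the PK range to repair

        RepairTask(int priority, String table, String pkRange)
        {
            this.priority = priority;
            this.table = table;
            this.pkRange = pkRange;
        }

        public int compareTo(RepairTask other)
        {
            return Integer.compare(priority, other.priority);
        }
    }

    private final PriorityBlockingQueue<RepairTask> queue = new PriorityBlockingQueue<>();

    public void submit(RepairTask task)
    {
        queue.add(task);
    }

    public void run()
    {
        while (!Thread.currentThread().isInterrupted())
        {
            try
            {
                repairRange(queue.take()); // urgent tasks come out first
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void repairRange(RepairTask task)
    {
        // placeholder: run the page/hash/compare repair over task.pkRange
    }
}
{code}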



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-05 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272661#comment-15272661
 ] 

Marcus Eriksson commented on CASSANDRA-8911:


bq. We shouldn't commit anything to the code-base if it doesn't work in 
production environments
That is not what I meant; the nodetool command should of course work as 
advertised, but we might remove it if, for example, the benefit over old-style 
repairs is too small.

[~JoshuaMcKenzie] so, you don't think we should release this incrementally? 
Being able to release small but still usable things is a good thing imo.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-05 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272630#comment-15272630
 ] 

Joshua McKenzie commented on CASSANDRA-8911:


bq. Reason I prefer to release it incrementally like this is that I don't want 
us to waste a lot of time if the approach does not really work in real life
We shouldn't commit anything to the code-base if it doesn't work in production 
environments, incremental release and experimental feature or not.

Certainly we can test this to determine if it's viable from a performance 
perspective and give recommendations regarding its limitations in general usage 
before committing.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-04 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15270947#comment-15270947
 ] 

Paulo Motta commented on CASSANDRA-8911:


bq.  the nice thing about this is that it requires no coordination or 
scheduling at all - all nodes should run this (slowly) all the time

Is the idea to have one thread per table running continuously, or a fixed-size 
pool for all tables? If the second option, we may still need to prioritize which 
table should run MBR next.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-04 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15270926#comment-15270926
 ] 

Marcus Eriksson commented on CASSANDRA-8911:


Thanks [~pauloricardomg], it is not ready for review yet, so finish your other 
things first :)

Regarding CASSANDRA-10070, the nice thing about this is that it requires no 
coordination or scheduling at all - all nodes should run this (slowly) all the 
time. What we need to figure out is how to automatically size the repair pages, 
plus rate limiting, error handling, etc.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-04 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15270884#comment-15270884
 ] 

Paulo Motta commented on CASSANDRA-8911:


I'd be interested in reviewing this, but since it's quite a big/relevant change 
another reviewer would probably be welcome (especially on the read-path side, 
since I'm not totally comfortable with it). I have some priority things on my 
plate right now, so I will have a better look and post comments next week.

For
bq. * continuously run this
we could probably rely on CASSANDRA-10070, which should provide facilities for 
automating repairs independently of the implementation.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-04 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15270842#comment-15270842
 ] 

Marcus Eriksson commented on CASSANDRA-8911:


No, it still has a few problems - first, we won't be able to eliminate as many 
sstables during queries based on the min/max clustering values, since they all 
cover a large time span; and second, since "all" sstables will contain very old 
data, we can't do the "drop this sstable as it only contains expired 
data"-optimization during compaction, since that old data might still be newer 
than the data in other sstables.
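
For illustration, roughly the shape of that "fully expired" check, heavily
simplified and with made-up field names (not the real compaction code): a
candidate sstable can only be dropped when everything in it is gc-able and its
newest cell is older than the oldest cell of every overlapping sstable, and the
second condition almost never holds once repair keeps writing very old rows into
new sstables:

{code:java}
import java.util.List;

// Simplified stand-in for sstable metadata; not the real Cassandra classes.
final class SSTableStats
{
    final long minTimestamp, maxTimestamp;
    final int maxLocalDeletionTime;

    SSTableStats(long minTimestamp, long maxTimestamp, int maxLocalDeletionTime)
    {
        this.minTimestamp = minTimestamp;
        this.maxTimestamp = maxTimestamp;
        this.maxLocalDeletionTime = maxLocalDeletionTime;
    }

    static boolean canDropAsExpired(SSTableStats candidate, List<SSTableStats> overlapping, int gcBefore)
    {
        if (candidate.maxLocalDeletionTime >= gcBefore)
            return false;                     // still contains data that is not gc-able yet
        for (SSTableStats other : overlapping)
            if (candidate.maxTimestamp >= other.minTimestamp)
                return false;                 // dropping it could resurrect shadowed data
        return true;
    }
}
{code}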



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-04 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15270811#comment-15270811
 ] 

T Jake Luciani commented on CASSANDRA-8911:
---

I'll volunteer to write some large scale tests for this feature



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-04 Thread Jon Haddad (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15270790#comment-15270790
 ] 

Jon Haddad commented on CASSANDRA-8911:
---

Regarding DTCS, wouldn't that have been solved with CASSANDRA-11056, if it had 
been merged?



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-05-04 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15270766#comment-15270766
 ] 

Marcus Eriksson commented on CASSANDRA-8911:


Have pushed a bunch of new commits to the 
[branch|https://github.com/krummas/cassandra/commits/marcuse/8911]

Apart from a bunch of fixes, it adds a nodetool command to trigger a repair, e.g. {{$ nodetool -p7100 mutationbasedrepair stresscql datatest -w 2000 -r 8000}}, where {{-w}} is the page size and {{-r}} is the number of rows per second to repair.
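
To give an idea of what the {{-r}} throttle amounts to, a minimal sketch using Guava's {{RateLimiter}} could look like the following; the names are illustrative and this is not the code in the branch:

{code:java}
import com.google.common.util.concurrent.RateLimiter;

import java.util.List;

// Illustrative rows/second throttle in the spirit of the -r flag above.
public class ThrottleSketch
{
    public static void repairAllPages(Iterable<List<byte[]>> pages, double rowsPerSecond)
    {
        RateLimiter limiter = RateLimiter.create(rowsPerSecond);
        for (List<byte[]> page : pages)
        {
            limiter.acquire(page.size()); // block until this many "row permits" are available
            repairPage(page);             // hash the page, send to replicas, apply diffs (placeholder)
        }
    }

    private static void repairPage(List<byte[]> page) { /* placeholder */ }
}
{code}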

Also pushed a bunch of dtests 
[here|https://github.com/krummas/cassandra-dtest/commits/marcuse/8911]

My plan for this is:
* Add a separate memtable without commitlog; instead, we record the last repaired page when we flush this memtable, so if we lose the memtable we will start repairing again from the point where we last flushed it (see the sketch after this list). We can probably also skip serving reads from this separate memtable.
* Make it impossible to start if you run DTCS
* Clean up code, get it reviewed etc.
* Release it as an "experimental" feature
* Create new tickets:
**  make it incremental
** continuously run this
** investigate how to handle gc grace
** make it safe to use with DTCS
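
As a rough illustration of the first point above (recording the last repaired page when the repair memtable is flushed), something like the sketch below would do; all names and the persistence call are placeholders, not code from the branch:

{code:java}
import java.util.concurrent.atomic.AtomicReference;

// Illustrative checkpoint for resuming repair after a lost repair-only memtable.
public class RepairCheckpointSketch
{
    // End key of the last page that has been flushed out of the repair-only memtable.
    private final AtomicReference<String> lastFlushedPageEnd = new AtomicReference<>();

    // Called whenever the repair-only memtable is flushed to a repaired sstable.
    public void onRepairMemtableFlush(String pageEndKey)
    {
        lastFlushedPageEnd.set(pageEndKey);
        persist(pageEndKey); // e.g. a local system table; placeholder here
    }

    // If the memtable is lost, repair restarts from the last flushed page boundary.
    public String resumeFrom()
    {
        return lastFlushedPageEnd.get();
    }

    private void persist(String pageEndKey) { /* placeholder for durable storage */ }
}
{code}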

The reason I prefer to release it incrementally like this is that I don't want us to waste a lot of time if the approach does not really work in real life.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-04-22 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253684#comment-15253684
 ] 

Aleksey Yeschenko commented on CASSANDRA-8911:
--

bq. Avoid breaking DTCS etc, since all mutations go into the same memtable

They are now, but they don't have to be with this ticket. Repair should probably have its own set of memtables, so that we avoid messing up compaction strategies and, in general, isolate the regular write path from the repair-mutation write path for load-control purposes. For the same reason, we should not be reusing the {{MUTATION}} verb.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2016-04-22 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253574#comment-15253574
 ] 

Marcus Eriksson commented on CASSANDRA-8911:


wip branch for this here: 
https://github.com/krummas/cassandra/commits/marcuse/8911

It mostly follows what [~thobbs] outlined above (a rough sketch of the replica-side comparison follows the list):
* The repairing node pages through its local data with page size = {{WINDOW_SIZE}}, [calculating a hash|https://github.com/krummas/cassandra/blob/marcuse/8911/src/java/org/apache/cassandra/repair/mutation/MBRService.java#L149-L155] for each page and sending the hash to its replicas. We figure out the {{start}} and {{end}} "keys" (partition key + clustering) that the remote nodes should compare the hash for, and the repairing node sends a [RepairPage|https://github.com/krummas/cassandra/blob/marcuse/8911/src/java/org/apache/cassandra/repair/mutation/MBRRepairPage.java#L47-L52] containing the needed information.
* The replicas [read 
up|https://github.com/krummas/cassandra/blob/marcuse/8911/src/java/org/apache/cassandra/repair/mutation/MBRVerbHandler.java#L58]
 the local data between the {{start}}/{{end}} keys above with a limit of 
{{WINDOW_SIZE * 2}}
** If the hashes match, we reply that the data matched
** If we hit the limit when reading within the {{start}}/{{end}}, we consider 
this a "huge" response and we handle that separately - we reply to the 
repairing node that we have many rows between {{start}}/{{end}}, and the 
repairing node will page back the data from that node. This can happen if the 
repairing node has lost an sstable for example.
** If the hashes don't match, we reply with the data and the repairing node 
will diff its data within the window with the remote data and only write the 
differences to the memtable
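
To make the replica-side handling concrete, here is a minimal, self-contained sketch; the types and the MD5 digest are stand-ins, not what the branch actually does:

{code:java}
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

// Illustrative replica-side handling of one repair page.
public class PageCompareSketch
{
    enum Result { MATCH, MISMATCH, HUGE }

    // localRows stands in for the rows read between the page's start/end keys,
    // read with a limit of windowSize * 2.
    static Result compare(List<byte[]> localRows, byte[] pageHash, int windowSize)
    {
        if (localRows.size() >= windowSize * 2)
            return Result.HUGE;     // far more local data than the repairing node sent a hash for
        if (Arrays.equals(hash(localRows), pageHash))
            return Result.MATCH;    // page is in sync, nothing to send back
        return Result.MISMATCH;     // reply with the rows so the repairing node can diff and write them
    }

    static byte[] hash(List<byte[]> rows)
    {
        try
        {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (byte[] row : rows)
                md.update(row);
            return md.digest();
        }
        catch (java.security.NoSuchAlgorithmException e)
        {
            throw new RuntimeException(e);
        }
    }
}
{code}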

Regarding page cache pollution, I think we should handle these as normal reads. First, the intention is to read through the data slowly, so we won't blow out all the 'real' pages in a short time; second, if there is a mismatch we will read the data twice within a very short time span, so the page cache should make the impact of the second read smaller.

*Discussion points:*
* How do we make repair invisible to the operator? (continuous process, 
calculate how many rows/s we need to repair etc)
* Can we handle gcgs in a better way with this?
* Can we avoid anticompaction? Do we really need it to be incremental? 
(probably, but having a single thread page through the entire dataset should be 
something the cluster can handle)

TODO:
* Properly handle the "huge" responses above - need to have a way for the 
remote paging to return {{UnfilteredPartitionIterator}}s
* Make it incremental - currently reads all data and puts the differences in 
the regular memtable. We could probably have a separate memtable that is 
flushed to a repaired sstable.
* Avoid breaking DTCS etc: since all mutations go into the same memtable, the flushed sstable will cover a big time window. The best solution to this would probably be to make flushing write to several sstables, as that would help with other DTCS issues as well (read repair, USING TIMESTAMP)
* Write tests / benchmarks / ...
* More metrics - we can get very accurate metrics on how much data actually differed during the repair

If anyone wants to try it out, there is a JMX method 
{{enableMutationBasedRepair}} on the {{ColumnFamilyStoreMBean}} to enable it 
(it pages through and repairs all the data once). 


[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2015-03-31 Thread Robert Coli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389420#comment-14389420
 ] 

Robert Coli commented on CASSANDRA-8911:


Mutation-based repair seems to help us in the case where we have to mark ranges in various sstables unrepaired due to bitrot (CASSANDRA-8703 et al): we get a CRC error on the read path and can know exactly which single-row-key-wide range needs to be repaired by a mutation. As I understand it, in the status quo we would have to re-repair all ranges in all sstables that we are marking unrepaired.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2015-03-05 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349135#comment-14349135
 ] 

Marcus Eriksson commented on CASSANDRA-8911:


And I think we could still do it incrementally by just iterating over the keys in the unrepaired sstables. If we do this, we would need a separate(?) memtable for these reads so that we can flush that data into a repaired sstable.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2015-03-05 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349113#comment-14349113
 ] 

T Jake Luciani commented on CASSANDRA-8911:
---

I am all for this idea, especially since it works around the wide-row issue. I've found in practice that just paging across all partitions at CL.ALL finishes faster than a full repair, and this approach will perform much better than that while also handling the wide-row case.
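
For reference, that CL.ALL sweep is just a paged full-table read at {{ConsistencyLevel.ALL}}; something like the following sketch with the DataStax Java driver (keyspace/table names are placeholders):

{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

// Pages across every partition at CL.ALL, letting read repair fix mismatches as it goes.
public class ReadRepairSweep
{
    public static void main(String[] args)
    {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect())
        {
            SimpleStatement stmt = new SimpleStatement("SELECT * FROM my_ks.my_table");
            stmt.setConsistencyLevel(ConsistencyLevel.ALL); // digests compared on every row, mismatches repaired
            stmt.setFetchSize(1000);                        // page through the table instead of one huge read
            ResultSet rs = session.execute(stmt);
            long rows = 0;
            for (Row row : rs)                              // the driver fetches subsequent pages transparently
                rows++;
            System.out.println("touched " + rows + " rows");
        }
    }
}
{code}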

Another issue we have to worry about (I think [~iamaleksey] mentioned it) is avoiding polluting the page cache.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2015-03-05 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349170#comment-14349170
 ] 

T Jake Luciani commented on CASSANDRA-8911:
---

I think this would also give us the ability to constantly repair in the background. We could use the hints log to determine which primary keys to repair.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2015-03-05 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349083#comment-14349083
 ] 

Tyler Hobbs commented on CASSANDRA-8911:


Jonathan's problematic scenario in that ticket assumes that we are building 
merkle trees in parallel across multiple replicas.  Instead, this would take a 
sequential approach, which allows the original/source node to determine a 
proper leaf size while hashing.



[jira] [Commented] (CASSANDRA-8911) Consider Mutation-based Repairs

2015-03-04 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14347886#comment-14347886
 ] 

Aleksey Yeschenko commented on CASSANDRA-8911:
--

FWIW we've looked into that before. See CASSANDRA-3362.
