[ https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15062333#comment-15062333 ]

Carl Yeksigian commented on CASSANDRA-6696:
-------------------------------------------

Forgot to add myself as a watcher, so I didn't see the comments.

{quote}
The idea is that we have one thread per disk writing, but I guess the thread count 
should be DatabaseDescriptor.getFlushWriters() per disk and the flushExecutor 
thread count should be 1 - we want to quickly hand off to the single 
flushExecutor when flushing and then run the per disk writing in the 
perDiskFlushExecutor. Do you have any other suggestion on how to model this?
{quote}
I think the change to have {{flushWriters}} per disk makes sense, but we should 
set the default to 1 instead of the number of disks; we should also update the 
comment in cassandra.yaml.
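For illustration, here is a minimal sketch of the executor layout being 
discussed (the class and field names are hypothetical; only 
{{DatabaseDescriptor.getFlushWriters()}} comes from the discussion): a 
single-threaded {{flushExecutor}} for the quick hand-off, plus one executor 
per data directory sized by the per-disk flush writer count.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FlushExecutorSketch
{
    // Single thread: flushes are handed off here quickly and in order.
    private final ExecutorService flushExecutor = Executors.newSingleThreadExecutor();

    // One executor per data directory, each sized by the per-disk flush
    // writer count, so total writer threads = disks * flushWriters.
    private final ExecutorService[] perDiskFlushExecutors;

    public FlushExecutorSketch(int diskCount, int flushWritersPerDisk)
    {
        perDiskFlushExecutors = new ExecutorService[diskCount];
        for (int i = 0; i < diskCount; i++)
            perDiskFlushExecutors[i] = Executors.newFixedThreadPool(flushWritersPerDisk);
    }

    // The flush is coordinated on flushExecutor; the actual sstable write
    // for each disk runs on that disk's own executor.
    public void flush(Runnable[] perDiskWrites)
    {
        flushExecutor.execute(() -> {
            for (int i = 0; i < perDiskWrites.length; i++)
                perDiskFlushExecutors[i].execute(perDiskWrites[i]);
        });
    }
}
{code}

With the proposed default of 1 flush writer per disk, the total number of 
writer threads equals the number of data directories.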

{quote}
We won't split incoming streams based on token range in CASSANDRA-10540 - the 
remote node will most likely already have sstables split based on its local 
ranges and those should match any ranges we own, so we can simply write it to 
disk, then the new sstable will get added to the correct compaction strategy 
(if it fits, otherwise it does a round in "L0")
{quote}
Makes sense to me.
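As a rough sketch of that placement (the types and names below are 
hypothetical, not the actual streaming code path): an incoming sstable whose 
tokens fall entirely inside one of our local ranges is added directly to that 
range's compaction strategy; otherwise it does a round in "L0".

{code:java}
public class SSTablePlacementSketch
{
    static final int L0_BUCKET = -1; // sentinel: sstable does a round in "L0"

    // 'localRanges' are the node's owned token ranges, each mapped to a
    // data directory / compaction strategy bucket. Longs stand in for
    // tokens; ranges are [start, end] inclusive.
    static int bucketFor(long first, long last, long[][] localRanges)
    {
        for (int i = 0; i < localRanges.length; i++)
        {
            // The sstable fits entirely in one of our ranges - the common
            // case when the remote node already split on its local ranges.
            if (first >= localRanges[i][0] && last <= localRanges[i][1])
                return i;
        }
        // Spans a boundary: let a round in "L0" sort it out.
        return L0_BUCKET;
    }
}
{code}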

{quote}
What do you mean? Splitter for a partition?
{quote}
Rereading that, it took me a while to figure out what I was trying to say: it 
was about {{sstableofflinerelevel}}. Looking over it again, I see that it 
handles the different ranges correctly, so we can ignore that.

Other than the slight changes around flush writers, the rest of it looks good 
to me.

> Partition sstables by token range
> ---------------------------------
>
>                 Key: CASSANDRA-6696
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: Marcus Eriksson
>              Labels: compaction, correctness, dense-storage, 
> jbod-aware-compaction, performance
>             Fix For: 3.2
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. This is also true for 
> corrupt sstables, where we delete the corrupt sstable and run repair. 
> Here is an example:
> Say we have 3 nodes A, B and C, with RF=3 and GC grace=10 days. 
> row=sankalp col=sankalp was written 20 days ago and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row/column 15 
> days ago. 
> Since this tombstone is older than gc grace, it got compacted away along with 
> the actual data on nodes A and B, so there is no trace of this row/column on 
> nodes A and B.
> Now on node C, say the original data is on drive1 and the tombstone is on 
> drive2. Compaction has not yet reclaimed the data and tombstone. 
> Drive2 becomes corrupt and is replaced with a new empty drive. 
> Due to the replacement, the tombstone is now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now, after replacing the drive, we run repair. This data will be propagated 
> to all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  


