[ https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987591#comment-13987591 ]

Benedict commented on CASSANDRA-6696:
-------------------------------------

bq. only merge with the individual L1s once the density of the relevant portion 
of L0 is > ~0.5 per vnode

I mean when the amount of data we would flush into the next level would, on 
average, equal 50% of the size limit of the lower level. But that is too high 
(see below).

bq. current default size is 160M

I was reading stale docs that set it at 5MB. Somewhere in between seems 
sensible, say 20MB? That way we'd get 1.6GB into 80 files; if we have 768 
vnodes and we set the ratio for flushing down into the lower level at 0.1, 
we'd _on average_ merge straight into L1, but in reality this would only 
happen for those vnodes with sufficient density, and those without would 
pause until sufficient data density had accumulated.
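
To make that density check concrete, here is a minimal sketch of the per-vnode decision, assuming hypothetical names (L0DensityCheck, VnodeRange, L1_TARGET_BYTES, FLUSH_RATIO) and the 160MB target / 0.1 ratio discussed above; it only illustrates the idea, it is not the actual compaction code:

{code:java}
import java.util.ArrayList;
import java.util.List;

class L0DensityCheck
{
    // Hypothetical target size of the L1 data covering one vnode's range.
    static final long L1_TARGET_BYTES = 160L * 1024 * 1024;
    // Ratio discussed above: merge once L0 holds >= 10% of the L1 target.
    static final double FLUSH_RATIO = 0.1;

    // Stand-in for a vnode's token range, with the L0 bytes overlapping it.
    static class VnodeRange
    {
        final long l0Bytes;
        VnodeRange(long l0Bytes) { this.l0Bytes = l0Bytes; }
    }

    // Returns the ranges whose accumulated L0 data is dense enough to merge
    // straight into L1; the others wait until more data appears for them.
    static List<VnodeRange> rangesReadyToMerge(List<VnodeRange> ranges)
    {
        List<VnodeRange> ready = new ArrayList<>();
        for (VnodeRange range : ranges)
            if (range.l0Bytes >= FLUSH_RATIO * L1_TARGET_BYTES)
                ready.add(range);
        return ready;
    }
}
{code}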

The only slight complication is what we do if we end up with files where one 
portion contains enough data to be merged into L1, but another portion is much 
too small to merge down efficiently. In this case I'd suggest simply remerging 
the data that would be inefficient to promote back into L0, until it hits our 
merge threshold (or is at least as large as the data already present in L1, 
if L1 is not very full). Alternatively we could, for simplicity, always merge 
as soon as the average for any file exceeds our threshold, but I'm not 
convinced this is a great strategy.
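
A minimal sketch of that fallback rule, again with purely hypothetical names (shouldPromote, MERGE_THRESHOLD_BYTES) and a 16MB threshold chosen only for illustration:

{code:java}
class PortionPromotionCheck
{
    // Hypothetical size at which merging a small L0 portion into L1 becomes
    // efficient enough to be worthwhile.
    static final long MERGE_THRESHOLD_BYTES = 16L * 1024 * 1024;

    // Promote the portion only once it reaches the merge threshold, or once
    // it is at least as large as the overlapping data already present in L1.
    static boolean shouldPromote(long bytesInL0Portion, long bytesAlreadyInL1)
    {
        return bytesInL0Portion >= MERGE_THRESHOLD_BYTES
               || bytesInL0Portion >= bytesAlreadyInL1;
    }
}
{code}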



> Drive replacement in JBOD can cause data to reappear. 
> ------------------------------------------------------
>
>                 Key: CASSANDRA-6696
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: sankalp kohli
>            Assignee: Marcus Eriksson
>             Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
> empty one and repair is run. 
> This can cause deleted data to come back in some cases. The same is true for 
> corrupt sstables, where we delete the corrupt sstable and run repair. 
> Here is an example:
> Say we have 3 nodes A, B and C with RF=3 and GC grace=10 days. 
> row=sankalp col=sankalp was written 20 days back and successfully went to all 
> three nodes. 
> Then a delete/tombstone was written successfully for the same row column 15 
> days back. 
> Since this tombstone is older than gc grace, it was purged on nodes A and B 
> when it was compacted together with the actual data. So there is no trace of 
> this row column in nodes A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. 
> Compaction has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and is replaced with a new empty drive. 
> Due to the replacement, the tombstone is now gone and row=sankalp col=sankalp 
> has come back to life. 
> Now after replacing the drive we run repair. This data will be propagated to 
> all nodes. 
> Note: This is still a problem even if we run repair every gc grace. 
>  


