[
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945631#comment-13945631
]
Benedict commented on CASSANDRA-6696:
-------------------------------------
Last thoughts for the day: the only major downside to this approach is that we
are now guaranteeing no better than single-disk performance for all operations
on a given partition. So particularly large and fragmented partitions could see
read performance decline notably. One possible solution would be to split by
clustering part (if any) instead of by partition key, but determine the
clustering range split as a function of the partition hash, so that the
distribution of data as a whole is still random (i.e. each partition has a
different clustering distribution across the disks). This would make the
initial flush more complex, and might require more merging on reads, but
compaction could still be easily constrained to one disk. This is just a
poorly formed thought I'm throwing out there for consideration, and possibly
outside the scope of this ticket.
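As a rough sketch of what I mean (illustrative only; the normalised
"clustering position", the disk-selection function and the disk count below
are assumptions, not anything in the actual patch):

{code:java}
/**
 * Illustrative sketch only: split a partition's rows across disks by
 * contiguous clustering ranges, but rotate which range lands on which disk as
 * a function of the partition hash, so that across many partitions the data
 * still spreads evenly over all disks.
 */
public class ClusteringSplitSketch
{
    private final int diskCount;

    public ClusteringSplitSketch(int diskCount)
    {
        this.diskCount = diskCount;
    }

    /**
     * @param partitionHash      token/hash of the partition key
     * @param clusteringPosition assumed normalised position of the clustering
     *                           prefix within its domain, in [0.0, 1.0)
     * @return index of the disk this row would be written to
     */
    public int diskFor(long partitionHash, double clusteringPosition)
    {
        // Per-partition offset in [0, 1): shifts the range boundaries
        // differently for every partition, keeping the overall data
        // distribution random across disks.
        long rotation = Math.floorMod(partitionHash, (long) diskCount);
        double offset = rotation / (double) diskCount;

        // Cut the shifted clustering space into diskCount equal, contiguous
        // ranges; each range maps to exactly one disk.
        double shifted = (clusteringPosition + offset) % 1.0;
        return (int) (shifted * diskCount);
    }

    public static void main(String[] args)
    {
        ClusteringSplitSketch splitter = new ClusteringSplitSketch(4);
        for (long partitionHash : new long[]{ 17L, 42L })
            for (double pos = 0.0; pos < 1.0; pos += 0.25)
                System.out.println("partition hash " + partitionHash
                                   + ", clustering position " + pos
                                   + " -> disk " + splitter.diskFor(partitionHash, pos));
    }
}
{code}

The point being that contiguous clustering ranges still map to a single disk
(so compaction of a range stays on one disk), while the per-partition rotation
keeps the overall data distribution even across disks.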
Either way, I'm not certain that splitting ranges based on disk size is such a
great idea. As a follow-on ticket it might be sensible to permit two categories
of disks: archive disks for slow, cold data, and live disks for faster data.
Splitting by capacity seems likely to create undesirable performance
characteristics, as two similarly performant disks with different capacities
would lead to worse performance for the data residing on the larger disk.
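To make the capacity point concrete, a back-of-the-envelope illustration with
made-up numbers (nothing measured):

{code:java}
/**
 * Back-of-the-envelope illustration of why splitting token ranges by capacity
 * can hurt: with uniformly distributed reads, a disk that owns more data
 * serves proportionally more requests, even if its IOPS are the same. The
 * capacities and request rate below are made-up numbers.
 */
public class CapacitySplitSketch
{
    public static void main(String[] args)
    {
        double[] capacityTb = { 1.0, 3.0 };   // two disks, same IOPS, different sizes
        double iopsPerDisk = 500;             // assumed identical performance
        double totalReadsPerSec = 800;        // reads hitting this node

        double totalCapacity = 0;
        for (double c : capacityTb)
            totalCapacity += c;

        for (int i = 0; i < capacityTb.length; i++)
        {
            // Capacity-weighted split: share of data (and hence of reads) is
            // proportional to disk size.
            double readsPerSec = totalReadsPerSec * capacityTb[i] / totalCapacity;
            double utilisation = readsPerSec / iopsPerDisk;
            System.out.printf("disk %d: %.0f reads/s, %.0f%% utilised%n",
                              i, readsPerSec, utilisation * 100);
        }
        // Output: disk 0 runs at ~40% utilisation while disk 1 runs at ~120%,
        // i.e. the larger disk saturates even though both perform identically.
    }
}
{code}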
On the whole I'm +1 on this change anyway, the more I think about it. I had
been vaguely considering something along these lines to optimise flush
performance, but it seems we can get that for free here along with improved
correctness, which is great.
> Drive replacement in JBOD can cause data to reappear.
> ------------------------------------------------------
>
> Key: CASSANDRA-6696
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: sankalp kohli
> Assignee: Marcus Eriksson
> Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new
> empty one and repair is run.
> This can cause deleted data to come back in some cases. This is also true for
> corrupt sstables, where we delete the corrupt sstable and run repair.
> Here is an example:
> Say we have 3 nodes A, B and C, with RF=3 and gc_grace=10 days.
> row=sankalp col=sankalp was written 20 days ago and successfully went to all
> three nodes.
> Then a delete/tombstone was written successfully for the same row/column 15
> days ago.
> Since this tombstone is older than gc_grace, it was purged during compaction
> on nodes A and B, along with the actual data it shadowed. So there is no
> trace of this row/column on nodes A and B.
> Now on node C, say the original data is on drive1 and the tombstone is on
> drive2. Compaction has not yet reclaimed the data and tombstone.
> Drive2 became corrupt and was replaced with a new empty drive.
> Due to the replacement, the tombstone is now gone and row=sankalp col=sankalp
> has come back to life.
> Now, after replacing the drive, we run repair. This data will be propagated
> to all nodes.
> Note: this is still a problem even if we run repair every gc_grace period.
>
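For concreteness, the gc_grace arithmetic behind the scenario above, as a
minimal sketch (not Cassandra's actual purge logic; the method and constants
are illustrative only):

{code:java}
import java.util.concurrent.TimeUnit;

/**
 * Minimal sketch of the gc_grace check in the scenario: a tombstone written 15
 * days ago with gc_grace = 10 days is eligible for purging at compaction,
 * which is why nodes A and B no longer hold it while node C still does.
 */
public class GcGraceSketch
{
    static boolean purgeable(long tombstoneWrittenAtMillis, long gcGraceSeconds, long nowMillis)
    {
        long ageSeconds = TimeUnit.MILLISECONDS.toSeconds(nowMillis - tombstoneWrittenAtMillis);
        return ageSeconds > gcGraceSeconds;
    }

    public static void main(String[] args)
    {
        long now = System.currentTimeMillis();
        long fifteenDaysAgo = now - TimeUnit.DAYS.toMillis(15);
        long gcGrace = TimeUnit.DAYS.toSeconds(10);

        // Prints true: the tombstone is older than gc_grace, so compaction may
        // drop it (together with the shadowed data) wherever both end up in
        // the same compaction.
        System.out.println(purgeable(fifteenDaysAgo, gcGrace, now));
    }
}
{code}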
--
This message was sent by Atlassian JIRA
(v6.2#6252)