[
https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169558#comment-13169558
]
Dominic Williams commented on CASSANDRA-3620:
---------------------------------------------
Sounds cool, but I don't understand how it works... You can't remove a
tombstone until you know all replicas have it, right? i.e. you need to be sure
all replicas have the tombstone, irrespective of whether repair has previously
completed?
> Proposal for distributed deletes - use "Reaper Model" rather than GCSeconds
> and scheduled repairs
> -------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-3620
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Dominic Williams
> Labels: GCSeconds, deletes, distributed_deletes,
> merkle_trees, repair
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> Here is a proposal for an improved system for handling distributed deletes.
> h2. The Problem
> There are various issues with repair:
> * Repair is expensive anyway
> * Repair jobs are often made more expensive than they should be by other
> issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
> * Repair processes can often fail and need restarting, for example in cloud
> environments where network issues make a node disappear
> from the ring for a brief moment
> * When you fail to run repair within GCSeconds, either by error or because of
> issues with Cassandra, data written to a node that did not see a later delete
> can reappear (and a node might miss a delete for several reasons including
> being down or simply dropping requests during load shedding)
> * If you cannot run repair and have to increase GCSeconds to prevent deleted
> data reappearing, in some cases the growing tombstone overhead can
> significantly degrade performance
> Because of the foregoing, in high throughput environments it can be very
> difficult to make repair a cron job. It can be preferable to keep a terminal
> open and run repair jobs one by one, making sure they succeed and keeping an
> eye on overall load to reduce system impact. This isn't desirable, and
> problems are exacerbated when there are lots of column families in a database
> or it is necessary to run a column family with a low GCSeconds to reduce
> tombstone load (because there are many write/deletes to that column family).
> The database owner must run repair within the GCSeconds window, or increase
> GCSeconds, to avoid potentially losing delete operations.
> It would be much better if there were no ongoing requirement to run repair to
> ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be
> an optional maintenance utility used in special cases, or to ensure ONE reads
> get consistent data.
> h2. "Reaper Model" Proposal
> # Tombstones do not expire, and there is no GCSeconds
> # Tombstones have associated ACK lists, which record the replicas that have
> acknowledged them
> # Tombstones are only deleted (or marked for compaction) when they have been
> acknowledged by all replicas
> # When a tombstone is deleted, it is added to a fast "relic" index of MD5
> hashes of cf-key-name[-subName]-ackList. The relic index makes it possible
> for a reaper to acknowledge a tombstone after it is deleted (a sketch of
> these structures follows this list)
> # Background "reaper" threads constantly stream ACK requests to other nodes,
> and stream ACK responses back for the requests they have received (throttling
> their CPU and bandwidth usage so as not to affect performance)
> # If a reaper receives a request to ACK a tombstone that does not exist, it
> creates the tombstone and adds an ACK for the requestor, and replies with an
> ACK
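> As a rough illustration only, the tombstone ACK list and relic index might
> look something like the following (class, field and method names here are
> hypothetical, not existing Cassandra code):
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.security.MessageDigest;
> import java.security.NoSuchAlgorithmException;
> import java.util.HashSet;
> import java.util.Set;
>
> class Tombstone {
>     final String cf;    // column family
>     final String key;   // row key
>     final String name;  // column name (plus optional subName)
>     final Set<String> ackList = new HashSet<String>(); // replicas that ACKed
>
>     Tombstone(String cf, String key, String name) {
>         this.cf = cf; this.key = key; this.name = name;
>     }
>
>     // A tombstone may only be purged once every replica has acknowledged it
>     boolean fullyAcked(Set<String> replicas) {
>         return ackList.containsAll(replicas);
>     }
> }
>
> class RelicIndex {
>     private final Set<String> relics = new HashSet<String>(); // hex MD5 hashes
>
>     // On purge, remember MD5(cf-key-name[-subName]-ackList) so a reaper can
>     // still acknowledge a tombstone after the tombstone itself is gone
>     void add(Tombstone t) throws NoSuchAlgorithmException {
>         String material = t.cf + "-" + t.key + "-" + t.name + "-" + t.ackList;
>         byte[] digest = MessageDigest.getInstance("MD5")
>                 .digest(material.getBytes(StandardCharsets.UTF_8));
>         StringBuilder hex = new StringBuilder();
>         for (byte b : digest) hex.append(String.format("%02x", b & 0xff));
>         relics.add(hex.toString());
>     }
>
>     boolean contains(String md5Hex) {
>         return relics.contains(md5Hex);
>     }
> }
> {code}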
> NOTES
> * The existence of entries in the relic index does not affect normal query
> performance
> * If a node goes down and comes back up after a configurable relic entry
> timeout, the worst that can happen is that a tombstone that hasn't received
> all its acknowledgements is re-created across the replicas when the reaper
> requests their acknowledgements (which is no big deal since this does not
> corrupt data; see the request-handling sketch after these notes)
> * Since early removal of entries in the relic index does not cause
> corruption, it can be kept small, or even kept in memory
> * Simple to implement and predictable
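> To make the re-creation behaviour concrete, here is a rough sketch of how a
> reaper might handle an incoming ACK request, reusing the hypothetical
> Tombstone and RelicIndex classes from the sketch above (again illustrative
> only, not a definitive design):
> {code:java}
> import java.util.Map;
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
>
> class Reaper {
>     private final Map<String, Tombstone> tombstones =
>             new ConcurrentHashMap<String, Tombstone>();
>     private final RelicIndex relics = new RelicIndex();
>
>     // Another replica asks us to ACK a tombstone. If we don't have it (for
>     // example because its relic entry timed out), re-create it and record
>     // the requestor's ACK; the worst case is a harmless re-created
>     // tombstone, never resurrected data.
>     String handleAckRequest(String id, String cf, String key, String name,
>                             String requestor) {
>         Tombstone t = tombstones.get(id);
>         if (t == null) {
>             t = new Tombstone(cf, key, name);
>             tombstones.put(id, t);
>         }
>         t.ackList.add(requestor);
>         return "ACK"; // sent back so the requestor records our acknowledgement
>     }
>
>     // Once every replica has ACKed, purge the tombstone but keep a relic entry
>     void purge(String id, Set<String> replicas) throws Exception {
>         Tombstone t = tombstones.get(id);
>         if (t != null && t.fullyAcked(replicas)) {
>             relics.add(t);
>             tombstones.remove(id);
>         }
>     }
> }
> {code}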
> h3. Planned Benefits
> * Operations are fine-grained (interrupting a reaper is not an issue)
> * The labour & administration overhead associated with running repair can be
> removed
> * Reapers can utilize "spare" cycles and run constantly in background to
> prevent the load spikes and performance issues associated with repair
> * There will no longer be the threat of corruption if repair can't be run for
> some reason (for example because of a new adopter's lack of Cassandra
> expertise, a cron script failing, or Cassandra bugs preventing repair from
> being run, etc.)
> * Deleting tombstones earlier, thereby reducing the number involved in query
> processing, will often dramatically improve performance