[
https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594359#comment-13594359
]
Aleksey Yeschenko commented on CASSANDRA-3620:
----------------------------------------------
(referring to "extend the coordinator's ack-wait callback" idea here, not the
"tracking full-repair" one, which is blocked by CASSANDRA-2405 atm and may or
may not be possible)
> Proposal for distributed deletes - fully automatic "Reaper Model" rather than
> GCSeconds and manual repairs
> ----------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-3620
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Dominic Williams
> Assignee: Aleksey Yeschenko
> Labels: GCSeconds,, deletes,, distributed_deletes,,
> merkle_trees, repair,
> Fix For: 2.0
>
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> Proposal for an improved system for handling distributed deletes, which
> removes the requirement to regularly run repair processes to maintain
> performance and data integrity.
> h2. The Problem
> There are various issues with repair:
> * Repair is expensive to run
> * Repair jobs are often made more expensive than they should be by other
> issues (nodes dropping requests, hinted handoff not working, downtime etc)
> * Repair processes can often fail and need restarting, for example in cloud
> environments where network issues make a node disappear from the ring for a
> brief moment
> * When you fail to run repair within GCSeconds, either by error or because of
> issues with Cassandra, data written to a node that did not see a later delete
> can reappear (and a node might miss a delete for several reasons including
> being down or simply dropping requests during load shedding)
> * If you cannot run repair and have to increase GCSeconds to prevent deleted
> data reappearing, in some cases the growing tombstone overhead can
> significantly degrade performance
> Because of the foregoing, in high throughput environments it can be very
> difficult to make repair a cron job. It can be preferable to keep a terminal
> open and run repair jobs one by one, making sure they succeed and keeping and
> eye on overall load to reduce system impact. This isn't desirable, and
> problems are exacerbated when there are lots of column families in a database
> or it is necessary to run a column family with a low GCSeconds to reduce
> tombstone load (because there are many write/deletes to that column family).
> The database owner must run repair within the GCSeconds window, or increase
> GCSeconds, to avoid potentially losing delete operations.
> It would be much better if there was no ongoing requirement to run repair to
> ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be
> an optional maintenance utility used in special cases, or to ensure ONE reads
> get consistent data.
> h2. "Reaper Model" Proposal
> # Tombstones do not expire, and there is no GCSeconds
> # Tombstones have associated ACK lists, which record the replicas that have
> acknowledged them
> # Tombstones are deleted (or marked for compaction) when they have been
> acknowledged by all replicas
> # When a tombstone is deleted, it is added to a "relic" index. The relic
> index makes it possible for a reaper to acknowledge a tombstone after it is
> deleted
> # The ACK lists and relic index are held in memory for speed
> # Background "reaper" threads constantly stream ACK requests to other nodes,
> and stream back ACK responses back to requests they have received (throttling
> their usage of CPU and bandwidth so as not to affect performance)
> # If a reaper receives a request to ACK a tombstone that does not exist, it
> creates the tombstone and adds an ACK for the requestor, and replies with an
> ACK. This is the worst that can happen, and does not cause data corruption.
> ADDENDUM
> The proposal to hold the ACK and relic lists in memory was added after the
> first posting. Please see comments for full reasons. Furthermore, a proposal
> for enhancements to repair was posted to comments, which would cause
> tombstones to be scavenged when repair completes (the author had assumed this
> was the case anyway, but it seems at time of writing they are only scavenged
> during compaction on GCSeconds timeout). The proposals are not exclusive and
> this proposal is extended to include the possible enhancements to repair
> described.
> NOTES
> * If a node goes down for a prolonged period, the worst that can happen is
> that some tombstones are recreated across the cluster when it restarts, which
> does not corrupt data (and this will only occur with a very small number of
> tombstones)
> * The system is simple to implement and predictable
> * With the reaper model, repair would become an optional process for
> optimizing the database to increase the consistency seen by
> ConsistencyLevel.ONE reads, and for fixing up nodes, for example after an
> sstable was lost
> h3. Planned Benefits
> * Reaper threads can utilize "spare" cycles to constantly scavenge tombstones
> in the background thereby greatly reducing tombstone load, improving query
> performance, reducing the system resources needed by processes such as
> compaction, and making performance generally more predictable
> * The reaper model means that GCSeconds is no longer necessary, which removes
> the threat of data corruption if repair can't be run successfully within that
> period (for example if repair can't be run because of a new adopter's lack of
> Cassandra expertise, a cron script failing, or Cassandra bugs or other
> technical issues)
> * Reaper threads are fully automatic, work in the background and perform
> finely grained operations where interruption has little effect. This is much
> better for database administrators than having to manually run and manage
> repair, whether for the purposes of preventing data corruption or for
> optimizing performance, which in addition to wasting operator time also often
> creates load spikes and has to be restarted after failure.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira