[ 
https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Williams updated CASSANDRA-3620:
----------------------------------------

    Description: 
Proposal for an improved system for handling distributed deletes, which removes 
the requirement to regularly run repair processes to maintain performance and 
data integrity. 

h2. The Problem

There are various issues with repair:

* Repair is expensive to run
* Repair jobs are often made more expensive than they should be by other issues 
(nodes dropping requests, hinted handoff not working, downtime etc)
* Repair processes can often fail and need restarting, for example in cloud 
environments where network issues make a node disappear from the ring for a 
brief moment
* When you fail to run repair within GCSeconds, either by error or because of 
issues with Cassandra, data written to a node that did not see a later delete 
can reappear (and a node might miss a delete for several reasons including 
being down or simply dropping requests during load shedding)
* If you cannot run repair and have to increase GCSeconds to prevent deleted 
data reappearing, in some cases the growing tombstone overhead can 
significantly degrade performance

Because of the foregoing, in high throughput environments it can be very 
difficult to make repair a cron job. It can be preferable to keep a terminal 
open and run repair jobs one by one, making sure they succeed and keeping and 
eye on overall load to reduce system impact. This isn't desirable, and problems 
are exacerbated when there are lots of column families in a database or it is 
necessary to run a column family with a low GCSeconds to reduce tombstone load 
(because there are many write/deletes to that column family). The database 
owner must run repair within the GCSeconds window, or increase GCSeconds, to 
avoid potentially losing delete operations. 

It would be much better if there was no ongoing requirement to run repair to 
ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be an 
optional maintenance utility used in special cases, or to ensure ONE reads get 
consistent data. 

h2. "Reaper Model" Proposal

# Tombstones do not expire, and there is no GCSeconds
# Tombstones have associated ACK lists, which record the replicas that have 
acknowledged them
# Tombstones are deleted (or marked for compaction) when they have been 
acknowledged by all replicas
# When a tombstone is deleted, it is added to a "relic" index. The relic index 
makes it possible for a reaper to acknowledge a tombstone after it is deleted
# The ACK lists and relic index are held in memory for speed
# Background "reaper" threads constantly stream ACK requests to other nodes, 
and stream back ACK responses back to requests they have received (throttling 
their usage of CPU and bandwidth so as not to affect performance)
# If a reaper receives a request to ACK a tombstone that does not exist, it 
creates the tombstone and adds an ACK for the requestor, and replies with an 
ACK. This is the worst that can happen, and does not cause data corruption. 

ADDENDUM

The proposal to hold the ACK and relic lists in memory was added after the 
first posting. Please see comments for full reasons. Furthermore, a proposal 
for enhancements to repair was posted to comments, which would cause tombstones 
to be scavenged when repair completes (the author had assumed this was the case 
anyway, but it seems at time of writing they are only scavenged during 
compaction on GCSeconds timeout). The proposals are not exclusive and this 
proposal is extended to include the possible enhancements to repair described.

NOTES

* If a node goes down for a prolonged period, the worst that can happen is that 
some tombstones are recreated across the cluster when it restarts, which does 
not corrupt data (and this will only occur with a very small number of 
tombstones)
* The system is simple to implement and predictable 
* With the reaper model, repair would become an optional process for optimizing 
the database to increase the consistency seen by ConsistencyLevel.ONE reads, 
and for fixing up nodes, for example after an sstable was lost

h3. Planned Benefits

* Reaper threads can utilize "spare" cycles to constantly scavenge tombstones 
in the background thereby greatly reducing tombstone load, improving query 
performance, reducing the system resources needed by processes such as 
compaction, and making performance generally more predictable 
* The reaper model means that GCSeconds is no longer necessary, which removes 
the threat of data corruption if repair can't be run successfully within that 
period (for example if repair can't be run because of a new adopter's lack of 
Cassandra expertise, a cron script failing, or Cassandra bugs or other 
technical issues)
* Reaper threads are fully automatic, work in the background and perform finely 
grained operations where interruption has little effect. This is much better 
for database administrators than having to manually run and manage repair, 
whether for the purposes of preventing data corruption or for optimizing 
performance, which in addition to wasting operator time also often creates load 
spikes and has to be restarted after failure.  

  was:
Proposal for an improved system for handling distributed deletes, which removes 
the requirement to run repair regular processes to maintain performance and 
data integrity. 

h2. The Problem

There are various issues with repair:

* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues 
(nodes dropping requests, hinted handoff not working, downtime etc)
* Repair processes can often fail and need restarting, for example in cloud 
environments where network issues make a node disappear 
from the ring for a brief moment
* When you fail to run repair within GCSeconds, either by error or because of 
issues with Cassandra, data written to a node that did not see a later delete 
can reappear (and a node might miss a delete for several reasons including 
being down or simply dropping requests during load shedding)
* If you cannot run repair and have to increase GCSeconds to prevent deleted 
data reappearing, in some cases the growing tombstone overhead can 
significantly degrade performance

Because of the foregoing, in high throughput environments it can be very 
difficult to make repair a cron job. It can be preferable to keep a terminal 
open and run repair jobs one by one, making sure they succeed and keeping and 
eye on overall load to reduce system impact. This isn't desirable, and problems 
are exacerbated when there are lots of column families in a database or it is 
necessary to run a column family with a low GCSeconds to reduce tombstone load 
(because there are many write/deletes to that column family). The database 
owner must run repair within the GCSeconds window, or increase GCSeconds, to 
avoid potentially losing delete operations. 

It would be much better if there was no ongoing requirement to run repair to 
ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be an 
optional maintenance utility used in special cases, or to ensure ONE reads get 
consistent data. 

h2. "Reaper Model" Proposal

# Tombstones do not expire, and there is no GCSeconds
# Tombstones have associated ACK lists, which record the replicas that have 
acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been 
acknowledged by all replicas
# When a tombstone is deleted, it is added to a fast "relic" index of MD5 
hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for 
a reaper to acknowledge a tombstone after it is deleted
# Background "reaper" threads constantly stream ACK requests to other nodes, 
and stream back ACK responses back to requests they have received (throttling 
their usage of CPU and bandwidth so as not to affect performance)
# If a reaper receives a request to ACK a tombstone that does not exist, it 
creates the tombstone and adds an ACK for the requestor, and replies with an 
ACK 

NOTES

* The existence of entries in the relic index do not affect normal query 
performance
* If a node goes down, and comes up after a configurable relic entry timeout, 
the worst that can happen is that a tombstone that hasn't received all its 
acknowledgements is re-created across the replicas when the reaper requests 
their acknowledgements (which is no big deal since this does not corrupt data)
* Since early removal of entries in the relic index does not cause corruption, 
it can be kept small, or even kept in memory
* Simple to implement and predictable 

h3. Planned Benefits

* Operations are finely grained (reaper interruption is not an issue)
* The labour & administration overhead associated with running repair can be 
removed
* Reapers can utilize "spare" cycles and run constantly in background to 
prevent the load spikes and performance issues associated with repair
* There will no longer be the threat of corruption if repair can't be run for 
some reason (for example because of a new adopter's lack of Cassandra 
expertise, a cron script failing, or Cassandra bugs preventing repair being run 
etc)
* Deleting tombstones earlier, thereby reducing the number involved in query 
processing, will often dramatically improve performance



    
> Proposal for distributed deletes - fully automatic "Reaper Model" rather than 
> GCSeconds and manual repairs
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3620
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Dominic Williams
>              Labels: GCSeconds,, deletes,, distributed_deletes,, 
> merkle_trees, repair,
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Proposal for an improved system for handling distributed deletes, which 
> removes the requirement to regularly run repair processes to maintain 
> performance and data integrity. 
> h2. The Problem
> There are various issues with repair:
> * Repair is expensive to run
> * Repair jobs are often made more expensive than they should be by other 
> issues (nodes dropping requests, hinted handoff not working, downtime etc)
> * Repair processes can often fail and need restarting, for example in cloud 
> environments where network issues make a node disappear from the ring for a 
> brief moment
> * When you fail to run repair within GCSeconds, either by error or because of 
> issues with Cassandra, data written to a node that did not see a later delete 
> can reappear (and a node might miss a delete for several reasons including 
> being down or simply dropping requests during load shedding)
> * If you cannot run repair and have to increase GCSeconds to prevent deleted 
> data reappearing, in some cases the growing tombstone overhead can 
> significantly degrade performance
> Because of the foregoing, in high throughput environments it can be very 
> difficult to make repair a cron job. It can be preferable to keep a terminal 
> open and run repair jobs one by one, making sure they succeed and keeping and 
> eye on overall load to reduce system impact. This isn't desirable, and 
> problems are exacerbated when there are lots of column families in a database 
> or it is necessary to run a column family with a low GCSeconds to reduce 
> tombstone load (because there are many write/deletes to that column family). 
> The database owner must run repair within the GCSeconds window, or increase 
> GCSeconds, to avoid potentially losing delete operations. 
> It would be much better if there was no ongoing requirement to run repair to 
> ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be 
> an optional maintenance utility used in special cases, or to ensure ONE reads 
> get consistent data. 
> h2. "Reaper Model" Proposal
> # Tombstones do not expire, and there is no GCSeconds
> # Tombstones have associated ACK lists, which record the replicas that have 
> acknowledged them
> # Tombstones are deleted (or marked for compaction) when they have been 
> acknowledged by all replicas
> # When a tombstone is deleted, it is added to a "relic" index. The relic 
> index makes it possible for a reaper to acknowledge a tombstone after it is 
> deleted
> # The ACK lists and relic index are held in memory for speed
> # Background "reaper" threads constantly stream ACK requests to other nodes, 
> and stream back ACK responses back to requests they have received (throttling 
> their usage of CPU and bandwidth so as not to affect performance)
> # If a reaper receives a request to ACK a tombstone that does not exist, it 
> creates the tombstone and adds an ACK for the requestor, and replies with an 
> ACK. This is the worst that can happen, and does not cause data corruption. 
> ADDENDUM
> The proposal to hold the ACK and relic lists in memory was added after the 
> first posting. Please see comments for full reasons. Furthermore, a proposal 
> for enhancements to repair was posted to comments, which would cause 
> tombstones to be scavenged when repair completes (the author had assumed this 
> was the case anyway, but it seems at time of writing they are only scavenged 
> during compaction on GCSeconds timeout). The proposals are not exclusive and 
> this proposal is extended to include the possible enhancements to repair 
> described.
> NOTES
> * If a node goes down for a prolonged period, the worst that can happen is 
> that some tombstones are recreated across the cluster when it restarts, which 
> does not corrupt data (and this will only occur with a very small number of 
> tombstones)
> * The system is simple to implement and predictable 
> * With the reaper model, repair would become an optional process for 
> optimizing the database to increase the consistency seen by 
> ConsistencyLevel.ONE reads, and for fixing up nodes, for example after an 
> sstable was lost
> h3. Planned Benefits
> * Reaper threads can utilize "spare" cycles to constantly scavenge tombstones 
> in the background thereby greatly reducing tombstone load, improving query 
> performance, reducing the system resources needed by processes such as 
> compaction, and making performance generally more predictable 
> * The reaper model means that GCSeconds is no longer necessary, which removes 
> the threat of data corruption if repair can't be run successfully within that 
> period (for example if repair can't be run because of a new adopter's lack of 
> Cassandra expertise, a cron script failing, or Cassandra bugs or other 
> technical issues)
> * Reaper threads are fully automatic, work in the background and perform 
> finely grained operations where interruption has little effect. This is much 
> better for database administrators than having to manually run and manage 
> repair, whether for the purposes of preventing data corruption or for 
> optimizing performance, which in addition to wasting operator time also often 
> creates load spikes and has to be restarted after failure.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to