[ 
https://issues.apache.org/jira/browse/CASSANDRA-10727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sharvanath Pathak updated CASSANDRA-10727:
------------------------------------------
    Description: 
There have been proposals for getting rid of the GC grace seconds, and 
automating the GC of tombstones by waiting for acks from all the nodes about 
the receipt of the tombstone. 
1. CASSANDRA-3620
2. CASSANDRA-6192

This mechanism has two major benefits in my opinion:
* Since the GC of tombstones can be much more aggressive, it minimizes the 
number of tombstones in the system, thereby increasing the performance of read 
operations.
* It eliminates the possibility of resurrection of keys in case a node comes up 
after being down for more than GC grace seconds.

As per CASSANDRA-3620, the main issue with the proposal seems to be its 
potential race with hinted handoff. There seems to be a good solution to that 
race. 

The solution is essentially to record the hint locations. Before writing any 
hint, we write a record on the alive replicas saying that a hint was written at 
such-and-such node. The GC will then wait for an ack from all the nodes, and 
also for all the related hints to be replayed and purged, before it clears the 
corresponding tombstone. 
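
A minimal sketch of the purge condition described above, in Python. The class 
and method names (TombstoneState, record_hint, can_purge, and so on) are 
hypothetical and are not Cassandra APIs; the sketch only illustrates that a 
tombstone becomes purgeable once every replica has acked it and no recorded 
hint for it is still outstanding.

```python
from dataclasses import dataclass, field


@dataclass
class TombstoneState:
    """Per-tombstone bookkeeping (hypothetical, for illustration only)."""
    replicas: frozenset                               # nodes that must ack
    acked: set = field(default_factory=set)           # replicas that acked
    pending_hints: set = field(default_factory=set)   # nodes holding hints

    def record_hint(self, hint_holder):
        # Written on the alive replicas BEFORE the hint itself is written.
        self.pending_hints.add(hint_holder)

    def hint_replayed(self, hint_holder):
        # The hint was replayed to its target and purged from hint_holder.
        self.pending_hints.discard(hint_holder)

    def ack(self, replica):
        self.acked.add(replica)

    def can_purge(self):
        # Purge only when all replicas acked and no hint is outstanding.
        return self.acked >= self.replicas and not self.pending_hints


state = TombstoneState(replicas=frozenset({"A", "B", "C"}))
state.record_hint("H")          # C is down, so node H will hold its hint
state.ack("A")
state.ack("B")
blocked = state.can_purge()     # False: C has not acked, hint still pending
state.hint_replayed("H")
state.ack("C")
purgeable = state.can_purge()   # True
```

Recording the hint location before writing the hint is what closes the race: 
the GC can never observe a state in which a hint exists but no record of it 
does.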

One potential problem with this scheme is that if the hints are written on the 
coordinator node, the same way they are right now, this process will have to 
wait for a large number of nodes to be up before the GC can be performed. 
However, this can easily be solved by writing the hints to a node that is 
determined by the key token. For instance, write the hint to the node that 
comes next after the replicas in the token ring. 
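
The placement rule can be sketched as follows. This is an illustration under 
stated assumptions, not Cassandra's replication code: the ring is a sorted list 
of (token, node) pairs with one token per node, replicas are the first rf nodes 
clockwise from the key token, and the function name replicas_and_hint_node is 
hypothetical.

```python
from bisect import bisect_left


def replicas_and_hint_node(ring, key_token, rf):
    """Return (replicas, hint_node) for a key.

    ring: list of (token, node) sorted by token, with more than rf nodes.
    The replicas are the first rf nodes clockwise from key_token; the hint
    node is the node that comes next after them on the ring.
    """
    if len(ring) <= rf:
        raise ValueError("ring must contain more than rf nodes")
    tokens = [token for token, _ in ring]
    # First node whose token is >= key_token, wrapping around the ring.
    start = bisect_left(tokens, key_token) % len(ring)
    walk = [ring[(start + i) % len(ring)][1] for i in range(rf + 1)]
    return walk[:rf], walk[rf]


ring = [(0, "n0"), (25, "n1"), (50, "n2"), (75, "n3"), (100, "n4")]
replicas, hint_node = replicas_and_hint_node(ring, key_token=30, rf=3)
# replicas == ["n2", "n3", "n4"], hint_node == "n0"
```

With this rule, the set of nodes the GC depends on for a key is the rf replicas 
plus one deterministic hint node, instead of any coordinator that happened to 
handle a write.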

Writing the hints in the way described in the last paragraph actually seems 
like a good idea anyway, because it minimizes the number of nodes that have to 
replay hints when a node comes up. The Dynamo paper actually describes this 
pattern for hinted handoffs as well. 

Lastly, the GC might also race with concurrent read repairs. However, this can 
be solved the same way, by recording the repairs in progress for a key and 
then aborting them before the GC is performed.
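
One way to picture the read-repair bookkeeping is a small registry keyed by 
partition key. Everything here (RepairRegistry, begin, finish, abort_all) is a 
hypothetical sketch under the assumption that a repair checks for abortion 
before applying its writes; it is not an existing Cassandra mechanism.

```python
class RepairRegistry:
    """Tracks in-progress read repairs per key (hypothetical sketch)."""

    def __init__(self):
        self._in_progress = {}   # key -> set of repair ids
        self._aborted = set()    # repair ids aborted by the GC

    def begin(self, key, repair_id):
        # Record the repair before it starts reading or writing.
        self._in_progress.setdefault(key, set()).add(repair_id)

    def finish(self, key, repair_id):
        """Return True if the repair may apply its writes, False if aborted."""
        self._in_progress.get(key, set()).discard(repair_id)
        if repair_id in self._aborted:
            self._aborted.discard(repair_id)
            return False
        return True

    def abort_all(self, key):
        """Called by the GC before purging the tombstone for this key."""
        ids = self._in_progress.pop(key, set())
        self._aborted |= ids
        return ids


registry = RepairRegistry()
registry.begin("k", 1)
aborted = registry.abort_all("k")           # GC aborts repair 1
applied_old = registry.finish("k", 1)       # False: must not write stale data
registry.begin("k", 2)
applied_new = registry.finish("k", 2)       # True: started after the GC
```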

  was:
There have been proposals for getting rid of the GC grace seconds, and 
automating the GC of tombstones by waiting for acks from all the nodes about 
the receipt of the tombstone. 
1. CASSANDRA-3620
2. CASSANDRA-6192

This mechanism has two major benefits in my opinion:
* Since the GC of tombstones can be much more aggressive, it minimizes the 
number of tombstones in the system, thereby increasing the performance of read 
operations.
* It eliminates the possibility of resurrection of keys in case a node comes up 
after being down for more than GC grace seconds.

As per CASSANDRA-3620, the main issue with the proposal seems to be its 
potential race with hinted handoff. There seems to be a good solution to that 
race. 

The solution is essentially to record the hint locations. Before writing any 
hint, we write a record saying that a hint was written at such-and-such node. 
The GC will then wait for an ack from all the nodes, and also for all the 
related hints to be replayed and purged, before it clears the corresponding 
tombstone. 

One potential problem with this scheme is that if the hints are written on the 
coordinator node, the same way they are right now, this process will have to 
wait for a large number of nodes to be up before the GC can be performed. 
However, this can easily be solved by writing the hints to a node that is 
determined by the key token. For instance, write the hint to the node that 
comes next after the replicas in the token ring. 

Writing the hints in the way described in the last paragraph actually seems 
like a good idea anyway, because it minimizes the number of nodes that have to 
replay hints when a node comes up. The Dynamo paper actually describes this 
pattern for hinted handoffs as well. 

Lastly, the GC might also race with concurrent read repairs. However, this can 
be solved the same way, by recording the repairs in progress for a key and 
then aborting them before the GC is performed.


> Solution for getting rid of GC grace seconds
> --------------------------------------------
>
>                 Key: CASSANDRA-10727
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10727
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sharvanath Pathak



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
