belliottsmith commented on PR #50:
URL: https://github.com/apache/cassandra-accord/pull/50#issuecomment-1633110193

   Here is a brief overview of state eviction, in case it helps. I will put 
some form of this in as javadoc.
   
   **ExclusiveSyncPoint**
   We first ensure we know every transaction that might be agreed with a lower t0 
than some point. We do this with an “ExclusiveSyncPoint”, a pseudo-transaction 
with its own t0. This pseudo-transaction conflicts in only one direction, taking 
every transaction as a dependency but not vice-versa (a rough sketch follows the 
list below).
   
   - On receipt of an ExclusiveSyncPoint, a replica records the 
ExclusiveSyncPoint's t0, and will in future respond to any PreAccept for a 
transaction with t0’ < t0 with a “reject” flag alongside its PreAccept (slow 
path) response
   - In its reply to the ExclusiveSyncPoint’s coordinator, the replica includes 
the set of transactions it has witnessed with t0’ < t0
   - A coordinator that receives a “reject” flag will propose that its command 
is not executed
   - An ExclusiveSyncPoint coordinator that has a quorum of replies from each 
participating shard knows that its dependencies are a superset of the 
transactions with t0’ < t0 that might yet be agreed to execute. It durably 
records these dependencies via the slow path. We do not assign a t, and do not 
filter the dependencies to remove those with a higher t. If recovered before it 
is durable, we may assign a different set of deps, but they will also fulfil 
the same superset criterion.
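
   To make the replica-side bookkeeping above concrete, here is a minimal 
sketch. None of these are the real accord types: the class and message names 
are made up, and t0 is simplified to a plain long.
   
   ```java
   import java.util.List;
   import java.util.NavigableMap;
   import java.util.TreeMap;
   
   // Illustrative only: a replica records the highest ExclusiveSyncPoint t0 it has
   // seen, replies with the transactions it witnessed below that t0, and attaches a
   // "reject" flag to PreAccept responses for lower timestamps.
   final class ReplicaSketch
   {
       record PreAcceptReply(boolean reject) {}
   
       // transactions this replica has witnessed, keyed by t0 (simplified to a long)
       private final NavigableMap<Long, String> witnessed = new TreeMap<>();
       // highest ExclusiveSyncPoint t0 this replica has received
       private long maxSyncPointT0 = Long.MIN_VALUE;
   
       // On receipt of an ExclusiveSyncPoint: record its t0 and reply with every
       // transaction witnessed with t0' < t0, so the coordinator's dependencies
       // become a superset of anything that might still be agreed below t0.
       List<String> onExclusiveSyncPoint(long t0)
       {
           maxSyncPointT0 = Math.max(maxSyncPointT0, t0);
           return List.copyOf(witnessed.headMap(t0, false).values());
       }
   
       // On PreAccept: witness the transaction as usual, but set the "reject" flag
       // if some ExclusiveSyncPoint with a higher t0 has already been received.
       PreAcceptReply onPreAccept(long t0, String txnId)
       {
           witnessed.put(t0, txnId);
           return new PreAcceptReply(t0 < maxSyncPointT0);
       }
   }
   ```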
   
   Now, any transaction with t0’ < t0 that attempts recovery and isn’t in the 
dependency set of the ExclusiveSyncPoint will know it cannot have reached 
consensus and will propose the command does not execute. This requires no 
special-casing, as this is what we do anyway for any normal transaction.
   
   We can take any durable ExclusiveSyncPoint and wait for its dependencies’ 
outcomes to be known at some majority (or more) of replicas. Depending on the 
model of the protocol this might simply mean they are all Stable. Since we 
execute a transaction at the coordinator, for us it means they are all 
PreApplied (i.e. the coordinator’s execution result is known to the replica; it 
may not have Applied because some dependency has not yet Applied). For 
simplicity, though, we wait until the ExclusiveSyncPoint itself has Applied, 
which implies its dependencies have also Applied, though this is a stronger 
property than we need.
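
   As a sketch of the simplification just described (illustrative names only; 
statuses reduced to a simple enum, and the check shown over the replies of a 
single shard rather than the exact condition):
   
   ```java
   import java.util.Collection;
   import java.util.Map;
   
   // Illustrative only: the "strict" condition waits for every dependency to be
   // PreApplied at a majority; the simplification waits for the ExclusiveSyncPoint
   // itself to be Applied, which implies the former.
   final class SyncPointReadySketch
   {
       enum Status { PreAccepted, Committed, Stable, PreApplied, Applied }
   
       static boolean knownAtMajority(Map<String, Status> repliesByReplica, int shardSize, Status atLeast)
       {
           long count = repliesByReplica.values().stream()
                                        .filter(s -> s.compareTo(atLeast) >= 0)
                                        .count();
           return count > shardSize / 2;
       }
   
       // strict form: every dependency is PreApplied at a majority of the shard
       static boolean dependenciesReady(Collection<Map<String, Status>> dependencyReplies, int shardSize)
       {
           return dependencyReplies.stream()
                                   .allMatch(replies -> knownAtMajority(replies, shardSize, Status.PreApplied));
       }
   
       // simplified form: the ExclusiveSyncPoint itself has Applied
       static boolean syncPointReady(Map<String, Status> syncPointReplies, int shardSize)
       {
           return knownAtMajority(syncPointReplies, shardSize, Status.Applied);
       }
   }
   ```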
   
   We then mark all of the participating keys (or ranges of keys) as durable up 
to t0. It is also possible for a transaction coordinator to mark its 
transaction as durable once it receives a majority of acknowledgements. We 
disseminate this information (eventually) to all replicas of all shards, so 
each shard has some bound on what is known to be durable across the whole 
system.
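
   A minimal sketch of the resulting durability bound a shard might keep (the 
real code tracks ranges and richer timestamps; these names are purely 
illustrative):
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   // Illustrative only: per participating key, everything with a lower t0 is known
   // to be durable once the covering ExclusiveSyncPoint has applied.
   final class DurableBeforeSketch
   {
       private final Map<String, Long> durableBefore = new HashMap<>();
   
       // record that all transactions with t0' < t0 on this key are durable
       void markDurable(String key, long t0)
       {
           durableBefore.merge(key, t0, Math::max);
       }
   
       // is this (key, t0) already covered by a durable bound?
       boolean isDurable(String key, long t0)
       {
           return t0 < durableBefore.getOrDefault(key, Long.MIN_VALUE);
       }
   }
   ```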
   
   **Garbage Collection**
   There are a few states that are precursors to garbage collection:
   - Shard Durable: a majority of replicas from a given shard have durably 
recorded its outcome
   - Globally Durable: a majority of replicas from all participating shards 
have durably recorded its outcome
   - Redundant: a transaction is shard-durable, and the shard has some other 
committed transaction that records it as a dependency
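
   A rough sketch of these precursor states as an enum, with an entirely 
illustrative way of deducing them (not how the real code derives them):
   
   ```java
   // Illustrative only: the precursor states above, plus NotDurable, deduced from
   // hypothetical facts a replica might know about a command.
   final class DurabilitySketch
   {
       enum State { NotDurable, ShardDurable, Redundant, GloballyDurable }
   
       static State deduce(boolean shardMajorityDurable,
                           boolean dependedOnByCommittedTxn,
                           boolean allShardMajoritiesDurable)
       {
           if (allShardMajoritiesDurable)
               return State.GloballyDurable;
           if (shardMajorityDurable && dependedOnByCommittedTxn)
               return State.Redundant;  // shard-durable and recorded as a dependency elsewhere
           if (shardMajorityDurable)
               return State.ShardDurable;
           return State.NotDurable;
       }
   }
   ```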
   
   Each of these states enables a corresponding GC phase:
   - Shard Durable: we partially truncate, expunging everything except what 
other shards need in order to execute (e.g. the result of some boolean 
expression that could no longer be evaluated)
   - Redundant: we stop tracking it as a conflict for dependencies. A 
transaction that durably depends upon it will be returned in its place - this 
might be an ExclusiveSyncPoint that has applied at a majority, so depending 
upon its execution will not slow anything down (it is not adopted as a 
dependency on any other shards it may overlap)
   - Globally Durable: we fully truncate the transaction, expunging everything 
except the fact that it was applied/invalidated
   - Globally Durable via one or more ExclusiveSyncPoints: we erase the state 
entirely. The ExclusiveSyncPoint’s keys and t0 provide a durable record of the 
transaction’s logical truncation.
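
   And a sketch of the corresponding truncation decision; the enum values and 
method here are illustrative, not the real accord cleanup types:
   
   ```java
   // Illustrative only: what a replica retains at each GC phase described above.
   // (A Redundant command additionally stops being tracked as a conflict for
   // dependencies; the committed command that depends on it is returned instead.)
   final class TruncationSketch
   {
       enum Retain
       {
           Everything,          // no GC yet
           NeededByOtherShards, // partially truncated: only what other shards still need to execute
           OutcomeOnly,         // truncated: only the fact it was applied/invalidated
           Nothing              // erased: covered by durable ExclusiveSyncPoints
       }
   
       static Retain truncate(boolean shardDurable, boolean globallyDurable, boolean coveredByDurableSyncPoints)
       {
           if (globallyDurable && coveredByDurableSyncPoints)
               return Retain.Nothing;
           if (globallyDurable)
               return Retain.OutcomeOnly;
           if (shardDurable)
               return Retain.NeededByOtherShards;
           return Retain.Everything;
       }
   }
   ```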
   
   However, to minimise work, for Redundant, Partially Truncated and Truncated 
we try to ensure all non-faulty replicas of the relevant shard have recorded 
the outcome; and for Erased we target all non-faulty replicas of all 
participating shards. This isn’t a correctness requirement, but a practical 
concern. We don’t want to delete state that a healthy replica could use to 
catch up, because synchronising data stores is costly.
   
   **Complications**
   There are several complications, however, stemming from partial replication 
and invalidations (i.e. where we propose a no-op because the transaction was 
not found on some majority of replicas of some shard).
   
   Since erasing requires a global condition to be met, partial replication 
makes this difficult. So for the moment we cheat, and provide every shard with 
a minimal summary of all participating keys that is sufficient to locate every 
participating shard and to determine whether the transaction is globally 
durable via ExclusiveSyncPoints. However, we only distribute this summary on 
PreAccept, Commit and Recover, so it might not be known to all replicas.
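
   The minimal key summary might look roughly like the sketch below; this is a 
guess at its shape for illustration, not the real accord route types:
   
   ```java
   import java.util.List;
   import java.util.Set;
   import java.util.function.Function;
   import java.util.stream.Collectors;
   
   // Illustrative only: a compact per-transaction summary of participating keys,
   // sufficient to locate every participating shard.
   record KeySummarySketch(List<String> participatingKeys)
   {
       Set<Integer> participatingShards(Function<String, Integer> keyToShard)
       {
           return participatingKeys.stream().map(keyToShard).collect(Collectors.toSet());
       }
   
       // globally durable via ExclusiveSyncPoints if every participating shard
       // reports a durable sync point bound above the transaction's t0
       boolean globallyDurableViaSyncPoints(Function<Integer, Long> shardDurableBefore,
                                            Function<String, Integer> keyToShard,
                                            long txnT0)
       {
           return participatingShards(keyToShard).stream()
                                                 .allMatch(shard -> txnT0 < shardDurableBefore.apply(shard));
       }
   }
   ```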
   
   Invalidation also makes things difficult, because we may not know the 
participating keys. In this case we may have invalidated the transaction only 
for the keys we know of, but the record of the invalidation must survive until 
any other keys have also been invalidated. Otherwise we may not be able to 
distinguish between e.g. a successful fast-path execution and an invalidation 
when another shard comes to recover. So in this case we have to wait until we 
discover the full set of keys, or until all shards in the cluster are durable 
past the t0 of the invalidated transaction.
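
   A sketch of the resulting condition for discarding such an invalidation 
record (parameter names are hypothetical):
   
   ```java
   // Illustrative only: when may an invalidation whose full key set is unknown be erased?
   final class InvalidationGcSketch
   {
       static boolean mayErase(boolean fullKeySetKnownAndInvalidated,
                               long invalidatedT0,
                               long minDurableT0AcrossAllShards)
       {
           // either we have since learnt every participating key and invalidated them all,
           // or every shard in the cluster is durable past the invalidated t0, so no
           // recovery can later mistake the invalidation for a fast-path commit
           return fullKeySetKnownAndInvalidated
               || invalidatedT0 < minDurableT0AcrossAllShards;
       }
   }
   ```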
   
   Invalidation also makes catching up harder for replicas that did not apply 
an ExclusiveSyncPoint before it was marked durable. This is because, once 
erased, it is impossible to tell whether a transaction that has not been 
committed was Invalidated, or whether we are simply stale and never witnessed 
it. But we don’t need to maintain a separate set - we just need to check 
whether some durable ExclusiveSyncPoint that we are not aware of covers the 
transaction remotely, and if so treat ourselves as stale. This part has not yet 
been implemented.
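
   Since this part is not yet implemented, the following is only a guess at the 
shape such a check might take, with entirely hypothetical names:
   
   ```java
   // Hypothetical only: decide whether this replica should consider itself stale
   // for a command it cannot find locally.
   final class StaleCheckSketch
   {
       // highest t0 up to which this replica has itself applied durable ExclusiveSyncPoints
       private final long locallyAppliedSyncPointsBefore;
   
       StaleCheckSketch(long locallyAppliedSyncPointsBefore)
       {
           this.locallyAppliedSyncPointsBefore = locallyAppliedSyncPointsBefore;
       }
   
       // remoteDurableSyncPointsBefore: the highest t0 covered by durable
       // ExclusiveSyncPoints for the command's keys, as reported remotely
       boolean treatSelfAsStale(long uncommittedTxnT0, long remoteDurableSyncPointsBefore)
       {
           // a durable sync point we never applied covers the command, so we cannot
           // distinguish "Invalidated" from "never witnessed": mark ourselves stale
           return uncommittedTxnT0 < remoteDurableSyncPointsBefore
               && remoteDurableSyncPointsBefore > locallyAppliedSyncPointsBefore;
       }
   }
   ```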
   

