belliottsmith commented on PR #50:
URL: https://github.com/apache/cassandra-accord/pull/50#issuecomment-1633110193
Here is a brief overview of state eviction, in case it helps. I will put some form of this in as javadoc.

**ExclusiveSyncPoint**

We first ensure we know all transactions that might be agreed with a lower t0 than some point. We do this with an “ExclusiveSyncPoint”, a pseudo-transaction with its own t0. This pseudo-transaction conflicts in only one direction: it takes every transaction as a dependency, but not vice-versa.

- On receipt of an ExclusiveSyncPoint, a replica records the ExclusiveSyncPoint's t0, and will in future respond to preaccept for t0’ < t0 with a “reject” flag alongside its pre-accept (slow path) response
- In its reply to the ExclusiveSyncPoint’s coordinator, the replica includes the set of transactions it has witnessed with t0’ < t0
- A coordinator that receives a “reject” flag will propose that its command is not executed
- An ExclusiveSyncPoint coordinator that has a quorum of replies in each participating shard knows that its dependencies are a superset of those that might agree to execute. It durably records the dependencies via the slow path.

We do not assign a t, and do not filter the dependencies to remove those with a higher t. If the ExclusiveSyncPoint is recovered before it is durable, we may assign a different set of deps, but that set will also fulfil the superset criterion.

Now, any transaction with t0’ < t0 that attempts recovery and isn’t in the dependency set of the ExclusiveSyncPoint knows it cannot have reached consensus, and will propose that the command does not execute. This requires no special-casing, as this is what we do anyway for any normal transaction.

We can take any durable ExclusiveSyncPoint and wait for its dependencies’ outcomes to be known at some >= majority of replicas. Depending on the model of the protocol this might simply mean they are all Stable. Since we execute a transaction at the coordinator, for us it means they are all PreApplied (i.e. the coordinator’s execution result is known to the replica; it may not have Applied because some dependency has not yet Applied). For simplicity, though, we wait until the ExclusiveSyncPoint itself has Applied, which implies its dependencies have also; this is a stronger property than we need.

We then mark all of the participating keys (or ranges of keys) as durable up to t0. It is also possible for a transaction coordinator to mark its transaction as durable once it receives a majority of acknowledgements. We disseminate this information (eventually) to all replicas of all shards, so each shard has some bound on what is known to be durable across the whole system.
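To make the replica-side bookkeeping concrete, here is a minimal sketch. These are not the real Accord types: `ReplicaSketch`, the long-valued t0s and the String transaction ids are stand-ins, and per-key/range tracking is collapsed into a single bound for brevity.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

class ReplicaSketch
{
    // highest ExclusiveSyncPoint t0 this replica has accepted
    private long syncPointBound = Long.MIN_VALUE;

    // transactions this replica has witnessed, keyed by their t0
    private final TreeMap<Long, String> witnessed = new TreeMap<>();

    // On receipt of an ExclusiveSyncPoint with the given t0: record the bound, and reply
    // with every transaction witnessed with a lower t0, so that the coordinator's dependency
    // set is a superset of anything that might still agree to execute below that point
    Set<String> receiveExclusiveSyncPoint(long syncPointT0)
    {
        syncPointBound = Math.max(syncPointBound, syncPointT0);
        return Set.copyOf(witnessed.headMap(syncPointT0, false).values());
    }

    // On PreAccept: record the transaction and respond as normal, but attach a "reject" flag
    // if its t0 is below a previously recorded ExclusiveSyncPoint bound; a coordinator that
    // sees the flag will propose that its command is not executed
    PreAcceptReply preAccept(String txnId, long t0)
    {
        witnessed.put(t0, txnId);
        boolean reject = t0 < syncPointBound;
        return new PreAcceptReply(reject, List.copyOf(witnessed.headMap(t0, false).values()));
    }

    record PreAcceptReply(boolean reject, List<String> deps) {}
}
```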
**Garbage Collection**

There are a few states that are precursors to garbage collection:

- Shard Durable: a majority of replicas from a given shard have durably recorded its outcome
- Globally Durable: a majority of replicas from all participating shards have durably recorded its outcome
- Redundant: a transaction is shard-durable, and the shard has some other committed transaction that records it as a dependency

Given these states we enact a corresponding GC phase (roughly sketched at the end of this comment):

- Shard Durable: we partially truncate, expunging everything except what other shards need in order to execute (e.g. the result of some boolean expression that would no longer be possible to evaluate)
- Redundant: we stop tracking it as a conflict for dependencies. A transaction that durably depends upon it will be returned in its place - this might be an ExclusiveSyncPoint that has applied at a majority, so depending upon its execution will not slow down execution (it is not adopted as a dependency on any other shards it may overlap).
- Globally Durable: we fully truncate the transaction, expunging everything except the fact that the transaction was applied/invalidated
- Globally Durable via one or more ExclusiveSyncPoints: we erase the state entirely. The ExclusiveSyncPoints’ keys and t0 provide a durable record of the transaction’s logical truncation.

However, to minimise work, for Redundant, Partially Truncated and Truncated we try to ensure all non-faulty replicas of the relevant shard have recorded the outcome; and for Erased we target all non-faulty replicas of all participating shards. This isn’t a correctness requirement, but a practical concern: we don’t want to delete state that a healthy replica could use to catch up, because synchronising data stores is costly.

**Complications**

There are several complications, however, stemming from partial replication and invalidations (i.e. where we propose a no-op because the transaction was not found on some majority of replicas of some shard).

Since erasing requires a global condition to be met, partial replication makes this difficult. So for the moment we cheat: we provide every shard a minimal summary of all participating keys, which is sufficient to locate every participating shard and determine whether the transaction is globally durable via ExclusiveSyncPoints. However, we only distribute this on PreAccept, Commit and Recover, so it might not be known to all replicas.

Invalidation also makes things difficult, because we may not know the participating keys. In this case we may have invalidated the transaction only for the keys we know of, but the invalidation must survive until any other keys have been invalidated. Otherwise we may not be able to distinguish between e.g. a successful fast-path execution and an invalidation when the other shard comes to recover. So in this case we have to wait until we discover the full set of keys, or until all shards in the cluster are durable past the t0 of the invalidated transaction.

Invalidation also makes catching up harder for replicas that did not apply an ExclusiveSyncPoint before it was marked durable. This is because, once erased, it is impossible to tell whether a transaction that has not been committed was Invalidated, or whether we are simply stale and never witnessed it. But we don’t need to maintain a separate set - we just need to check whether there is some durable ExclusiveSyncPoint that covers the transaction remotely that we are not aware of, and treat ourselves as stale. This part has not yet been implemented.
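The last point - treating ourselves as stale - might look something like the following sketch. This check is not implemented yet, and the types are again simplified stand-ins rather than Accord’s (a single bound in place of per-range tracking).

```java
class StalenessSketch
{
    // highest ExclusiveSyncPoint t0 this replica has itself applied
    long locallyAppliedSyncPointBound;

    // highest ExclusiveSyncPoint t0 known to be durable somewhere in the cluster
    long remotelyDurableSyncPointBound;

    // A transaction below the remotely durable bound may already have been erased elsewhere;
    // if we never applied a sync point covering it ourselves, we cannot distinguish
    // "Invalidated" from "never witnessed", so we treat ourselves as stale and must catch up
    // by other means (e.g. synchronising the data store)
    boolean isLocallyStale(long txnT0)
    {
        return txnT0 < remotelyDurableSyncPointBound && txnT0 >= locallyAppliedSyncPointBound;
    }
}
```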
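And, for reference, the durability-state to eviction-action mapping from the Garbage Collection section above, again as an illustrative sketch rather than the actual enums in the patch:

```java
class EvictionSketch
{
    enum Durability { NOT_DURABLE, SHARD_DURABLE, GLOBALLY_DURABLE, GLOBALLY_DURABLE_VIA_SYNC_POINTS }
    enum Truncation { NONE, TRUNCATED_PARTIALLY, TRUNCATED, ERASED }

    // Redundant is handled separately: it governs whether we keep tracking the transaction
    // as a conflict for dependencies, rather than how much of its state we expunge
    static Truncation truncationFor(Durability durability)
    {
        switch (durability)
        {
            case SHARD_DURABLE:                    return Truncation.TRUNCATED_PARTIALLY; // keep only what other shards need to execute
            case GLOBALLY_DURABLE:                 return Truncation.TRUNCATED;           // keep only the applied/invalidated outcome
            case GLOBALLY_DURABLE_VIA_SYNC_POINTS: return Truncation.ERASED;              // the sync points' keys and t0 record the logical truncation
            default:                               return Truncation.NONE;
        }
    }
}
```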

