Sam Tunnicliffe created CASSANDRA-21455:
-------------------------------------------
Summary: Unable to catch up TCM Log from peer with gaps in log
sequence
Key: CASSANDRA-21455
URL: https://issues.apache.org/jira/browse/CASSANDRA-21455
Project: Apache Cassandra
Issue Type: Improvement
Components: Transactional Cluster Metadata
Reporter: Sam Tunnicliffe
Assignee: Jon Meredith
The triggering scenario for this is a non-CMS peer receiving a snapshot when
catching up from a peer/CMS node, which leaves a gap near the tail of its local
metadata log. If that node subsequently receives catchup requests itself,
the heuristic that prioritises sending recent log entries over recent snapshots
is counterproductive.
A node receives log entries with epochs A, B, misses entries C, D, E and
catches up from a peer, receiving a snapshot at epoch F.
Later, when another node asks this peer for its log since epoch B:
1. listSnapshotsSince(B) returns exactly 1 snapshot (at epoch F)
2. Since snapshotEpochs.size() <= 1, the code tries getEntries(B), which returns
an empty list (or entries only up to B with nothing after)
3. An empty entry list is vacuously isContinuous(), so the result is
LogState(null, []) without the snapshot
4. The requesting node is stuck - it keeps getting empty responses and never
advances past the gap
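The four steps above can be reproduced with a small toy model. This is only a sketch of the behaviour as described in this ticket, not the Cassandra code: epochs are plain longs (A=1, B=2, F=6), the reply is a string, and the class LogGapDemo and its fields are hypothetical. Only the method names getEntries, listSnapshotsSince, isContinuous and getLogState are borrowed from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy model only: hypothetical class, epochs are plain longs.
final class LogGapDemo {
    final NavigableSet<Long> entryEpochs = new TreeSet<>();
    final NavigableSet<Long> snapshotEpochs = new TreeSet<>();

    // Entries strictly after 'since', in epoch order.
    List<Long> getEntries(long since) {
        return new ArrayList<>(entryEpochs.tailSet(since, false));
    }

    // Snapshots strictly after 'since', in epoch order.
    List<Long> listSnapshotsSince(long since) {
        return new ArrayList<>(snapshotEpochs.tailSet(since, false));
    }

    // True when entries run since+1, since+2, ... with no holes.
    // An empty list passes vacuously, which is the crux of the problem.
    static boolean isContinuous(long since, List<Long> entries) {
        long expected = since + 1;
        for (long e : entries) {
            if (e != expected) return false;
            expected++;
        }
        return true;
    }

    // Models the heuristic as described: with at most one snapshot,
    // prefer sending entries alone if they look continuous.
    String getLogState(long since) {
        List<Long> snapshots = listSnapshotsSince(since);
        if (snapshots.size() <= 1) {
            List<Long> entries = getEntries(since);
            if (isContinuous(since, entries))
                return "snapshot=null entries=" + entries;
        }
        Long latest = snapshots.isEmpty() ? null : snapshots.get(snapshots.size() - 1);
        return "snapshot=" + latest + " entries=" + getEntries(latest == null ? since : latest);
    }

    public static void main(String[] args) {
        LogGapDemo peer = new LogGapDemo();
        peer.entryEpochs.add(1L);    // A
        peer.entryEpochs.add(2L);    // B
        peer.snapshotEpochs.add(6L); // F; entries C, D, E (3..5) were never stored
        System.out.println(peer.getLogState(2L)); // prints "snapshot=null entries=[]"
    }
}
```

In this model the requester asking for everything since B gets neither entries nor the snapshot at F, so repeating the request cannot make progress.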
Proposed fix is in LogReader.getLogState. After confirming entries are
continuous, check if there's a snapshot beyond the latest entry. If so, fall
through to serve the snapshot instead.
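A minimal sketch of how such a check might look, under the same toy assumptions (epochs as longs, hypothetical class LogStateFixSketch and Reply type). It is not the actual patch, and the real LogReader.getLogState signature and types will differ. The only change from the heuristic described above is the extra "snapshot beyond the latest entry" test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy model only: hypothetical names, not the Cassandra implementation.
final class LogStateFixSketch {
    // Hypothetical reply shape: snapshotEpoch is null when no snapshot is sent.
    static final class Reply {
        final Long snapshotEpoch;
        final List<Long> entries;
        Reply(Long snapshotEpoch, List<Long> entries) {
            this.snapshotEpoch = snapshotEpoch;
            this.entries = entries;
        }
    }

    // True when entries run since+1, since+2, ... with no holes (empty passes).
    static boolean isContinuous(long since, List<Long> entries) {
        long expected = since + 1;
        for (long e : entries) {
            if (e != expected) return false;
            expected++;
        }
        return true;
    }

    static Reply getLogState(long since,
                             NavigableSet<Long> entryEpochs,
                             NavigableSet<Long> snapshotEpochs) {
        List<Long> snapshots = new ArrayList<>(snapshotEpochs.tailSet(since, false));
        List<Long> entries = new ArrayList<>(entryEpochs.tailSet(since, false));
        if (snapshots.size() <= 1 && isContinuous(since, entries)) {
            long latestEntry = entries.isEmpty() ? since : entries.get(entries.size() - 1);
            // Added check: answer with entries alone only when no snapshot
            // lies beyond the latest entry we can serve.
            boolean snapshotBeyond = !snapshots.isEmpty() && snapshots.get(0) > latestEntry;
            if (!snapshotBeyond)
                return new Reply(null, entries);
        }
        if (snapshots.isEmpty())
            return new Reply(null, entries); // toy fallback: nothing better to offer
        // Fall through: serve the latest snapshot plus any entries after it.
        long latest = snapshots.get(snapshots.size() - 1);
        return new Reply(latest, new ArrayList<>(entryEpochs.tailSet(latest, false)));
    }

    public static void main(String[] args) {
        NavigableSet<Long> entries = new TreeSet<>(List.of(1L, 2L)); // A, B
        NavigableSet<Long> snapshots = new TreeSet<>(List.of(6L));   // F
        Reply r = getLogState(2L, entries, snapshots);
        System.out.println("snapshot=" + r.snapshotEpoch + " entries=" + r.entries);
    }
}
```

With the scenario from this ticket the sketch replies with the snapshot at F, so the requester can advance past the gap. When the stored entries already extend to or past the only snapshot, the entries-first preference is unchanged.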