[
https://issues.apache.org/jira/browse/CASSANDRA-21115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053698#comment-18053698
]
Paulo Motta commented on CASSANDRA-21115:
-----------------------------------------
Thanks for the review, [[email protected]]! I addressed the review comments
and updated the ticket description to reflect the two scenarios this can affect
(the bug does not only show up after node crashes; it also prevents other nodes
from starting repair in the first round).
https://pre-ci.cassandra.apache.org/job/cassandra/317/
> Initial auto-repairs can be skipped by too soon check
> -----------------------------------------------------
>
> Key: CASSANDRA-21115
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21115
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Consistency/Repair
> Reporter: Paulo Motta
> Assignee: Paulo Motta
> Priority: Normal
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> *Problem*
> When a repair history record is created, both repair_start_ts and
> repair_finish_ts are initialized to the same timestamp. tooSoonToRunRepair()
> reads repair_finish_ts and, if it falls within min_repair_interval,
> immediately returns "too soon" and aborts. As a result, myTurnToRunRepair()
> never executes, so both the turn-to-run check and the incomplete repair
> detection are skipped.
> *When this occurs*
> 1. Cross-node initialization: Node A calls insertNewRepairHistory() and
> creates a history record for Node B with start_ts = finish_ts = now(). When
> Node B attempts repair, it sees this fresh timestamp and incorrectly skips,
> thinking it just completed a repair.
> 2. First repair interruption: A node starts its first repair (updating
> start_ts) but crashes or fails before completion (finish_ts unchanged). After
> restart, tooSoonToRunRepair() sees the initialization timestamp in finish_ts
> and may skip the incomplete repair.
> *Fix*
> Add a check in tooSoonToRunRepair(): if repair_start_ts >= repair_finish_ts,
> the repair is either unstarted (the timestamps are equal) or incomplete
> (start_ts is later than finish_ts). Return false immediately to allow it to
> proceed, bypassing the interval check.
> *Impact*
> Nodes skip their initial repair attempts and wait unnecessarily until
> min_repair_interval elapses from record creation, delaying the first repair
> cycle and allowing data inconsistencies to accumulate.
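
For illustration only, the check described above can be sketched in Java as
follows. This is a simplified model, not the actual Cassandra code: the class
name TooSoonSketch, both method names, and the use of plain millisecond
timestamps are assumptions made for this example. Only the comparison of
repair_start_ts, repair_finish_ts and min_repair_interval comes from the
ticket description.

```java
// Hypothetical, simplified model of the check; not the Cassandra implementation.
final class TooSoonSketch {

    /** Behaviour described under "Problem": only repair_finish_ts is consulted. */
    static boolean tooSoonBeforeFix(long startTs, long finishTs,
                                    long nowMs, long minRepairIntervalMs) {
        return nowMs - finishTs < minRepairIntervalMs;
    }

    /** Behaviour described under "Fix": unstarted or incomplete repairs bypass the interval check. */
    static boolean tooSoonAfterFix(long startTs, long finishTs,
                                   long nowMs, long minRepairIntervalMs) {
        // start == finish: record was only initialized, no repair has run yet.
        // start >  finish: a repair started but never recorded a finish.
        if (startTs >= finishTs) {
            return false;
        }
        return nowMs - finishTs < minRepairIntervalMs;
    }
}
```

With a freshly initialized record (start_ts = finish_ts), the first method
reports "too soon" until min_repair_interval has elapsed, while the second lets
the repair proceed. A record whose finish_ts is later than its start_ts is
still subject to the interval check in both versions.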