[
https://issues.apache.org/jira/browse/CASSANDRA-21115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paulo Motta updated CASSANDRA-21115:
------------------------------------
Description:
*Problem*
When a repair history record is created, both repair_start_ts and
repair_finish_ts are initialized to the same timestamp. tooSoonToRunRepair()
reads repair_finish_ts and, if it falls within min_repair_interval of the
current time, immediately returns "too soon" and aborts. This prevents
myTurnToRunRepair() from executing at all, skipping both the turn-to-run check
and the incomplete repair detection.
*When this occurs*
1. Cross-node initialization: Node A calls insertNewRepairHistory() and
creates a history record for Node B with start_ts = finish_ts = now(). When
Node B attempts repair, it sees this fresh timestamp and incorrectly skips,
thinking it just completed a repair.
2. First repair interruption: A node starts its first repair (updating
start_ts) but crashes or fails before completion (finish_ts unchanged). After
restart, tooSoonToRunRepair() sees the initialization timestamp in finish_ts
and may skip the incomplete repair.
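The two scenarios above can be reproduced with a minimal sketch of the
pre-fix behavior. This is not the actual Cassandra code: the class name, the
method signature, the millisecond units and the 24h interval are all
assumptions made for illustration. Only the column names and the
finish_ts-only comparison come from the description.

```java
import java.util.concurrent.TimeUnit;

public class PreFixTooSoonSketch {
    // Hypothetical model of the pre-fix check: only repair_finish_ts is
    // consulted; repair_start_ts is accepted but ignored.
    static boolean tooSoonToRunRepair(long repairStartTs, long repairFinishTs,
                                      long nowMs, long minRepairIntervalMs) {
        return nowMs - repairFinishTs < minRepairIntervalMs;
    }

    public static void main(String[] args) {
        long interval = TimeUnit.HOURS.toMillis(24); // assumed min_repair_interval
        long created = 1_000_000L;                   // record creation time (ms)
        long now = created + 120_000L;               // two minutes later

        // Scenario 1: record created by another node, start_ts == finish_ts.
        System.out.println(tooSoonToRunRepair(created, created, now, interval));
        // prints true: the node skips a repair it never ran

        // Scenario 2: repair started (start_ts advanced) but never finished.
        System.out.println(tooSoonToRunRepair(created + 60_000L, created, now, interval));
        // prints true: the incomplete repair is skipped as well
    }
}
```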
*Fix*
Add a check in tooSoonToRunRepair(): if repair_start_ts >= repair_finish_ts,
the repair is either unstarted or incomplete. Return false immediately to allow
it to proceed, bypassing the interval check.
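A minimal sketch of the proposed check, under stated assumptions: the class
name, the method signature and the millisecond units are hypothetical and do
not reflect the actual patch. Only the start_ts >= finish_ts comparison and
the early "return false" come from the description.

```java
import java.util.concurrent.TimeUnit;

public class FixedTooSoonSketch {
    // Hypothetical model of tooSoonToRunRepair() with the proposed guard.
    static boolean tooSoonToRunRepair(long repairStartTs, long repairFinishTs,
                                      long nowMs, long minRepairIntervalMs) {
        // start == finish: record was only initialized, repair never started.
        // start >  finish: repair started but did not complete.
        // Either way the interval check does not apply.
        if (repairStartTs >= repairFinishTs) {
            return false;
        }
        // A repair completed at repairFinishTs: enforce the minimum interval.
        return nowMs - repairFinishTs < minRepairIntervalMs;
    }

    public static void main(String[] args) {
        long interval = TimeUnit.HOURS.toMillis(24); // assumed min_repair_interval
        long created = 1_000_000L;
        long now = created + 120_000L;

        // Freshly initialized record (start_ts == finish_ts).
        System.out.println(tooSoonToRunRepair(created, created, now, interval));
        // prints false

        // Interrupted first repair (start_ts advanced, finish_ts unchanged).
        System.out.println(tooSoonToRunRepair(created + 60_000L, created, now, interval));
        // prints false

        // Repair that completed one minute ago: the interval still applies.
        System.out.println(tooSoonToRunRepair(created, created + 60_000L, now, interval));
        // prints true
    }
}
```

Using >= rather than == covers both cases with one comparison, at the cost of
assuming that a completed repair always writes a finish_ts later than its
start_ts.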
*Impact*
Nodes skip their initial repair attempts and wait unnecessarily until
min_repair_interval elapses from record creation, delaying the first repair
cycle and allowing data inconsistencies to accumulate.
was:
*Problem*
When a repair history record is created, both repair_start_ts and
repair_finish_ts are initialized to the same timestamp. tooSoonToRunRepair()
reads repair_finish_ts and if it falls within min_repair_interval, immediately
returns "too soon" and aborts. This can prevent repair from starting or being
resumed.
*When this occurs*
1. Cross-node initialization: Node A calls insertNewRepairHistory() and
creates a history record for Node B with start_ts = finish_ts = now(). When
Node B attempts repair, it sees this fresh timestamp and incorrectly skips,
thinking it just completed a repair.
2. First repair interruption: Node starts its first repair (updating
start_ts) but crashes or fails before completion (finish_ts unchanged). After
restart, tooSoonToRunRepair() sees the initialization timestamp in finish_ts
and may skip the incomplete repair.
*Fix*
Add a check in tooSoonToRunRepair(): if repair_start_ts >= repair_finish_ts,
the repair is either unstarted or incomplete. Return false immediately to allow
it to proceed, bypassing the interval check.
*Impact*
Nodes skip their initial repair attempts and wait unnecessarily until
min_repair_interval elapses from record creation, delaying the first repair
cycle and allowing data inconsistencies to accumulate.
> Initial auto-repairs can be skipped by too soon check
> -----------------------------------------------------
>
> Key: CASSANDRA-21115
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21115
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Consistency/Repair
> Reporter: Paulo Motta
> Assignee: Paulo Motta
> Priority: Normal
> Time Spent: 0.5h
> Remaining Estimate: 0h