[ 
https://issues.apache.org/jira/browse/CASSANDRA-21115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-21115:
------------------------------------
    Description: 
*Problem*

When a repair history record is created, both repair_start_ts and 
repair_finish_ts are initialized to the same timestamp. tooSoonToRunRepair() 
reads repair_finish_ts and, if it falls within min_repair_interval, returns 
"too soon" immediately, so the repair attempt is aborted. As a result, the 
incomplete-repair detection in myTurnToRunRepair() (which checks 
repair_start_ts >= repair_finish_ts) never executes.

*When this occurs*

  1. Cross-node initialization: Node A calls insertNewRepairHistory() and 
creates a history record for Node B with start_ts = finish_ts = now(). When 
Node B attempts repair, it sees this fresh timestamp and incorrectly skips, 
thinking it just completed a repair.
  2. First repair interruption: A node starts its first repair (updating 
start_ts) but crashes or fails before completion (finish_ts unchanged). After 
restart, tooSoonToRunRepair() sees the initialization timestamp in finish_ts 
and, if min_repair_interval has not yet elapsed, skips the incomplete repair.
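
The two scenarios above can be illustrated with a minimal, self-contained sketch. It is not the actual Cassandra code: the class name, the method signature, the millisecond units and the 24-hour interval are all assumptions made for illustration. It only models a check that looks at finish_ts alone.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model of the behaviour described in this ticket.
// Names, signature and units are assumptions, not the real Cassandra API.
final class TooSoonCheckSketch {

    // Interval-only check: decides purely from finishTs and ignores startTs.
    static boolean tooSoonToRunRepair(long startTs, long finishTs, long nowMs, long minIntervalMs) {
        return nowMs - finishTs < minIntervalMs;
    }

    public static void main(String[] args) {
        long minInterval = TimeUnit.HOURS.toMillis(24);     // assumed value of min_repair_interval
        long created = 1_000_000L;                          // record creation time
        long now = created + TimeUnit.MINUTES.toMillis(5);  // five minutes later

        // Scenario 1: record created by another node, start_ts = finish_ts.
        System.out.println(tooSoonToRunRepair(created, created, now, minInterval)); // prints true

        // Scenario 2: repair started (start_ts advanced) but never finished.
        long started = created + TimeUnit.MINUTES.toMillis(1);
        System.out.println(tooSoonToRunRepair(started, created, now, minInterval)); // prints true
    }
}
```

In both cases the sketch reports "too soon" even though no repair has ever completed, which matches the skip described above.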

*Fix*

Add a check in tooSoonToRunRepair(): if repair_start_ts >= repair_finish_ts, 
the repair is either unstarted or incomplete. Return false immediately to allow 
it to proceed, bypassing the interval check.
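
A minimal sketch of the proposed guard, under stated assumptions: the class name, the method signature and the millisecond units are hypothetical and do not reflect the actual patch. It only shows the ordering of the two checks.

```java
// Hypothetical sketch of the fix described in this ticket; not the actual patch.
final class TooSoonFixSketch {

    static boolean tooSoonToRunRepair(long startTs, long finishTs, long nowMs, long minIntervalMs) {
        // start_ts >= finish_ts means the repair is unstarted or incomplete:
        // never "too soon", so skip the interval check entirely.
        if (startTs >= finishTs) {
            return false;
        }
        // A completed repair: apply the min_repair_interval check as before.
        return nowMs - finishTs < minIntervalMs;
    }

    public static void main(String[] args) {
        long minInterval = 24L * 60 * 60 * 1000; // assumed 24-hour interval, in milliseconds
        long t0 = 1_000_000L;

        // Freshly initialized record (start_ts = finish_ts): allowed to proceed.
        System.out.println(tooSoonToRunRepair(t0, t0, t0 + 1_000L, minInterval));          // prints false

        // Repair that completed one second ago: still too soon.
        System.out.println(tooSoonToRunRepair(t0, t0 + 5_000L, t0 + 6_000L, minInterval)); // prints true
    }
}
```

With the guard placed first, a record whose repair actually completed (start_ts < finish_ts) is still subject to the interval check, so min_repair_interval is unchanged for normal cycles.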

*Impact*

Nodes skip their initial repair attempts and wait unnecessarily until 
min_repair_interval elapses from record creation, delaying the first repair 
cycle and allowing data inconsistencies to accumulate.

  was:
*Problem*

tooSoonToRunRepair() executes before myTurnToRunRepair() and uses 
repair_finish_ts to determine if min_repair_interval has elapsed. When this 
check fails, execution returns early and the incomplete repair detection logic 
in myTurnToRunRepair() (which checks repair_start_ts >= repair_finish_ts) never 
executes.

*Scenarios*

  1. Unstarted repairs: insertNewRepairHistory() initializes both timestamps to 
the same value (start_ts = finish_ts = now()). When a node reads this record, 
it incorrectly interprets it as a recently completed repair and skips if within 
min_repair_interval.
  2. Incomplete repairs: When repair starts, only start_ts is updated. If the 
repair fails or crashes without completing, finish_ts remains unchanged. 
Subsequent attempts check this finish_ts against min_repair_interval and skip 
the repair before detecting it's incomplete.

*Fix*

Check if repair_start_ts >= repair_finish_ts within tooSoonToRunRepair() before 
evaluating the interval. If true, return false immediately to allow the repair 
to proceed.

*Impact*

Repairs can be delayed beyond min_repair_interval when nodes create history 
records for each other, when repairs fail to complete, or after node restarts 
during repair. This allows inconsistencies to accumulate longer than configured.


> Initial auto-repairs can be skipped by too soon check
> -----------------------------------------------------
>
>                 Key: CASSANDRA-21115
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21115
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>            Priority: Normal
>          Time Spent: 0.5h
>  Remaining Estimate: 0h


