[ 
https://issues.apache.org/jira/browse/CASSANDRA-21115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-21115:
------------------------------------
    Description: 
*Problem*

tooSoonToRunRepair() executes before myTurnToRunRepair() and uses 
repair_finish_ts to determine whether min_repair_interval has elapsed. When it 
reports that it is too soon, execution returns early and the incomplete-repair 
detection logic in myTurnToRunRepair() (which checks repair_start_ts >= 
repair_finish_ts) never executes.

*Scenarios*

  1. Unstarted repairs: insertNewRepairHistory() initializes both timestamps to 
the same value (start_ts = finish_ts = now()). A node that reads this record 
treats it as a recently completed repair and skips the repair while still 
within min_repair_interval.
  2. Incomplete repairs: When repair starts, only start_ts is updated. If the 
repair fails or crashes without completing, finish_ts remains unchanged. 
Subsequent attempts check this finish_ts against min_repair_interval and skip 
the repair before detecting it's incomplete.
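
Both scenarios can be illustrated with a minimal sketch. It is not the actual Cassandra code: the class name, the method signature, the millisecond timestamps and the 24h interval are assumptions made for illustration, and the check is reduced to the one comparison on finish_ts described above.

```java
import java.time.Duration;

public class TooSoonCheckSketch {
    // Simplified stand-in for the pre-fix check: only finishTs is consulted.
    // (Hypothetical signature; the real method reads the repair history table.)
    static boolean tooSoonToRunRepair(long finishTs, long now, long minIntervalMillis) {
        return now - finishTs < minIntervalMillis;
    }

    public static void main(String[] args) {
        long minInterval = Duration.ofHours(24).toMillis(); // assumed min_repair_interval
        long t1 = 0L;                                       // record creation time
        long now = t1 + Duration.ofHours(1).toMillis();     // one hour later

        // Scenario 1 (unstarted): start_ts = finish_ts = t1.
        // Scenario 2 (incomplete): start_ts moved forward, finish_ts is still t1.
        // finish_ts is t1 in both cases, so the check cannot tell them apart
        // from a repair that really completed at t1.
        System.out.println(tooSoonToRunRepair(t1, now, minInterval)); // prints true
    }
}
```

Because start_ts never enters the comparison, neither scenario can be distinguished from a genuinely completed repair at this point in the flow.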

*Fix*

Check if repair_start_ts >= repair_finish_ts within tooSoonToRunRepair() before 
evaluating the interval. If true, return false immediately to allow the repair 
to proceed.
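
A minimal sketch of the proposed guard follows. The class name, the parameter list and the use of millisecond timestamps are assumptions; only the start_ts >= finish_ts comparison and the early "return false" come from the description above.

```java
import java.time.Duration;

public class TooSoonFixSketch {
    // Hypothetical simplified form of tooSoonToRunRepair() with the guard added.
    static boolean tooSoonToRunRepair(long startTs, long finishTs, long now, long minIntervalMillis) {
        // start_ts == finish_ts: record was created but the repair never started.
        // start_ts >  finish_ts: repair started but did not finish.
        // In both cases the interval check must not block the repair.
        if (startTs >= finishTs) {
            return false;
        }
        // Only a completed repair (start_ts < finish_ts) is subject to the interval.
        return now - finishTs < minIntervalMillis;
    }

    public static void main(String[] args) {
        long minInterval = Duration.ofHours(24).toMillis(); // assumed min_repair_interval
        long now = Duration.ofHours(1).toMillis();

        System.out.println(tooSoonToRunRepair(0L, 0L, now, minInterval));     // unstarted: prints false
        System.out.println(tooSoonToRunRepair(5_000L, 0L, now, minInterval)); // incomplete: prints false
        System.out.println(tooSoonToRunRepair(0L, 5_000L, now, minInterval)); // completed recently: prints true
    }
}
```

With the guard in place, the interval applies only to records where finish_ts is strictly later than start_ts, which is the one state that represents a completed repair.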

*Impact*

Repairs can be delayed beyond min_repair_interval when nodes create history 
records for each other, when repairs fail to complete, or after node restarts 
during repair. This allows inconsistencies to accumulate longer than configured.

  was:
When a node starts its very first auto-repair and crashes before completing it, 
the repair won't be resumed properly after restart. Instead, it gets skipped by 
the "too soon to repair" check for up to 24 hours.

*What happens*

  1. Node joins the cluster, no repair history exists yet
  2. insertNewRepairHistory() creates a record with both repair_start_ts and 
repair_finish_ts set to the current time (let's call it T1)
  3. When repair actually starts, only repair_start_ts gets updated to T2
  4. Node crashes mid-repair
  5. On restart, tooSoonToRunRepair() is called before myTurnToRunRepair()
  6. It queries repair_finish_ts which is still T1 (the record creation time, 
not an actual repair completion)
  7. If less than 24h have passed since T1, the check returns "too soon" and 
bails out
  8. The logic in myTurnToRunRepair() that detects ongoing repairs 
(repair_start_ts > repair_finish_ts) never gets a chance to run

*Expected behavior*

A repair that was in progress should be resumed after restart, regardless of 
the min_repair_interval setting. The "too soon" check should not apply to 
incomplete repairs.

*How to reproduce*

  1. Set up a fresh node with auto-repair enabled
  2. Wait for the first repair to start
  3. Kill the node before repair completes
  4. Restart the node within 24 hours
  5. Observe that repair is skipped with "Too soon to run repair" in the logs


> Incomplete or unstarted repairs can be skipped by too soon check
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-21115
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21115
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>            Priority: Normal
>          Time Spent: 0.5h
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
