[ 
https://issues.apache.org/jira/browse/CASSANDRA-21115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-21115:
------------------------------------
    Description: 
*Problem*

When a repair history record is created, both repair_start_ts and 
repair_finish_ts are initialized to the same timestamp. tooSoonToRunRepair() 
reads repair_finish_ts and, if it falls within min_repair_interval of the 
current time, immediately returns "too soon" and aborts the attempt. This can 
prevent a repair from starting or from being resumed.

*When this occurs*

  1. Cross-node initialization: Node A calls insertNewRepairHistory() and 
creates a history record for Node B with start_ts = finish_ts = now(). When 
Node B attempts repair, it sees this fresh timestamp and incorrectly skips, 
thinking it just completed a repair.
  2. First repair interruption: Node starts its first repair (updating 
start_ts) but crashes or fails before completion (finish_ts unchanged). After 
restart, tooSoonToRunRepair() sees the initialization timestamp in finish_ts 
and may skip the incomplete repair.
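
A minimal sketch of how both cases trip the check, assuming the pre-fix logic 
compares only repair_finish_ts against the current time. The class name, method 
signatures, and millisecond units are illustrative assumptions, not the actual 
Cassandra code:

```java
public class TooSoonCheckBeforeFix {
    // Hypothetical stand-in for the pre-fix check: only repair_finish_ts is
    // consulted, so repair_start_ts is not even a parameter here.
    static boolean tooSoonToRunRepair(long repairFinishTs, long now, long minRepairInterval) {
        return now - repairFinishTs < minRepairInterval;
    }

    // Case 1: the record was created by another node, so
    // start_ts = finish_ts = creation time and no repair has run yet.
    static boolean crossNodeInitialization(long createdAt, long now, long minRepairInterval) {
        long repairFinishTs = createdAt;
        return tooSoonToRunRepair(repairFinishTs, now, minRepairInterval);
    }

    // Case 2: start_ts moved forward when the repair began, but the crash
    // left finish_ts at the creation time.
    static boolean interruptedFirstRepair(long createdAt, long startedAt, long now, long minRepairInterval) {
        long repairFinishTs = createdAt; // startedAt is never seen by the check
        return tooSoonToRunRepair(repairFinishTs, now, minRepairInterval);
    }
}
```

With an assumed 24-hour interval (86,400,000 ms), a record created at t = 1,000 
and checked at t = 2,000 is reported as "too soon" in both cases, even though no 
repair has completed.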

*Fix*

Add a check in tooSoonToRunRepair(): if repair_start_ts >= repair_finish_ts, 
the repair is either unstarted (the timestamps are still equal) or incomplete 
(start_ts was updated but finish_ts was not). Return false immediately to allow 
it to proceed, bypassing the interval check.
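
The proposed check can be sketched as follows. This is an illustration under 
assumptions, not the actual patch: the class name, the parameter list, and the 
use of epoch milliseconds are hypothetical, and the interval comparison is 
assumed to be "now minus finish_ts is less than min_repair_interval".

```java
public class TooSoonCheckAfterFix {
    // Hypothetical signature; the real method reads these values from the
    // repair history record and configuration.
    static boolean tooSoonToRunRepair(long repairStartTs,
                                      long repairFinishTs,
                                      long now,
                                      long minRepairInterval) {
        // Unstarted (start == finish) or incomplete (start > finish):
        // finish_ts does not describe a completed repair, so skip the
        // interval check and let the repair proceed.
        if (repairStartTs >= repairFinishTs) {
            return false;
        }
        // A repair has completed: apply the usual interval check.
        return now - repairFinishTs < minRepairInterval;
    }
}
```

Under these assumptions, a freshly initialized record (start_ts = finish_ts) and 
an interrupted first repair (start_ts > finish_ts) both return false, while a 
repair that finished within min_repair_interval still returns true.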

*Impact*

Nodes skip their initial repair attempts and wait unnecessarily until 
min_repair_interval elapses from record creation, delaying the first repair 
cycle and allowing data inconsistencies to accumulate.

  was:
*Problem*

When a repair history record is created, both repair_start_ts and 
repair_finish_ts are initialized to the same timestamp. tooSoonToRunRepair() 
reads repair_finish_ts and if it falls within min_repair_interval, immediately 
returns "too soon" and aborts. This prevents the incomplete repair detection in 
myTurnToRunRepair() (which checks repair_start_ts >= repair_finish_ts) from 
executing.

*When this occurs*

  1. Cross-node initialization: Node A calls insertNewRepairHistory() and 
creates a history record for Node B with start_ts = finish_ts = now(). When 
Node B attempts repair, it sees this fresh timestamp and incorrectly skips, 
thinking it just completed a repair.
  2. First repair interruption: Node starts its first repair (updating 
start_ts) but crashes or fails before completion (finish_ts unchanged). After 
restart, tooSoonToRunRepair() sees the initialization timestamp in finish_ts 
and may skip the incomplete repair.

*Fix*

Add a check in tooSoonToRunRepair(): if repair_start_ts >= repair_finish_ts, 
the repair is either unstarted or incomplete. Return false immediately to allow 
it to proceed, bypassing the interval check.

*Impact*

Nodes skip their initial repair attempts and wait unnecessarily until 
min_repair_interval elapses from record creation, delaying the first repair 
cycle and allowing data inconsistencies to accumulate.


> Initial auto-repairs can be skipped by too soon check
> -----------------------------------------------------
>
>                 Key: CASSANDRA-21115
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21115
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>            Priority: Normal
>          Time Spent: 0.5h
>  Remaining Estimate: 0h


