Ashish Kumar created HDDS-15658:
-----------------------------------

             Summary: No validation that moveReplicationTimeout < moveTimeout, 
can produce expired delete deadlines
                 Key: HDDS-15658
                 URL: https://issues.apache.org/jira/browse/HDDS-15658
             Project: Apache Ozone
          Issue Type: Improvement
            Reporter: Ashish Kumar


The delete deadline is computed as:
{code:java}
 containerInfo, replicaIndex, datanode, true, moveStartTime + moveTimeout); 
{code}
The replication deadline is:
{code:java}
replicaIndex, src, tgt, now + replicationTimeout);
{code}
In short: {{replicationDeadline = now + replicationTimeout}} and
{{deleteDeadline = moveStart + moveTimeout}}.

For deletion to have any valid time window, {{moveTimeout > 
replicationTimeout}} must hold, where {{replicationTimeout}} is the configured 
{{moveReplicationTimeout}}. No config validation enforces this. For example, 
an operator can configure:
moveReplicationTimeout = 60min
moveTimeout = 60min (or even < 60min)
Then:
 * Replication can take up to 60 min
 * {{deleteDeadline = moveStart + 60min}} has already been reached or passed 
by the time replication finishes
 * The DN receives a delete command with an expired deadline and ignores it
 * The move never deletes the source replica, so the container is left 
over-replicated
 * RM eventually cleans it up, but only in a later RM cycle
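
The timeline above can be checked with a small worked example. This is only a sketch: {{MoveDeadlineExample}} and {{worstCaseDeleteWindow}} are hypothetical names, not Ozone code, and it assumes replication is scheduled at {{moveStart}} (so {{now == moveStart}}) and finishes exactly at its own deadline.
{code:java}
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration, not part of Ozone.
final class MoveDeadlineExample {
  private MoveDeadlineExample() { }

  // Time left for the delete command in the worst case, i.e. when
  // replication completes exactly at replicationDeadline.
  // Assumes replication was scheduled at moveStart (now == moveStart).
  public static Duration worstCaseDeleteWindow(Instant moveStart,
      Duration replicationTimeout, Duration moveTimeout) {
    Instant replicationDeadline = moveStart.plus(replicationTimeout);
    Instant deleteDeadline = moveStart.plus(moveTimeout);
    return Duration.between(replicationDeadline, deleteDeadline);
  }

  public static void main(String[] args) {
    Instant moveStart = Instant.parse("2024-01-01T00:00:00Z");
    Duration window = worstCaseDeleteWindow(
        moveStart, Duration.ofMinutes(60), Duration.ofMinutes(60));
    boolean expired = window.isZero() || window.isNegative();
    System.out.println("window=" + window + " expired=" + expired);
  }
}
{code}
With both timeouts at 60min the window is zero, and with {{moveTimeout}} below 60min it is negative, so in either case the delete command can already be expired when it is sent.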

 

Add validation at startup:
 
{code:java}
if (moveReplicationTimeout >= moveTimeout) {
  throw new IllegalArgumentException(
      "moveReplicationTimeout must be < moveTimeout to allow time for deletion");
}
{code}
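
A self-contained version of this check might look like the sketch below. {{MoveTimeoutValidator}} is a hypothetical helper, and representing both timeouts as {{java.time.Duration}} is an assumption. Where the check would be wired into the actual config loading is not decided here.
{code:java}
import java.time.Duration;

// Hypothetical helper, not an existing Ozone class.
final class MoveTimeoutValidator {
  private MoveTimeoutValidator() { }

  // Rejects configurations that leave no time window for the delete step.
  public static void validate(Duration moveReplicationTimeout,
      Duration moveTimeout) {
    if (moveReplicationTimeout.compareTo(moveTimeout) >= 0) {
      throw new IllegalArgumentException(
          "moveReplicationTimeout (" + moveReplicationTimeout
          + ") must be < moveTimeout (" + moveTimeout
          + ") to allow time for deletion");
    }
  }

  public static void main(String[] args) {
    // Valid: 15 minutes remain for deletion after replication.
    validate(Duration.ofMinutes(50), Duration.ofMinutes(65));
    try {
      // Invalid: the configuration described in this issue.
      validate(Duration.ofMinutes(60), Duration.ofMinutes(60));
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
{code}
Failing at startup would surface the misconfiguration to the operator immediately, rather than leaving the over-replicated container to be found in a later RM cycle.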
 



