[jira] [Created] (HDDS-14829) Make snap diff job status tracking more reliable

Saketa Chalamchala (Jira) Thu, 12 Mar 2026 08:38:01 -0700

Saketa Chalamchala created HDDS-14829:
-----------------------------------------


             Summary: Make snap diff job status tracking more reliable
                 Key: HDDS-14829
                 URL: https://issues.apache.org/jira/browse/HDDS-14829
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Saketa Chalamchala
            Assignee: Saketa Chalamchala


Current Snapshot diff config defaults: 

{code:java}
ozone.om.snapshot.diff.cleanup.service.run.interval = 1m
  (Interval at which the snapshot diff cleanup service runs.)
ozone.om.snapshot.diff.job.default.wait.time = 1m
  (Default wait time returned to the client before it retries a snap diff request.)
{code}

The following scenario can happen:

{code:java}
T0:   A replication job submits a snapshot diff job and is told to check back
      in 1 min for the report.
T0.2: The job fails immediately and its status is updated to FAILED in the DB.
T0.5: Immediately afterwards, cleanup runs and removes the job from the DB.
T1:   After a minute the client requests the snapshot diff again, as instructed
      at T0, but since the job has disappeared from the DB a new job is
      submitted.
T1.2: The job fails immediately and its status is updated to FAILED in the DB.
T1.5: Immediately afterwards, cleanup runs and removes the job from the DB.
T2:   After a minute the client requests the snapshot diff again, as instructed
      at T1, but since the job has disappeared from the DB a new job is
      submitted.
... the above pattern keeps repeating, and the client never sees FAILED.
{code}
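
The timeline above can be replayed with a toy model. This is only a sketch and not Ozone code: the class name, the in-memory map standing in for the OM DB table, and the assumption that every job fails right away are all hypothetical. It counts how many new jobs get submitted when cleanup purges a FAILED job before the client's next retry, compared with when the FAILED entry is still present at retry time.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy replay of the timeline above; hypothetical names, not Ozone code. */
class SnapDiffRaceSketch {
  enum Status { IN_PROGRESS, FAILED }

  /**
   * Replays `rounds` client attempts, one per wait interval, and returns how
   * many new jobs were submitted. Assumes every job fails immediately.
   */
  static int newJobsSubmitted(int rounds, boolean cleanupPurgesFailedJobs) {
    // Stands in for the snap diff job table in the DB.
    Map<String, Status> jobTable = new HashMap<>();
    String jobKey = "snap1->snap2";
    int submitted = 0;

    for (int round = 0; round < rounds; round++) {
      // T(n): the client asks for the diff; a missing entry means a new job.
      if (!jobTable.containsKey(jobKey)) {
        jobTable.put(jobKey, Status.IN_PROGRESS);
        submitted++;
      }
      // T(n).2: the job fails and its status is recorded.
      jobTable.replace(jobKey, Status.IN_PROGRESS, Status.FAILED);
      // T(n).5: cleanup runs before the client's next retry.
      if (cleanupPurgesFailedJobs) {
        jobTable.remove(jobKey, Status.FAILED);
      }
    }
    return submitted;
  }
}
```

With `cleanupPurgesFailedJobs` set to true, every round submits a fresh job, which matches the repeating pattern in the scenario. With it set to false, only the first round submits a job and later rounds can observe the FAILED entry.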


- We should make the snap diff response more reliable: the client should be 
told when the job has failed and that it should retry. If we want to keep the 
status of the job for longer than 1 min, separating the submit and get APIs 
might be a good idea. 
- Improve auditing of snapshot diff jobs.
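
One possible shape for the first proposal, as a minimal sketch only: a tracker whose get call never creates a job, and whose cleanup keeps terminal jobs for a retention period longer than the client wait time. The class, method names, and the retention value used in the usage note are assumptions for illustration and do not reflect the existing Ozone API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker with separate submit/get calls; not the Ozone API. */
class SnapDiffJobTrackerSketch {
  enum Status { IN_PROGRESS, DONE, FAILED }

  /** Immutable job record; finishedAtMs is -1 while the job is running. */
  static final class Job {
    final Status status;
    final long finishedAtMs;

    Job(Status status, long finishedAtMs) {
      this.status = status;
      this.finishedAtMs = finishedAtMs;
    }
  }

  private final Map<String, Job> jobs = new ConcurrentHashMap<>();
  private final long retentionMs;

  SnapDiffJobTrackerSketch(long retentionMs) {
    this.retentionMs = retentionMs;
  }

  /** Submit: creates a job only if none is tracked; returns true if new. */
  boolean submit(String jobKey) {
    return jobs.putIfAbsent(jobKey, new Job(Status.IN_PROGRESS, -1L)) == null;
  }

  /** Get: read-only lookup, so polling can never start a new job. */
  Optional<Status> getStatus(String jobKey) {
    return Optional.ofNullable(jobs.get(jobKey)).map(job -> job.status);
  }

  /** Records a terminal status (DONE or FAILED) and when it was reached. */
  void markFinished(String jobKey, Status terminal, long nowMs) {
    jobs.computeIfPresent(jobKey, (key, job) -> new Job(terminal, nowMs));
  }

  /** Cleanup: removes only terminal jobs older than the retention period. */
  void cleanup(long nowMs) {
    jobs.values().removeIf(job ->
        job.status != Status.IN_PROGRESS
            && nowMs - job.finishedAtMs >= retentionMs);
  }
}
```

With an assumed retention of five minutes and the current one-minute wait time, a job that fails at T0.2 is still visible as FAILED when the client polls at T1, so the client can decide to resubmit explicitly instead of silently starting a new job.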

Current response: 
{code:java}
Snapshot diff job is IN_PROGRESS. Please retry after 60000 ms.
{code}



