[ 
https://issues.apache.org/jira/browse/HDDS-14829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saketa Chalamchala updated HDDS-14829:
--------------------------------------
    Epic Link: HDDS-13747

> Make snap diff job status tracking more reliable
> ------------------------------------------------
>
>                 Key: HDDS-14829
>                 URL: https://issues.apache.org/jira/browse/HDDS-14829
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Saketa Chalamchala
>            Assignee: Saketa Chalamchala
>            Priority: Major
>
> Current Snapshot diff config defaults: 
> {code:java}
> # Interval at which the snapshot diff cleanup service runs.
> ozone.om.snapshot.diff.cleanup.service.run.interval = 1m
> # Default wait time returned to the client before it retries a snap diff request.
> ozone.om.snapshot.diff.job.default.wait.time = 1m
> {code}
> The following scenario can happen:
> {code:java}
> T0: A replication job submits a snapshot diff job and is asked to check back 
> in 1 min for the report.
> T0.2: The job fails immediately and its status is updated to FAILED in the DB.
> T0.5: Immediately after that, cleanup runs and removes the job from the DB.
> T1: After a minute the user runs the snapshot diff again, as instructed at 
> T0, but since the job has disappeared from the DB a new job is submitted.
> T1.2: The job fails immediately and its status is updated to FAILED in the DB.
> T1.5: Immediately after that, cleanup runs and removes the job from the DB.
> T2: After a minute the user runs the snapshot diff again, as instructed at 
> T1, but since the job has disappeared from the DB a new job is submitted.
> ... the above pattern keeps repeating
> {code}
> - We should make the snap diff response more reliable. The client should be 
> told when a job has failed and that it should retry. If we want to keep the 
> status of a job for longer than 1 min, separating the submit and get APIs 
> might be a good idea.
> - Improve auditing of snapshot diff jobs.
>
> Current response: 
> {code:java}
> Snapshot diff job is IN_PROGRESS. Please retry after 60000 ms.
> {code}
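
The scenario and the first proposal above can be illustrated with a minimal in-memory sketch. It is not Ozone code: the class `SnapDiffJobTracker`, its methods, and the 2-minute retention value are all hypothetical, chosen only to show how keeping a FAILED job longer than the client wait time, together with a get call that never creates a job, would let the client see the failure at T1.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the Ozone Manager implementation.
final class SnapDiffJobTracker {
  enum Status { IN_PROGRESS, FAILED, DONE }

  static final class Job {
    final String id;
    final Status status;
    final long updatedAtMs;

    Job(String id, Status status, long updatedAtMs) {
      this.id = id;
      this.status = status;
      this.updatedAtMs = updatedAtMs;
    }
  }

  private final Map<String, Job> jobs = new ConcurrentHashMap<>();
  private final long retentionMs; // assumed setting, longer than the client wait time

  SnapDiffJobTracker(long retentionMs) {
    this.retentionMs = retentionMs;
  }

  // Submit: creates a job only if none is currently tracked for this id.
  Job submit(String id, long nowMs) {
    return jobs.computeIfAbsent(id, k -> new Job(k, Status.IN_PROGRESS, nowMs));
  }

  // Get: never creates a job; empty means the job is unknown or has expired.
  Optional<Job> get(String id) {
    return Optional.ofNullable(jobs.get(id));
  }

  void markFailed(String id, long nowMs) {
    jobs.computeIfPresent(id, (k, j) -> new Job(k, Status.FAILED, nowMs));
  }

  // Cleanup: removes finished jobs only once they are older than the retention period.
  void cleanup(long nowMs) {
    jobs.values().removeIf(j -> j.status != Status.IN_PROGRESS
        && nowMs - j.updatedAtMs >= retentionMs);
  }

  public static void main(String[] args) {
    SnapDiffJobTracker tracker = new SnapDiffJobTracker(120_000L);
    tracker.submit("snap1-snap2", 0L);           // T0: job submitted
    tracker.markFailed("snap1-snap2", 12_000L);  // T0.2: job fails
    tracker.cleanup(30_000L);                    // T0.5: cleanup runs, job is retained
    // T1: the client checks back and can still observe the failure.
    System.out.println(tracker.get("snap1-snap2")
        .map(j -> j.status.name()).orElse("NOT_FOUND"));
  }
}
```

With a retention of zero, the same sequence would return NOT_FOUND at T1, which corresponds to the loop described in the issue.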



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
