[
https://issues.apache.org/jira/browse/HDDS-14829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-14829:
----------------------------------
Labels: pull-request-available (was: )
> Make snap diff job status tracking more reliable
> ------------------------------------------------
>
> Key: HDDS-14829
> URL: https://issues.apache.org/jira/browse/HDDS-14829
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Saketa Chalamchala
> Assignee: Saketa Chalamchala
> Priority: Major
> Labels: pull-request-available
>
> Current Snapshot diff config defaults:
> {code:java}
> ozone.om.snapshot.diff.cleanup.service.run.interval = 1m (Interval at which
> snapshot diff clean up service will run.)
> ozone.om.snapshot.diff.job.default.wait.time = 1m (Default wait time
> returned to client to wait before retrying snap diff request.)
> {code}
> The following scenario can happen:
> {code:java}
> T0: Replication job submits a snapshot diff job and is asked to check back in
> 1 min for the report.
> T0.2: The job fails immediately and its status is updated to FAILED in the
> DB.
> T0.5: Immediately afterwards, cleanup runs and removes the job from the DB.
> T1: After a minute, the user runs the snapshot diff again as instructed at
> T0, but since the job has disappeared from the DB, a new job is submitted.
> T1.2: The job fails immediately and its status is updated to FAILED in the
> DB.
> T1.5: Immediately afterwards, cleanup runs and removes the job from the DB.
> T2: After a minute, the user runs the snapshot diff again as instructed at
> T1, but since the job has disappeared from the DB, a new job is submitted.
> ... the above pattern keeps repeating
> {code}
> - We should make the snap diff response more reliable. We should let the
> client know if the job has failed and that they should retry. If we want to
> keep the status of the job for longer than 1 min, then separating the
> submit/get API might be a good idea.
> - Improve auditing of snapshot diff jobs
> Current response:
> {code:java}
> Snapshot diff job is IN_PROGRESS. Please retry after 60000 ms.
> {code}
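
The scenario in the description can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not Ozone code: `SnapDiffJobTable`, its `retentionMs` parameter, and the job ID `job-1` are hypothetical names introduced for illustration, and the sketch assumes that cleanup only targets FAILED jobs. With a retention of 0 it reproduces the disappearing job; with a retention longer than the 1m client wait time, the FAILED status is still visible when the client checks back.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory stand-in for the snap diff job table; not Ozone code. */
public class SnapDiffJobTable {
  public enum Status { IN_PROGRESS, FAILED, DONE }

  /** A job's status plus the time (ms) at which that status was set. */
  private static final class Job {
    final Status status;
    final long updatedAtMs;

    Job(Status status, long updatedAtMs) {
      this.status = status;
      this.updatedAtMs = updatedAtMs;
    }
  }

  private final Map<String, Job> jobs = new HashMap<>();
  // Assumed knob: how long a FAILED job stays readable before cleanup may drop it.
  private final long retentionMs;

  public SnapDiffJobTable(long retentionMs) {
    this.retentionMs = retentionMs;
  }

  /** Records (or overwrites) the status of a job at the given time. */
  public void put(String jobId, Status status, long nowMs) {
    jobs.put(jobId, new Job(status, nowMs));
  }

  /** Returns the job status, or empty if the job is no longer in the table. */
  public Optional<Status> get(String jobId) {
    return Optional.ofNullable(jobs.get(jobId)).map(j -> j.status);
  }

  /** Cleanup pass: drops FAILED jobs whose status is at least retentionMs old. */
  public void cleanup(long nowMs) {
    jobs.values().removeIf(
        j -> j.status == Status.FAILED && nowMs - j.updatedAtMs >= retentionMs);
  }

  /** Replays T0 .. T1 from the description and returns what the client sees at T1. */
  public static Optional<Status> replay(long retentionMs) {
    SnapDiffJobTable table = new SnapDiffJobTable(retentionMs);
    table.put("job-1", Status.IN_PROGRESS, 0L);   // T0: job submitted
    table.put("job-1", Status.FAILED, 12_000L);   // T0.2: job fails
    table.cleanup(30_000L);                       // T0.5: cleanup service runs
    return table.get("job-1");                    // T1: client checks back
  }

  public static void main(String[] args) {
    System.out.println("retention=0ms:      " + replay(0L));
    System.out.println("retention=120000ms: " + replay(120_000L));
  }
}
```

In this sketch the client at T1 finds no job when the retention is 0, which is what leads to a fresh submission, and finds FAILED when the retention is 120000 ms. The specific retention value is only an example; the point is that it needs to exceed the wait time returned to the client.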