Saketa Chalamchala created HDDS-14829:
-----------------------------------------
Summary: Make snap diff job status tracking more reliable
Key: HDDS-14829
URL: https://issues.apache.org/jira/browse/HDDS-14829
Project: Apache Ozone
Issue Type: Bug
Reporter: Saketa Chalamchala
Assignee: Saketa Chalamchala
Current Snapshot diff config defaults:
{code:java}
ozone.om.snapshot.diff.cleanup.service.run.interval = 1m (Interval at which
snapshot diff clean up service will run.)
ozone.om.snapshot.diff.job.default.wait.time = 1m (Default wait time returned
to client to wait before retrying snap diff request.)
{code}
The following scenario can happen:
{code:java}
T0: Replication job submits a snapshot diff job and is asked to check back in
1 min for the report.
T0.2: The job fails immediately and its status is updated to FAILED in the DB.
T0.5: Immediately afterwards, the cleanup service runs and removes the job from
the DB.
T1: After a minute the client requests the snapshot diff again, as instructed
at T0. Because the job is no longer in the DB, a new job is submitted.
T1.2: The job fails immediately and its status is updated to FAILED in the DB.
T1.5: Immediately afterwards, the cleanup service runs and removes the job from
the DB.
T2: After a minute the client requests the snapshot diff again, as instructed
at T1. Because the job is no longer in the DB, a new job is submitted.
... the above pattern keeps repeating, and the client never sees the FAILED
status.
{code}
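The loop above can be reproduced with a small toy model. This is only a sketch of the scenario as described in this issue, not Ozone code: the class, the method names and the job key are hypothetical, and it assumes a combined submit/get call plus a cleanup pass that drops FAILED entries as soon as it runs.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the race described above. Not Ozone code; all names are hypothetical. */
class SnapDiffRaceSketch {

    enum JobStatus { IN_PROGRESS, FAILED }

    // Stands in for the snapshot diff job table in the OM DB.
    final Map<String, JobStatus> jobTable = new HashMap<>();
    int submissions = 0;           // how many jobs were created
    int failuresSeenByClient = 0;  // how often the client observed FAILED

    /** Combined submit/get: creates the job if no entry exists, then returns its status. */
    JobStatus submitOrGet(String jobKey) {
        if (!jobTable.containsKey(jobKey)) {
            jobTable.put(jobKey, JobStatus.IN_PROGRESS);
            submissions++;
        }
        return jobTable.get(jobKey);
    }

    /** The job fails and its status is persisted. */
    void failJob(String jobKey) {
        jobTable.put(jobKey, JobStatus.FAILED);
    }

    /** Cleanup pass that removes FAILED entries immediately (assumption of this sketch). */
    void runCleanup() {
        jobTable.values().removeIf(status -> status == JobStatus.FAILED);
    }

    /** Plays the T0, T0.2, T0.5, T1, ... timeline for the given number of rounds. */
    static SnapDiffRaceSketch simulate(int rounds) {
        SnapDiffRaceSketch om = new SnapDiffRaceSketch();
        String jobKey = "vol1/bucket1:snap1->snap2"; // hypothetical key
        for (int round = 0; round < rounds; round++) {
            JobStatus seen = om.submitOrGet(jobKey); // Tn: client polls or submits
            if (seen == JobStatus.FAILED) {
                om.failuresSeenByClient++;
            }
            om.failJob(jobKey);   // Tn.2: job fails
            om.runCleanup();      // Tn.5: cleanup removes the FAILED entry
        }
        return om;
    }

    public static void main(String[] args) {
        SnapDiffRaceSketch result = simulate(3);
        System.out.println("jobs submitted: " + result.submissions);
        System.out.println("FAILED seen by client: " + result.failuresSeenByClient);
    }
}
```

In this model every round creates a fresh job and the client only ever observes IN_PROGRESS, which matches the repeating pattern in the timeline.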
- We should make the snap diff response more reliable. The client should be
told when the job has failed and that it should retry. If we want to keep the
status of the job for longer than 1 min, separating the submit and get APIs
might be a good idea.
- Improve auditing of snapshot diff jobs
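One possible shape for the first proposal, as a rough sketch only: a read-only status call that never creates a job, and a cleanup pass that keeps FAILED entries for a retention period longer than the client wait time. Everything here is hypothetical (class name, method names, the NOT_FOUND status, the 2 min retention value) and none of it is taken from the Ozone code base.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of separate submit/get calls with retained terminal status. Hypothetical names. */
class SnapDiffTrackerSketch {

    enum JobStatus { IN_PROGRESS, FAILED, NOT_FOUND }

    static final class JobRecord {
        JobStatus status = JobStatus.IN_PROGRESS;
        long finishedAtMs = -1; // set when the job reaches a terminal state
    }

    private final Map<String, JobRecord> jobs = new HashMap<>();
    private final long retentionMs;

    SnapDiffTrackerSketch(long retentionMs) {
        this.retentionMs = retentionMs;
    }

    /** Explicit submit: the only call that creates (or replaces) a job. */
    void submit(String jobKey) {
        jobs.put(jobKey, new JobRecord());
    }

    /** Read-only status lookup: never creates a job as a side effect. */
    JobStatus getStatus(String jobKey) {
        JobRecord record = jobs.get(jobKey);
        return record == null ? JobStatus.NOT_FOUND : record.status;
    }

    void markFailed(String jobKey, long nowMs) {
        JobRecord record = jobs.get(jobKey);
        if (record != null) {
            record.status = JobStatus.FAILED;
            record.finishedAtMs = nowMs;
        }
    }

    /** Cleanup removes FAILED jobs only after they have been retained for retentionMs. */
    void runCleanup(long nowMs) {
        jobs.values().removeIf(record ->
            record.status == JobStatus.FAILED
                && nowMs - record.finishedAtMs >= retentionMs);
    }

    public static void main(String[] args) {
        // Retention of 2 min is an arbitrary value chosen to exceed the 1 min wait time.
        SnapDiffTrackerSketch tracker = new SnapDiffTrackerSketch(120_000L);
        String jobKey = "vol1/bucket1:snap1->snap2"; // hypothetical key
        tracker.submit(jobKey);             // T0
        tracker.markFailed(jobKey, 12_000L); // T0.2
        tracker.runCleanup(30_000L);         // T0.5: entry is retained
        System.out.println("status at T1: " + tracker.getStatus(jobKey));
    }
}
```

With this split, the client polling at T1 sees FAILED and can decide to resubmit explicitly, instead of silently starting a new job.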
Current response:
{code:java}
Snapshot diff job is IN_PROGRESS. Please retry after 60000 ms.
{code}