Devesh Kumar Singh created HDDS-12475:
-----------------------------------------
Summary: Ozone Recon - Handle auto recovery of data during error
in reinitialising of all OM tasks when fetching full snapshot of OM DB
Key: HDDS-12475
URL: https://issues.apache.org/jira/browse/HDDS-12475
Project: Apache Ozone
Issue Type: Task
Components: Ozone Recon
Reporter: Devesh Kumar Singh
Assignee: Devesh Kumar Singh
There could be a few edge cases:
* If Recon was stopped for some time, OM DB compaction during that downtime
may force Recon, once it comes back online, to fetch a full snapshot and
reinitialise all OM-based Recon tasks. If reinitialisation of any OM task
fails in this flow, lastRunTaskStatus will record the failure. In the next
sync OM iteration, however, the failed task may proceed with delta updates
even though it completely missed the OM DB updates from the last full
snapshot run.
* Even if Recon is running continuously in the cluster, it may start lagging
over time if the OM DB write TPS in the cluster is very high. In that case,
Recon has a mechanism to fall back to a full snapshot and reinitialise all
OM-based Recon tasks. If reinitialisation of any OM task fails in this flow,
lastRunTaskStatus will record the failure. In the next sync OM iteration,
however, the failed task may proceed with delta updates even though it
completely missed the OM DB updates from the last full snapshot run.
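One possible recovery approach for the two cases above, sketched in Java: remember which tasks failed reinitialisation and reprocess them again in the next sync iteration instead of letting them move on to delta updates. This is only an illustrative sketch, not Recon's actual code. SketchOmTask, ReinitRecoveryTracker and all method names are hypothetical, and the sketch assumes the data from the last full snapshot is still available to the task when it is retried.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an OM-based Recon task (not the real Recon interface).
interface SketchOmTask {
  String name();
  boolean reprocess();     // rebuild task state from the full OM DB snapshot
  boolean processDelta();  // apply incremental OM DB updates
}

// Hypothetical helper: tracks tasks whose last reinitialisation failed so that
// they are reprocessed again rather than silently switching to delta updates.
class ReinitRecoveryTracker {
  private final Set<String> pendingReinit = new LinkedHashSet<>();

  // Called after a full snapshot has been fetched.
  void reinitialiseAll(List<SketchOmTask> tasks) {
    for (SketchOmTask task : tasks) {
      runReprocess(task);
    }
  }

  // Called on a sync iteration in which only delta updates were fetched.
  void syncIteration(List<SketchOmTask> tasks) {
    for (SketchOmTask task : tasks) {
      if (pendingReinit.contains(task.name())) {
        // The task never absorbed the full snapshot: retry that first.
        runReprocess(task);
      } else {
        task.processDelta();
      }
    }
  }

  private void runReprocess(SketchOmTask task) {
    boolean ok;
    try {
      ok = task.reprocess();
    } catch (RuntimeException e) {
      ok = false;
    }
    if (ok) {
      pendingReinit.remove(task.name());
    } else {
      pendingReinit.add(task.name());
    }
  }

  // Returns a copy so callers cannot modify the tracker's internal state.
  Set<String> pendingReinit() {
    return new LinkedHashSet<>(pendingReinit);
  }
}
```

With this shape, a task that failed reinitialisation stays in the pending set across iterations until a reprocess succeeds, and only then resumes delta updates.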
In both edge cases, these failures may go completely silent and unnoticed.
Even an existing metric such as lastRunTaskStatus, which recorded the
failure, may be overwritten by the status of the next delta run, which may
be a success.
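To illustrate how the failure could stay visible, the following Java sketch keeps reinitialisation failures in separate fields that a later delta run does not overwrite. It is a minimal sketch under assumptions: SketchTaskStatus and its fields are hypothetical names, not existing Recon metrics.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-task status holder; names are illustrative only.
class SketchTaskStatus {
  private volatile boolean lastRunSucceeded = true;
  private volatile boolean reinitPending = false;
  private final AtomicLong reinitFailureCount = new AtomicLong();

  // Record the outcome of a full-snapshot reinitialisation.
  void recordReinit(boolean success) {
    lastRunSucceeded = success;
    if (success) {
      reinitPending = false;
    } else {
      reinitPending = true;
      reinitFailureCount.incrementAndGet();
    }
  }

  // Record the outcome of a delta run. This updates only the last-run status;
  // the reinitialisation fields are left untouched on purpose.
  void recordDelta(boolean success) {
    lastRunSucceeded = success;
  }

  boolean lastRunSucceeded() { return lastRunSucceeded; }
  boolean reinitPending() { return reinitPending; }
  long reinitFailureCount() { return reinitFailureCount.get(); }
}
```

Here a successful delta run after a failed reinitialisation flips the last-run status back to success, while reinitPending and reinitFailureCount still show that the task missed the full snapshot.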