devmadhuu opened a new pull request, #3947: URL: https://github.com/apache/ozone/pull/3947
Root Cause: If Recon is down during the window in which a new container is created and then goes missing, Recon has no way to learn about that container: the data nodes were down, so nothing was sending container reports to Recon. Recon also does not sync with the SCM DB in this case. Recon's copy of the SCM RocksDB is updated and synced with the SCM RocksDB only when the difference in container count between SCM and Recon is greater than the threshold defined in "ozone.recon.scm.container.threshold", and that check happens only at Recon startup. After startup, Recon depends solely on container reports from data nodes.

Apart from the above, there is another design point: the SCM RocksDB is not updated with container data immediately when "ozone.scm.ratis.enable" is set to true (the default). In that case, data is flushed to the SCM RocksDB only when a Ratis snapshot is taken, which depends on the "dfs.ratis.snapshot.threshold" value (default 10000). So even a periodic sync with the SCM RocksDB would not yield any new container information until the SCM RocksDB has been updated or flushed.

Changes done to fix the issue: Added a periodic sync of the Recon container cache from the SCM container cache, using the SCM listContainer API exposed on the RPC port.

https://issues.apache.org/jira/browse/HDDS-3486

## How was this patch tested?

Tested using test cases, and manually by bringing Recon down, adding a new container in SCM, and making that container go missing.
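As a rough illustration of the periodic sync described under "Changes done to fix the issue", the sketch below pages through a container listing and adds any container ID that the local cache does not yet know. It is not the code from this patch and does not use Ozone classes: `ScmContainerLister`, `ReconContainerCache`, `inMemoryScm`, `syncOnce`, `startPeriodicSync`, the page size, and the interval are all hypothetical names and values chosen for the example, and the paged `listContainers(startId, count)` shape is an assumption about how such a listing call might look.

```java
import java.util.Collection;
import java.util.List;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public class ContainerSyncSketch {

  /** Hypothetical stand-in for a paged container listing call on SCM. */
  public interface ScmContainerLister {
    // Returns up to 'count' container IDs >= startId, in ascending order.
    List<Long> listContainers(long startId, int count);
  }

  /** Hypothetical stand-in for Recon's in-memory container cache. */
  public static final class ReconContainerCache {
    private final Set<Long> ids = new ConcurrentSkipListSet<>();

    public boolean contains(long id) { return ids.contains(id); }
    public void add(long id) { ids.add(id); }
    public int size() { return ids.size(); }
  }

  /** In-memory lister used only to exercise the sketch. */
  public static ScmContainerLister inMemoryScm(Collection<Long> containerIds) {
    NavigableSet<Long> sorted = new TreeSet<>(containerIds);
    return (startId, count) -> sorted.tailSet(startId).stream()
        .limit(count)
        .collect(Collectors.toList());
  }

  /**
   * One sync pass: walks the listing page by page and adds every
   * container ID missing from the cache. Returns how many were added.
   */
  public static int syncOnce(ScmContainerLister scm,
                             ReconContainerCache recon,
                             int pageSize) {
    int added = 0;
    long startId = 0;
    while (true) {
      List<Long> page = scm.listContainers(startId, pageSize);
      if (page.isEmpty()) {
        break;
      }
      for (long id : page) {
        if (!recon.contains(id)) {
          recon.add(id);
          added++;
        }
      }
      // The next page starts just past the last ID seen.
      startId = page.get(page.size() - 1) + 1;
      if (page.size() < pageSize) {
        break; // A short page means the listing is exhausted.
      }
    }
    return added;
  }

  /**
   * Runs syncOnce repeatedly with a fixed delay between passes.
   * Note: if a pass throws, the executor cancels later runs, so a real
   * task would need to catch and log failures.
   */
  public static ScheduledExecutorService startPeriodicSync(
      ScmContainerLister scm, ReconContainerCache recon,
      int pageSize, long intervalSeconds) {
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleWithFixedDelay(
        () -> syncOnce(scm, recon, pageSize),
        0, intervalSeconds, TimeUnit.SECONDS);
    return scheduler;
  }
}
```

Because each pass compares against the live listing rather than a persisted DB snapshot, a container created while the cache owner was down is picked up on the next pass, independent of when a snapshot flush happens.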
