Hi Ozone devs,

We would like to propose merging the HDDS-10239-container-reconciliation branch into master.
This feature adds a data checksum for all non-open containers, and enables manual reconciliation of Ratis container replicas whose data checksums have diverged due to failures.

- More details can be found in the original design doc: https://github.com/apache/ozone/blob/HDDS-10239-container-reconciliation/hadoop-hdds/docs/content/design/container-reconciliation.md
- Parent Jira: https://issues.apache.org/jira/browse/HDDS-10239
- Branch merge checklist: https://cwiki.apache.org/confluence/display/OZONE/Storage+Container+Reconciliation+-+HDDS-10239

These are important notes about the current state of the feature:

- Reconciliation must be triggered manually through an ozone admin container reconcile command. Integration with SCM's replication manager will come later, and will likely be done on a branch as well.
- Checking container data checksums currently requires running ozone admin container info --json and extracting the dataChecksum field of each replica. The following improvements are in flight to identify containers that need to be reconciled:
  - Recon's ability to identify Ratis containers with mismatched checksums.
    - The backend is merged; the front end is still in progress under HDDS-12395 <https://issues.apache.org/jira/browse/HDDS-12395>.
  - An improved reconcile CLI which includes an ozone admin container reconcile --status option to check replica information more easily than with ozone admin container info --json.
    - This is currently in progress under HDDS-12078 <https://issues.apache.org/jira/browse/HDDS-12078>.
- Reconciliation does not currently change the state of any containers. If containers are unhealthy before reconciliation, and they are repaired so that all replicas have matching data checksums, they will remain marked as unhealthy.
  - Replicas will be able to move out of the unhealthy state once HDDS-11207 <https://issues.apache.org/jira/browse/HDDS-11207> is complete.

This vote will be open for at least one week.

Thanks,
Aswin