Hi Ozone devs,

We would like to propose merging the HDDS-10239-container-reconciliation
branch into master.

This feature adds a data checksum for all non-open containers, and enables
manual reconciliation of Ratis container replicas that have diverged with
different data checksums due to failures.


   - More details can be found in the original design doc:
     https://github.com/apache/ozone/blob/HDDS-10239-container-reconciliation/hadoop-hdds/docs/content/design/container-reconciliation.md
   - Parent Jira: https://issues.apache.org/jira/browse/HDDS-10239
   - Branch merge checklist:
     https://cwiki.apache.org/confluence/display/OZONE/Storage+Container+Reconciliation+-+HDDS-10239


These are important notes about the current state of the feature:


   - Reconciliation must be triggered manually through the ozone admin
     container reconcile command. Integration with SCM's replication manager
     will come later, and will likely be done on a branch as well.
   - Checking container data checksums currently requires running ozone admin
     container info --json and extracting the dataChecksum field of each
     replica. The following improvements are in flight to identify containers
     that need to be reconciled:
      - Recon's ability to identify Ratis containers with mismatched checksums.
         - The backend is merged; the front end is still in progress under
           HDDS-12395 <https://issues.apache.org/jira/browse/HDDS-12395>.
      - An improved reconcile CLI that includes an ozone admin container
        reconcile --status option to check replica information more easily
        than with ozone admin container info --json.
         - This is currently in progress under HDDS-12078
           <https://issues.apache.org/jira/browse/HDDS-12078>.
   - Reconciliation does not currently change the state of any containers. If
     containers are unhealthy before reconciliation and are repaired so that
     all replicas have matching data checksums, they will remain marked as
     unhealthy.
      - Replicas will be able to move out of the unhealthy state once
        HDDS-11207 <https://issues.apache.org/jira/browse/HDDS-11207> is
        complete.
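
Until the improved tooling lands, the manual check described above (reading the
dataChecksum field of each replica from the --json output) could be scripted.
The following is only a rough sketch, not part of the branch: the dataChecksum
field name comes from this mail, but the top-level "replicas" key and the
sample values are assumptions made for illustration, so please adjust them to
the actual output on your cluster.

```python
import json


def replica_checksums(info_json, replicas_key="replicas"):
    """Return the dataChecksum value reported by each replica.

    ASSUMPTION: the JSON text is an object with a list of replicas stored
    under `replicas_key`; only the dataChecksum field name is taken from
    the mail above.
    """
    info = json.loads(info_json)
    return [replica.get("dataChecksum") for replica in info.get(replicas_key, [])]


def needs_reconciliation(info_json, replicas_key="replicas"):
    """True when the replicas do not all report the same dataChecksum."""
    return len(set(replica_checksums(info_json, replicas_key))) > 1


# Hypothetical sample resembling the output for one container (made-up values).
sample = '{"replicas": [{"dataChecksum": "abc"}, {"dataChecksum": "def"}]}'
print(needs_reconciliation(sample))
```

A container for which needs_reconciliation returns True would be a candidate
for the ozone admin container reconcile command.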


This vote will be open for at least one week.

Thanks,

Aswin
