Based on discussion in the community sync this week I want to add some more information. Code
For those interested in checking out the code, these are some of the major classes to start with: - ReconcileContainerTask: This is the command on the datanode that is received from SCM to reconcile a container with a datanode’s peers. It passes through the ReplicationSupervisor just like replication and reconstruction commands. - ContainerProtos.ContainerChecksumInfo: This is the proto format of the new file that is written into the containers with the merkle tree and list of deleted blocks. - ContainerMerkleTreeWriter: This class is used to build merkle trees chunk by chunk and generate a protobuf representation of the tree. - ContainerChecksumTreeManager: This class coordinates reads and writes of ContainerChecksumInfo for containers. The diff method determines which repairs should be done on a container based on a peer’s merkle tree. - KeyValueContainerCheck#scanData: This is the existing method called by the background and on-demand container data scanners to scan a container. It has been updated to build the merkle tree as it runs. - KeyValueHandler#reconcileContainer: This method updates the container based on the peer’s replica. - Major tests for reconciliation have been added to TestContainerCommandReconciliation (integration test) and TestContainerReconciliationWithMockDatanodes (unit test with mocked clients). - There are more tasks under the reconciliation jira to expand the types of faults being tested. Logging Logging was added on the datanodes to track reconciliation as it is happening. The datanode application log will print a summary of messages like this: 2025-06-10 20:13:14,570 [main] INFO keyvalue.KeyValueHandler (KeyValueHandler.java:reconcileContainer(1595)) - Beginning reconciliation for container 100 with peer bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Current data checksum is dcce847d 2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler (KeyValueHandler.java:reconcileContainer(1681)) - Container 100 reconciled with peer bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Data checksum updated from dcce847d to 16189e0b. Missing blocks repaired: 5/5 Missing chunks repaired: 0/0 Corrupt chunks repaired: 10/10 Time taken: 19 ms 2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler (KeyValueHandler.java:reconcileContainer(1704)) - Completed reconciliation for container 100 with 1/1 peers. 15 blocks were updated. Data checksum updated from dcce847d to 16189e0b This shows: - Reconciliation started between this datanode and one other peer for container 100 - After reconciliation with the peer completed, the data checksum of our container was updated - Compared to this peer, we needed to ingest 5 missing blocks and repair 10 corrupt chunks. All operations were successful - At the end we get a summary of how many changes were done to this container after consulting all the peers in the reconcile request. In this case there was only one peer. By enabling debug logging we can see the individual blocks and chunks that were repaired as well. In the dn-container.log file, dataChecksum is now included for every log line. We also get one new line in this log every time the checksum for a container is updated. In case logs roll off, a debug tool to inspect container’s checksum information locally on a datanode will be implemented in HDDS-13239 <https://issues.apache.org/jira/browse/HDDS-13239>. Metrics The metrics for reconciliation tasks are available as a part of ReplicationSupervisor class which includes: - numRequestedContainerReconciliations - Number of reconciliation tasks - numQueuedContainerReconciliations - Number of queued tasks - numTimeoutContainerReconciliations - Number of timed-out tasks - numSuccessContainerReconciliations- Number of Success - numFailureContainerReconciliations - Number of Failures - numSkippedContainerReconciliations - Number of Skipped Tasks Latency/Count metrics for the tasks exposed by CommandHandlerMetrics for ReconcileContainerCommandHandler: - TotalRunTimeMs - The total runtime of the command handler in milliseconds - AvgRunTimeMs - Average run time of the command handler in milliseconds - QueueWaitingTaskCount - The number of queued tasks waiting for execution - InvocationCount - The number of times the command handler has been invoked - CommandReceivedCount - The number of received SCM commands for each command type Other container reconciliation-related tasks are encapsulated in ContainerMerkleTreeMetrics: - numMerkleTreeWriteFailure - Number of Merkle tree write failure - numMerkleTreeReadFailure - Number of Merkle tree read failure - numMerkleTreeDiffFailure - Number of Merkle tree diff failure - numNoRepairContainerDiff - Number of container diff that doesn’t require repair - numRepairContainerDiff - Number of container diff that require repair - merkleTreeWriteLatencyNS- Merkle tree write latency - merkleTreeReadLatencyNS - Merkle tree read latency - merkleTreeCreateLatencyNS - Merkle tree creation latency - merkleTreeDiffLatencyNS - Merkle tree diff latency Thanks for reviewing Ethan