Based on discussion in the community sync this week I want to add some more
information.
Code

For those interested in checking out the code, these are some of the major
classes to start with:

   - ReconcileContainerTask: This is the command on the datanode that is
   received from SCM to reconcile a container with a datanode’s peers. It
   passes through the ReplicationSupervisor just like replication and
   reconstruction commands.
   - ContainerProtos.ContainerChecksumInfo: This is the proto format of the
   new file that is written into the containers with the merkle tree and list
   of deleted blocks.
   - ContainerMerkleTreeWriter: This class is used to build merkle trees
   chunk by chunk and generate a protobuf representation of the tree.
   - ContainerChecksumTreeManager: This class coordinates reads and writes
   of ContainerChecksumInfo for containers. The diff method determines
   which repairs should be done on a container based on a peer’s merkle tree.
   - KeyValueContainerCheck#scanData: This is the existing method called by
   the background and on-demand container data scanners to scan a container.
   It has been updated to build the merkle tree as it runs.
   - KeyValueHandler#reconcileContainer: This method updates the container
   based on the peer’s replica.
   - Major tests for reconciliation have been added to
   TestContainerCommandReconciliation (integration test) and
   TestContainerReconciliationWithMockDatanodes (unit test with mocked
   clients).
      - There are more tasks under the reconciliation jira to expand the
      types of faults being tested.

Logging

Logging was added on the datanodes to track reconciliation as it is
happening. The datanode application log will print a summary of messages
like this:

2025-06-10 20:13:14,570 [main] INFO  keyvalue.KeyValueHandler
(KeyValueHandler.java:reconcileContainer(1595)) - Beginning
reconciliation for container 100 with peer
bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Current data
checksum is dcce847d
2025-06-10 20:13:14,589 [main] WARN  keyvalue.KeyValueHandler
(KeyValueHandler.java:reconcileContainer(1681)) - Container 100
reconciled with peer
bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Data checksum
updated from dcce847d to 16189e0b.
Missing blocks repaired: 5/5
Missing chunks repaired: 0/0
Corrupt chunks repaired:  10/10
Time taken: 19 ms
2025-06-10 20:13:14,589 [main] WARN  keyvalue.KeyValueHandler
(KeyValueHandler.java:reconcileContainer(1704)) - Completed
reconciliation for container 100 with 1/1 peers. 15 blocks were
updated. Data checksum updated from dcce847d to 16189e0b

This shows:

   - Reconciliation started between this datanode and one other peer for
   container 100
   - After reconciliation with the peer completed, the data checksum of our
   container was updated
   - Compared to this peer, we needed to ingest 5 missing blocks and repair
   10 corrupt chunks. All operations were successful
   - At the end we get a summary of how many changes were done to this
   container after consulting all the peers in the reconcile request. In this
   case there was only one peer.
   By enabling debug logging we can see the individual blocks and chunks
   that were repaired as well.

In the dn-container.log file, dataChecksum is now included for every log
line. We also get one new line in this log every time the checksum for a
container is updated.

In case logs roll off, a debug tool to inspect container’s checksum
information locally on a datanode will be implemented in HDDS-13239
<https://issues.apache.org/jira/browse/HDDS-13239>.
Metrics

The metrics for reconciliation tasks are available as a part of
ReplicationSupervisor class which includes:

   - numRequestedContainerReconciliations - Number of reconciliation tasks
   - numQueuedContainerReconciliations - Number of queued tasks
   - numTimeoutContainerReconciliations - Number of timed-out tasks
   - numSuccessContainerReconciliations- Number of Success
   - numFailureContainerReconciliations - Number of Failures
   - numSkippedContainerReconciliations - Number of Skipped Tasks

Latency/Count metrics for the tasks exposed by CommandHandlerMetrics for
ReconcileContainerCommandHandler:

   - TotalRunTimeMs - The total runtime of the command handler in
   milliseconds
   - AvgRunTimeMs - Average run time of the command handler in milliseconds
   - QueueWaitingTaskCount - The number of queued tasks waiting for
   execution
   - InvocationCount - The number of times the command handler has been
   invoked
   - CommandReceivedCount - The number of received SCM commands for each
   command type

Other container reconciliation-related tasks are encapsulated in
ContainerMerkleTreeMetrics:

   - numMerkleTreeWriteFailure - Number of Merkle tree write failure
   - numMerkleTreeReadFailure - Number of Merkle tree read failure
   - numMerkleTreeDiffFailure - Number of Merkle tree diff failure
   - numNoRepairContainerDiff - Number of container diff that doesn’t
   require repair
   - numRepairContainerDiff - Number of container diff that require repair
   - merkleTreeWriteLatencyNS- Merkle tree write latency
   - merkleTreeReadLatencyNS - Merkle tree read latency
   - merkleTreeCreateLatencyNS - Merkle tree creation latency
   - merkleTreeDiffLatencyNS - Merkle tree diff latency


Thanks for reviewing

Ethan

Reply via email to