Re: [VOTE] Merge branch HDDS-10239-container-reconciliation into master

Tsz-Wo Nicholas Sze Tue, 17 Jun 2025 14:23:32 -0700

One more thing:  how about we call all the completed work Phase 1 and the
work after merge Phase 2?  Then, we should update HDDS-10239 to describe
what was done in Phase 1 and move the unresolved JIRAs to a new umbrella
Phase 2 JIRA.


Tsz-Wo


On Tue, Jun 17, 2025 at 12:17 PM Tsz-Wo Nicholas Sze <szets...@gmail.com>
wrote:

> +1
> Thanks for the great work!  The feature is going to be very useful.  Some
> questions/comments/suggestions below.
>
> "Healthy" related:
> - What does ChunkMerkleTree.isHealthy == false means?  Checksum
> mismatched?  Missing?
> - Rename ScanResult.isHealthy() to hasErrors() since all the (non-test)
> calls are !isHealthy().
> *  We should avoid the term "healthy".  It is not clear what it means when
> isHealthy==false.
>   It is better to have "boolean isChecksumMatched", "boolean hasError"
>   or enum HealthType {NO_ERROR, CHECKSUM_MISMATCHED, MISSING, IO_EXCEPTION}
>
> Others:
> - Why is ContainerChecksumTreeManager created in both BlockDeletingService
> and OzoneContainer?  Should they share it?
> - Use ContainerID instead of Long.
> - Add javadoc for all the maps using Long.  Describe what it is for (e.g.
> block id).
> - When converting a checksum (long) to a string, use hexadecimal.
> - Avoid Optional, which is slow and generates garbage.
> - Fix conflicts with the master.
> - Use proto 3 for new protos (but it may be hard to do.)
>
> Tsz-Wo
>
>
> On Tue, Jun 10, 2025 at 5:57 PM Ethan Rose <er...@apache.org> wrote:
>
>> Based on discussion in the community sync this week I want to add some
>> more
>> information.
>> Code
>>
>> For those interested in checking out the code, these are some of the major
>> classes to start with:
>>
>>    - ReconcileContainerTask: This is the command on the datanode that is
>>    received from SCM to reconcile a container with a datanode’s peers. It
>>    passes through the ReplicationSupervisor just like replication and
>>    reconstruction commands.
>>    - ContainerProtos.ContainerChecksumInfo: This is the proto format of
>> the
>>    new file that is written into the containers with the merkle tree and
>> list
>>    of deleted blocks.
>>    - ContainerMerkleTreeWriter: This class is used to build merkle trees
>>    chunk by chunk and generate a protobuf representation of the tree.
>>    - ContainerChecksumTreeManager: This class coordinates reads and writes
>>    of ContainerChecksumInfo for containers. The diff method determines
>>    which repairs should be done on a container based on a peer’s merkle
>> tree.
>>    - KeyValueContainerCheck#scanData: This is the existing method called
>> by
>>    the background and on-demand container data scanners to scan a
>> container.
>>    It has been updated to build the merkle tree as it runs.
>>    - KeyValueHandler#reconcileContainer: This method updates the container
>>    based on the peer’s replica.
>>    - Major tests for reconciliation have been added to
>>    TestContainerCommandReconciliation (integration test) and
>>    TestContainerReconciliationWithMockDatanodes (unit test with mocked
>>    clients).
>>       - There are more tasks under the reconciliation jira to expand the
>>       types of faults being tested.
>>
>> Logging
>>
>> Logging was added on the datanodes to track reconciliation as it is
>> happening. The datanode application log will print a summary of messages
>> like this:
>>
>> 2025-06-10 20:13:14,570 [main] INFO  keyvalue.KeyValueHandler
>> (KeyValueHandler.java:reconcileContainer(1595)) - Beginning
>> reconciliation for container 100 with peer
>> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Current data
>> checksum is dcce847d
>> 2025-06-10 20:13:14,589 [main] WARN  keyvalue.KeyValueHandler
>> (KeyValueHandler.java:reconcileContainer(1681)) - Container 100
>> reconciled with peer
>> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Data checksum
>> updated from dcce847d to 16189e0b.
>> Missing blocks repaired: 5/5
>> Missing chunks repaired: 0/0
>> Corrupt chunks repaired:  10/10
>> Time taken: 19 ms
>> 2025-06-10 20:13:14,589 [main] WARN  keyvalue.KeyValueHandler
>> (KeyValueHandler.java:reconcileContainer(1704)) - Completed
>> reconciliation for container 100 with 1/1 peers. 15 blocks were
>> updated. Data checksum updated from dcce847d to 16189e0b
>>
>> This shows:
>>
>>    - Reconciliation started between this datanode and one other peer for
>>    container 100
>>    - After reconciliation with the peer completed, the data checksum of
>> our
>>    container was updated
>>    - Compared to this peer, we needed to ingest 5 missing blocks and
>> repair
>>    10 corrupt chunks. All operations were successful
>>    - At the end we get a summary of how many changes were done to this
>>    container after consulting all the peers in the reconcile request. In
>> this
>>    case there was only one peer.
>>    By enabling debug logging we can see the individual blocks and chunks
>>    that were repaired as well.
>>
>> In the dn-container.log file, dataChecksum is now included for every log
>> line. We also get one new line in this log every time the checksum for a
>> container is updated.
>>
>> In case logs roll off, a debug tool to inspect container’s checksum
>> information locally on a datanode will be implemented in HDDS-13239
>> <https://issues.apache.org/jira/browse/HDDS-13239>.
>> Metrics
>>
>> The metrics for reconciliation tasks are available as a part of
>> ReplicationSupervisor class which includes:
>>
>>    - numRequestedContainerReconciliations - Number of reconciliation tasks
>>    - numQueuedContainerReconciliations - Number of queued tasks
>>    - numTimeoutContainerReconciliations - Number of timed-out tasks
>>    - numSuccessContainerReconciliations- Number of Success
>>    - numFailureContainerReconciliations - Number of Failures
>>    - numSkippedContainerReconciliations - Number of Skipped Tasks
>>
>> Latency/Count metrics for the tasks exposed by CommandHandlerMetrics for
>> ReconcileContainerCommandHandler:
>>
>>    - TotalRunTimeMs - The total runtime of the command handler in
>>    milliseconds
>>    - AvgRunTimeMs - Average run time of the command handler in
>> milliseconds
>>    - QueueWaitingTaskCount - The number of queued tasks waiting for
>>    execution
>>    - InvocationCount - The number of times the command handler has been
>>    invoked
>>    - CommandReceivedCount - The number of received SCM commands for each
>>    command type
>>
>> Other container reconciliation-related tasks are encapsulated in
>> ContainerMerkleTreeMetrics:
>>
>>    - numMerkleTreeWriteFailure - Number of Merkle tree write failure
>>    - numMerkleTreeReadFailure - Number of Merkle tree read failure
>>    - numMerkleTreeDiffFailure - Number of Merkle tree diff failure
>>    - numNoRepairContainerDiff - Number of container diff that doesn’t
>>    require repair
>>    - numRepairContainerDiff - Number of container diff that require repair
>>    - merkleTreeWriteLatencyNS- Merkle tree write latency
>>    - merkleTreeReadLatencyNS - Merkle tree read latency
>>    - merkleTreeCreateLatencyNS - Merkle tree creation latency
>>    - merkleTreeDiffLatencyNS - Merkle tree diff latency
>>
>>
>> Thanks for reviewing
>>
>> Ethan
>>
>

Re: [VOTE] Merge branch HDDS-10239-container-reconciliation into master

Reply via email to