One more thing: how about we call all the completed work Phase 1 and the work after merge Phase 2? Then, we should update HDDS-10239 to describe what was done in Phase 1 and move the unresolved JIRAs to a new umbrella Phase 2 JIRA.
Tsz-Wo On Tue, Jun 17, 2025 at 12:17 PM Tsz-Wo Nicholas Sze <szets...@gmail.com> wrote: > +1 > Thanks for the great work! The feature is going to be very useful. Some > questions/comments/suggestions below. > > "Healthy" related: > - What does ChunkMerkleTree.isHealthy == false means? Checksum > mismatched? Missing? > - Rename ScanResult.isHealthy() to hasErrors() since all the (non-test) > calls are !isHealthy(). > * We should avoid the term "healthy". It is not clear what it means when > isHealthy==false. > It is better to have "boolean isChecksumMatched", "boolean hasError" > or enum HealthType {NO_ERROR, CHECKSUM_MISMATCHED, MISSING, IO_EXCEPTION} > > Others: > - Why is ContainerChecksumTreeManager created in both BlockDeletingService > and OzoneContainer? Should they share it? > - Use ContainerID instead of Long. > - Add javadoc for all the maps using Long. Describe what it is for (e.g. > block id). > - When converting a checksum (long) to a string, use hexadecimal. > - Avoid Optional, which is slow and generates garbage. > - Fix conflicts with the master. > - Use proto 3 for new protos (but it may be hard to do.) > > Tsz-Wo > > > On Tue, Jun 10, 2025 at 5:57 PM Ethan Rose <er...@apache.org> wrote: > >> Based on discussion in the community sync this week I want to add some >> more >> information. >> Code >> >> For those interested in checking out the code, these are some of the major >> classes to start with: >> >> - ReconcileContainerTask: This is the command on the datanode that is >> received from SCM to reconcile a container with a datanode’s peers. It >> passes through the ReplicationSupervisor just like replication and >> reconstruction commands. >> - ContainerProtos.ContainerChecksumInfo: This is the proto format of >> the >> new file that is written into the containers with the merkle tree and >> list >> of deleted blocks. >> - ContainerMerkleTreeWriter: This class is used to build merkle trees >> chunk by chunk and generate a protobuf representation of the tree. >> - ContainerChecksumTreeManager: This class coordinates reads and writes >> of ContainerChecksumInfo for containers. The diff method determines >> which repairs should be done on a container based on a peer’s merkle >> tree. >> - KeyValueContainerCheck#scanData: This is the existing method called >> by >> the background and on-demand container data scanners to scan a >> container. >> It has been updated to build the merkle tree as it runs. >> - KeyValueHandler#reconcileContainer: This method updates the container >> based on the peer’s replica. >> - Major tests for reconciliation have been added to >> TestContainerCommandReconciliation (integration test) and >> TestContainerReconciliationWithMockDatanodes (unit test with mocked >> clients). >> - There are more tasks under the reconciliation jira to expand the >> types of faults being tested. >> >> Logging >> >> Logging was added on the datanodes to track reconciliation as it is >> happening. The datanode application log will print a summary of messages >> like this: >> >> 2025-06-10 20:13:14,570 [main] INFO keyvalue.KeyValueHandler >> (KeyValueHandler.java:reconcileContainer(1595)) - Beginning >> reconciliation for container 100 with peer >> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Current data >> checksum is dcce847d >> 2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler >> (KeyValueHandler.java:reconcileContainer(1681)) - Container 100 >> reconciled with peer >> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Data checksum >> updated from dcce847d to 16189e0b. >> Missing blocks repaired: 5/5 >> Missing chunks repaired: 0/0 >> Corrupt chunks repaired: 10/10 >> Time taken: 19 ms >> 2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler >> (KeyValueHandler.java:reconcileContainer(1704)) - Completed >> reconciliation for container 100 with 1/1 peers. 15 blocks were >> updated. Data checksum updated from dcce847d to 16189e0b >> >> This shows: >> >> - Reconciliation started between this datanode and one other peer for >> container 100 >> - After reconciliation with the peer completed, the data checksum of >> our >> container was updated >> - Compared to this peer, we needed to ingest 5 missing blocks and >> repair >> 10 corrupt chunks. All operations were successful >> - At the end we get a summary of how many changes were done to this >> container after consulting all the peers in the reconcile request. In >> this >> case there was only one peer. >> By enabling debug logging we can see the individual blocks and chunks >> that were repaired as well. >> >> In the dn-container.log file, dataChecksum is now included for every log >> line. We also get one new line in this log every time the checksum for a >> container is updated. >> >> In case logs roll off, a debug tool to inspect container’s checksum >> information locally on a datanode will be implemented in HDDS-13239 >> <https://issues.apache.org/jira/browse/HDDS-13239>. >> Metrics >> >> The metrics for reconciliation tasks are available as a part of >> ReplicationSupervisor class which includes: >> >> - numRequestedContainerReconciliations - Number of reconciliation tasks >> - numQueuedContainerReconciliations - Number of queued tasks >> - numTimeoutContainerReconciliations - Number of timed-out tasks >> - numSuccessContainerReconciliations- Number of Success >> - numFailureContainerReconciliations - Number of Failures >> - numSkippedContainerReconciliations - Number of Skipped Tasks >> >> Latency/Count metrics for the tasks exposed by CommandHandlerMetrics for >> ReconcileContainerCommandHandler: >> >> - TotalRunTimeMs - The total runtime of the command handler in >> milliseconds >> - AvgRunTimeMs - Average run time of the command handler in >> milliseconds >> - QueueWaitingTaskCount - The number of queued tasks waiting for >> execution >> - InvocationCount - The number of times the command handler has been >> invoked >> - CommandReceivedCount - The number of received SCM commands for each >> command type >> >> Other container reconciliation-related tasks are encapsulated in >> ContainerMerkleTreeMetrics: >> >> - numMerkleTreeWriteFailure - Number of Merkle tree write failure >> - numMerkleTreeReadFailure - Number of Merkle tree read failure >> - numMerkleTreeDiffFailure - Number of Merkle tree diff failure >> - numNoRepairContainerDiff - Number of container diff that doesn’t >> require repair >> - numRepairContainerDiff - Number of container diff that require repair >> - merkleTreeWriteLatencyNS- Merkle tree write latency >> - merkleTreeReadLatencyNS - Merkle tree read latency >> - merkleTreeCreateLatencyNS - Merkle tree creation latency >> - merkleTreeDiffLatencyNS - Merkle tree diff latency >> >> >> Thanks for reviewing >> >> Ethan >> >