[PR] HDDS-15542. Closed-container reconciliation advances BCSID past a hole, masking missing chunks [ozone]

via GitHub Fri, 12 Jun 2026 00:13:51 -0700


kerneltime opened a new pull request, #10496:
URL: https://github.com/apache/ozone/pull/10496


   ## What changes were proposed in this pull request?
   
   When a closed container is reconciled against a peer, 
`KeyValueHandler.reconcileChunksPerBlock` pulls chunks from the peer to repair 
the local replica. The block commit sequence ID (BCSID) is a high-water mark: 
it is meant to assert that the replica holds, durably, every committed chunk up 
to that sequence id. The method's contract states the BCSID is advanced to the 
peer's value only when the entire block is read and written successfully.
   
   The per-chunk loop can stop early in two ways:
   
   1. A chunk read or write throws `IOException`. This path sets the loop's 
success flag to `false`, and the block is committed with `overwriteBscId == 
false`, so the BCSID is left alone. Correct.
   2. The loop reaches a chunk whose preceding chunk is missing locally (a 
hole). Writing it would leave a gap in the block file, which is not supported, 
so the loop breaks. This path did **not** clear the success flag.
   
   Because the success flag is initialized to `true` and the hole path left it 
untouched, the block was committed as if fully repaired: the BCSID was 
overwritten with the peer's higher value even though the chunks past the hole 
were never ingested, and the container BCSID was advanced to match. The replica 
then advertised a sequence id whose backing data it does not actually hold.
   
   This is reachable in normal operation, not only under corruption. The 
container scanner deliberately omits a missing chunk from the merkle tree it 
builds (`KeyValueContainerCheck`: "Missing chunks should not be added to the 
merkle tree."), so a healthy peer that is itself missing one chunk in the 
middle of a block advertises exactly such a gapped list. SCM treats the 
reported BCSID as a freshness/completeness signal 
(`AbstractContainerReportHandler` raises the container's recorded sequence id 
to a healthy replica's value, and the force-close and delete-versus-resurrect 
paths compare against it), so an inflated BCSID can cause a holed replica 
missing committed data to be treated as the most up-to-date copy, and a 
complete replica to be deleted as redundant.
   
   The fix treats the hole exit as a partial result, exactly like the 
`IOException` exit: it clears the success flag before breaking out of the loop. 
Chunks ingested before the hole are still committed, but the BCSID is not 
advanced while the block remains incomplete. The post-reconciliation scanner 
rebuilds the merkle tree from what is on disk, and a later reconciliation round 
that first fills the missing chunk can then advance the BCSID legitimately.
   
   To make the path testable, `reconcileChunksPerBlock` is now package-private 
(`@VisibleForTesting`) and the block input stream factory can be substituted by 
a test.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-15542
   
   ## How was this patch tested?
   
   Added `TestReconcileChunksPerBlockHoleBcsId`, a single-threaded unit test 
that exercises the real `reconcileChunksPerBlock` against a real closed 
container holding only the offset-0 chunk (BCSID 1), with a mocked peer that 
advertises BCSID 99 and a chunk merkle list `{CHUNK_LEN, 3*CHUNK_LEN}` (the 
chunk at `2*CHUNK_LEN` is omitted, so `3*CHUNK_LEN` sits past a hole). The test 
asserts that, because a hole remains, the block and container BCSID are not 
advanced to the peer value.
   
   The test fails on the unmodified code (the BCSID is advanced to 99) and 
passes with this change (it stays at 1), so it doubles as the regression guard. 
Run locally:
   
   ```
   mvn -pl hadoop-hdds/container-service test 
-Dtest='TestReconcileChunksPerBlockHoleBcsId'
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] HDDS-15542. Closed-container reconciliation advances BCSID past a hole, masking missing chunks [ozone]

Reply via email to