Ritesh Shukla created HDDS-15542:
------------------------------------

             Summary: Closed-container reconciliation advances BCSID past a 
hole, masking missing chunks
                 Key: HDDS-15542
                 URL: https://issues.apache.org/jira/browse/HDDS-15542
             Project: Apache Ozone
          Issue Type: Bug
          Components: Ozone Datanode
            Reporter: Ritesh Shukla
            Assignee: Ritesh Shukla


h2. Summary

When a closed container is reconciled against a peer, 
{{KeyValueHandler.reconcileChunksPerBlock}} pulls chunks from the peer to 
repair the local replica. The block commit sequence ID (BCSID) is a high-water 
mark: it is meant to assert that the replica durably holds every committed 
chunk up to that sequence ID. The method's own contract says the BCSID is 
advanced to the peer's value only when the entire block is read and written 
successfully.

The per-chunk loop can stop early in two ways:

# A chunk read or write throws {{IOException}}. This path sets the loop's 
success flag to false, and the block is committed without overwriting the 
BCSID. Correct.
# The loop reaches a chunk whose preceding chunk is missing locally (a hole). 
Writing it would leave a gap in the block file, which is not supported, so the 
loop breaks. This path does *not* clear the success flag.

Because the success flag is initialized to true and the hole path leaves it 
untouched, the block is committed as if fully repaired: the BCSID is 
overwritten with the peer's higher value even though the chunks past the hole 
were never ingested, and the container BCSID is advanced to match. The replica 
then advertises a sequence ID whose backing data it does not actually hold.
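
The two exits can be illustrated with a simplified, self-contained model of the 
loop. This is a sketch, not the actual {{KeyValueHandler}} code: the class, 
method, and parameter names are hypothetical, chunks are identified by offset 
only, and a fixed chunk size is assumed so that a hole shows up as a missing 
entry at {{offset - chunkSize}}. The model also assumes the peer is ahead in 
BCSID.

```java
import java.io.IOException;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Simplified model of the buggy per-chunk loop. All names are hypothetical. */
final class BuggyReconcileSketch {

  /** Stand-in for "read the chunk from the peer and write it locally". */
  interface ChunkCopier {
    void copy(long offset) throws IOException;
  }

  /**
   * Ingests peer chunks into {@code held} (mutated in place) and returns the
   * BCSID that would be committed for the block.
   */
  static long reconcile(Set<Long> held, long localBcsId, List<Long> peerOffsets,
      long peerBcsId, long chunkSize, ChunkCopier copier) {
    boolean allChunksSuccessful = true;
    for (long offset : peerOffsets) {
      if (held.contains(offset)) {
        continue; // chunk is already present locally
      }
      if (offset > 0 && !held.contains(offset - chunkSize)) {
        // Hole: writing this chunk would leave a gap in the block file.
        // Behavior being modeled: the success flag is left untouched.
        break;
      }
      try {
        copier.copy(offset);
        held.add(offset);
      } catch (IOException e) {
        allChunksSuccessful = false; // partial result, BCSID must not advance
        break;
      }
    }
    // The peer is assumed to be ahead, so "advance" means "take the peer's value".
    return allChunksSuccessful ? peerBcsId : localBcsId;
  }

  public static void main(String[] args) {
    Set<Long> held = new TreeSet<>(Set.of(0L));
    // Peer advertises offsets 0 and 8 with chunk size 4: offset 4 is the hole.
    long committed = reconcile(held, 1L, List.of(0L, 8L), 99L, 4L, offset -> { });
    System.out.println("committed BCSID = " + committed + ", held offsets = " + held);
  }
}
```

In this model the hole exit returns the peer's BCSID (99) while the replica 
still holds only the offset-0 chunk, whereas a copier that throws 
{{IOException}} leaves the BCSID at its local value.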

h2. Why this is reachable in normal operation

A peer can legitimately advertise a chunk list with a gap. The container 
scanner deliberately omits a missing chunk from the merkle tree it builds 
({{KeyValueContainerCheck}}: "Missing chunks should not be added to the merkle 
tree."), so a healthy peer that is itself missing one chunk in the middle of a 
block advertises exactly such a gapped list. The local replica ingests up to 
the hole and then stops, with chunks past the hole still absent.
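
A minimal sketch of how such a gapped list arises, assuming a scanner that 
lists a chunk only when it is present on disk. The class and method names are 
hypothetical and do not correspond to the {{KeyValueContainerCheck}} API; 
fixed-size chunks identified by offset are an assumption of the model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Model of a scanner that omits missing chunks. All names are hypothetical. */
final class GappedChunkListSketch {

  /** Returns the offsets that would be advertised: expected chunks minus missing ones. */
  static List<Long> advertisedOffsets(int chunkCount, long chunkSize, Set<Long> onDisk) {
    List<Long> advertised = new ArrayList<>();
    for (int i = 0; i < chunkCount; i++) {
      long offset = i * chunkSize;
      if (onDisk.contains(offset)) {
        advertised.add(offset); // missing chunks are simply skipped
      }
    }
    return advertised;
  }

  /** Returns the first missing offset inside an ascending list, or -1 if it is contiguous. */
  static long firstHole(List<Long> advertised, long chunkSize) {
    long expected = 0;
    for (long offset : advertised) {
      if (offset != expected) {
        return expected;
      }
      expected = offset + chunkSize;
    }
    return -1;
  }

  public static void main(String[] args) {
    // Three expected chunks of size 4; the middle one (offset 4) is missing on disk.
    List<Long> advertised = advertisedOffsets(3, 4L, Set.of(0L, 8L));
    System.out.println("advertised = " + advertised
        + ", first hole at offset " + firstHole(advertised, 4L));
  }
}
```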

h2. Impact

SCM treats the reported BCSID as a freshness and completeness signal. 
{{AbstractContainerReportHandler}} raises the container's recorded sequence ID 
to a healthy replica's reported value, and several SCM decisions depend on 
comparing a replica's BCSID with that recorded value: the transition from 
quasi-closed to force-closed, and the delete-versus-resurrect choice for 
DELETING/DELETED containers (a CLOSED or QUASI_CLOSED replica is deleted when 
its BCSID is at or below the container's). An inflated BCSID can therefore 
cause a holed replica that is missing committed data to be treated as the most 
up-to-date copy, and a complete replica to be deleted as redundant.

This is a silent data-completeness defect, not a cosmetic counter drift. It is 
not always-on: it requires an off-nominal but reachable sequence of events. 
The peer must be ahead in BCSID, its chunk list must contain a hole relative 
to what the local replica already holds, and a later SCM decision must act on 
the inflated value.

h2. Fix

Treat the hole exit as a partial result, exactly like the {{IOException}} exit: 
clear the success flag before breaking out of the loop. The chunks ingested 
before the hole are still committed, but the BCSID is not advanced while the 
block remains incomplete. The post-reconciliation scanner rebuilds the merkle 
tree from what is actually on disk, and a later reconciliation round that first 
fills the missing chunk can then legitimately advance the BCSID.
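
The intended behavior can be sketched with a simplified model of the loop. 
This is not the patch itself: the class, method, and parameter names are 
hypothetical, chunk copies are assumed to always succeed, chunks are 
fixed-size and identified by offset, and the second round assumes a peer that 
advertises a complete chunk list.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Simplified model of the loop with the proposed fix. All names are hypothetical. */
final class FixedReconcileSketch {

  /**
   * Ingests peer chunks into {@code held} (mutated in place) and returns the
   * BCSID that would be committed for the block.
   */
  static long reconcile(Set<Long> held, long localBcsId, List<Long> peerOffsets,
      long peerBcsId, long chunkSize) {
    boolean allChunksSuccessful = true;
    for (long offset : peerOffsets) {
      if (held.contains(offset)) {
        continue; // chunk is already present locally
      }
      if (offset > 0 && !held.contains(offset - chunkSize)) {
        allChunksSuccessful = false; // the fix: a hole is a partial result
        break;
      }
      held.add(offset); // the copy is assumed to succeed in this model
    }
    return allChunksSuccessful ? peerBcsId : localBcsId;
  }

  public static void main(String[] args) {
    Set<Long> held = new TreeSet<>(Set.of(0L));

    // Round 1: the peer's list skips offset 8, so ingestion stops after offset 4.
    long bcsId = reconcile(held, 1L, List.of(0L, 4L, 12L), 99L, 4L);
    System.out.println("round 1: BCSID = " + bcsId + ", held = " + held);

    // Round 2: a peer with the complete list fills the gap, then the BCSID advances.
    bcsId = reconcile(held, bcsId, List.of(0L, 4L, 8L, 12L), 99L, 4L);
    System.out.println("round 2: BCSID = " + bcsId + ", held = " + held);
  }
}
```

In this model round 1 keeps the BCSID at 1 while retaining the chunk ingested 
before the hole (offsets 0 and 4), and round 2 advances it to 99 only after 
offsets 8 and 12 are present.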

h2. Reproduction

A unit test drives the real {{reconcileChunksPerBlock}} against a real closed 
container holding only the offset-0 chunk (BCSID 1) while a mocked peer 
advertises BCSID 99 and a chunk list with a hole. Before the fix the block and 
container BCSID are advanced to 99; after the fix they stay at 1. The test is 
included in the linked pull request and serves as the regression guard.



