[ 
https://issues.apache.org/jira/browse/HDDS-15542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-15542:
----------------------------------
    Labels: pull-request-available  (was: )

> Closed-container reconciliation advances BCSID past a hole, masking missing 
> chunks
> ----------------------------------------------------------------------------------
>
>                 Key: HDDS-15542
>                 URL: https://issues.apache.org/jira/browse/HDDS-15542
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>            Reporter: Ritesh Shukla
>            Assignee: Ritesh Shukla
>            Priority: Major
>              Labels: pull-request-available
>
> h2. Summary
> When a closed container is reconciled against a peer, 
> {{KeyValueHandler.reconcileChunksPerBlock}} pulls chunks from the peer to 
> repair the local replica. The block commit sequence ID (BCSID) is a 
> high-water mark: it is meant to assert that the replica holds, durably, every 
> committed chunk up to that sequence id. The method's own contract says the 
> BCSID is advanced to the peer's value only when the entire block is read and 
> written successfully.
> The per-chunk loop can stop early in two ways:
> # A chunk read or write throws {{IOException}}. This path sets the loop's 
> success flag to false, and the block is committed without overwriting the 
> BCSID. Correct.
> # The loop reaches a chunk whose preceding chunk is missing locally (a hole). 
> Writing it would leave a gap in the block file, which is not supported, so 
> the loop breaks. This path does *not* clear the success flag.
> Because the success flag is initialized to true and the hole path leaves it 
> untouched, the block is committed as if fully repaired: the BCSID is 
> overwritten with the peer's higher value even though the chunks past the hole 
> were never ingested, and the container BCSID is advanced to match. The 
> replica then advertises a sequence id whose backing data it does not actually 
> hold.
> h2. Why this is reachable in normal operation
> A peer can legitimately advertise a chunk list with a gap. The container 
> scanner deliberately omits a missing chunk from the merkle tree it builds 
> ({{KeyValueContainerCheck}}: "Missing chunks should not be added to the 
> merkle tree."), so a healthy peer that is itself missing one chunk in the 
> middle of a block advertises exactly such a gapped list. The local replica 
> ingests up to the hole and then stops, with chunks past the hole still absent.
> h2. Impact
> SCM treats the reported BCSID as a freshness and completeness signal. 
> {{AbstractContainerReportHandler}} raises the container's recorded sequence 
> id to a healthy replica's reported value, and several SCM decisions compare a 
> replica's BCSID against that recorded value: the quasi-closed to force-close 
> transition, and the delete-versus-resurrect choice for DELETING/DELETED 
> containers (a CLOSED or QUASI_CLOSED replica is deleted when its BCSID is at 
> or below the container's). An inflated BCSID can therefore cause a holed 
> replica that is missing committed data to be treated as the most up-to-date 
> copy, and a complete replica to be deleted as redundant.
> This is a silent data-completeness defect, not a cosmetic counter drift. It 
> is bounded to an off-nominal but reachable schedule: the peer must be ahead 
> in BCSID, its chunk list must contain a hole relative to what the local 
> replica already holds, and a later SCM decision must act on the inflated 
> value. It is not always-on.
> h2. Fix
> Treat the hole exit as a partial result, exactly like the {{IOException}} 
> exit: clear the success flag before breaking out of the loop. The chunks 
> ingested before the hole are still committed, but the BCSID is not advanced 
> while the block remains incomplete. The post-reconciliation scanner rebuilds 
> the merkle tree from what is actually on disk, and a later reconciliation 
> round that first fills the missing chunk can then legitimately advance the 
> BCSID.
> h2. Reproduction
> A unit test drives the real {{reconcileChunksPerBlock}} against a real closed 
> container holding only the offset-0 chunk (BCSID 1) while a mocked peer 
> advertises BCSID 99 and a chunk list with a hole. Before the fix the block 
> and container BCSID are advanced to 99; after the fix they stay at 1. The 
> test is included in the linked pull request and serves as the regression 
> guard.
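
The control flow described above can be illustrated with a minimal, standalone sketch. It is not the Ozone implementation: the class ReconcileSketch, its nested types, the success flag name, and the clearFlagOnHole switch are all hypothetical and exist only to contrast the behavior before and after the fix. The 4-byte chunk size is likewise an assumption, and the IOException exit is left out because it already behaves correctly.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the per-chunk reconcile loop; not Ozone code.
final class ReconcileSketch {

    /** A chunk advertised by the peer: byte offset in the block file plus length. */
    static final class Chunk {
        final long offset;
        final long length;

        Chunk(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Assumed shape of the local replica's state for one block. */
    static final class LocalBlock {
        long contiguousLength; // bytes held from offset 0 with no gap
        long bcsId;

        LocalBlock(long contiguousLength, long bcsId) {
            this.contiguousLength = contiguousLength;
            this.bcsId = bcsId;
        }
    }

    /**
     * Ingests peer chunks in order and stops at the first hole.
     * clearFlagOnHole = false models the behavior before the fix,
     * clearFlagOnHole = true models the behavior after it.
     */
    static void reconcile(LocalBlock local, List<Chunk> peerChunks,
                          long peerBcsId, boolean clearFlagOnHole) {
        boolean success = true;
        for (Chunk chunk : peerChunks) {
            long end = chunk.offset + chunk.length;
            if (end <= local.contiguousLength) {
                continue; // already held locally
            }
            if (chunk.offset > local.contiguousLength) {
                // The preceding chunk is missing: writing this one would leave a gap.
                if (clearFlagOnHole) {
                    success = false; // treat the hole exit as a partial result
                }
                break;
            }
            local.contiguousLength = end; // stands in for the read from peer + local write
        }
        // The block is committed either way; only a complete repair may advance the BCSID.
        if (success) {
            local.bcsId = Math.max(local.bcsId, peerBcsId);
        }
    }

    public static void main(String[] args) {
        // Peer advertises offsets 0 and 8; the chunk at offset 4 is the hole.
        List<Chunk> gapped = Arrays.asList(new Chunk(0, 4), new Chunk(8, 4));

        LocalBlock beforeFix = new LocalBlock(4, 1);
        reconcile(beforeFix, gapped, 99, false);
        System.out.println("without fix: bcsId=" + beforeFix.bcsId); // without fix: bcsId=99

        LocalBlock afterFix = new LocalBlock(4, 1);
        reconcile(afterFix, gapped, 99, true);
        System.out.println("with fix: bcsId=" + afterFix.bcsId); // with fix: bcsId=1
    }
}
```

In both runs the local block still holds only the offset-0 chunk; the difference is solely whether the BCSID claims more than that. With a gap-free chunk list the flag is never cleared, so the BCSID advances to the peer's value in either mode.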



--
This message was sent by Atlassian Jira
(v8.20.10#820010)