[
https://issues.apache.org/jira/browse/HDDS-15542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-15542:
----------------------------------
Labels: pull-request-available (was: )
> Closed-container reconciliation advances BCSID past a hole, masking missing
> chunks
> ----------------------------------------------------------------------------------
>
> Key: HDDS-15542
> URL: https://issues.apache.org/jira/browse/HDDS-15542
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Reporter: Ritesh Shukla
> Assignee: Ritesh Shukla
> Priority: Major
> Labels: pull-request-available
>
> h2. Summary
> When a closed container is reconciled against a peer,
> {{KeyValueHandler.reconcileChunksPerBlock}} pulls chunks from the peer to
> repair the local replica. The block commit sequence ID (BCSID) is a
> high-water mark: it is meant to assert that the replica holds, durably, every
> committed chunk up to that sequence id. The method's own contract says the
> BCSID is advanced to the peer's value only when the entire block is read and
> written successfully.
> The per-chunk loop can stop early in two ways:
> # A chunk read or write throws {{IOException}}. This path sets the loop's
> success flag to false, and the block is committed without overwriting the
> BCSID. Correct.
> # The loop reaches a chunk whose preceding chunk is missing locally (a hole).
> Writing it would leave a gap in the block file, which is not supported, so
> the loop breaks. This path does *not* clear the success flag.
> Because the success flag is initialized to true and the hole path leaves it
> untouched, the block is committed as if fully repaired: the BCSID is
> overwritten with the peer's higher value even though the chunks past the hole
> were never ingested, and the container BCSID is advanced to match. The
> replica then advertises a sequence id whose backing data it does not actually
> hold.
> h2. Why this is reachable in normal operation
> A peer can legitimately advertise a chunk list with a gap. The container
> scanner deliberately omits a missing chunk from the merkle tree it builds
> ({{KeyValueContainerCheck}}: "Missing chunks should not be added to the
> merkle tree."), so a healthy peer that is itself missing one chunk in the
> middle of a block advertises exactly such a gapped list. The local replica
> ingests up to the hole and then stops, with chunks past the hole still absent.
> h2. Impact
> SCM treats the reported BCSID as a freshness and completeness signal.
> {{AbstractContainerReportHandler}} raises the container's recorded sequence
> id to a healthy replica's reported value, and several SCM decisions key on
> that comparison: quasi-close to force-close, and the delete-versus-resurrect
> choice for DELETING/DELETED containers (a CLOSED or QUASI_CLOSED replica is
> deleted when its BCSID is at or below the container's). An inflated BCSID can
> therefore cause a holed replica that is missing committed data to be treated
> as the most up-to-date copy, and a complete replica to be deleted as
> redundant.
> This is a silent data-completeness defect, not a cosmetic counter drift. It
> is bounded to an off-nominal but reachable schedule: the peer must be ahead
> in BCSID, its chunk list must contain a hole relative to what the local
> replica already holds, and a later SCM decision must act on the inflated
> value. It is not always-on.
> h2. Fix
> Treat the hole exit as a partial result, exactly like the {{IOException}}
> exit: clear the success flag before breaking out of the loop. The chunks
> ingested before the hole are still committed, but the BCSID is not advanced
> while the block remains incomplete. The post-reconciliation scanner rebuilds
> the merkle tree from what is actually on disk, and a later reconciliation
> round that first fills the missing chunk can then legitimately advance the
> BCSID.
> h2. Reproduction
> A unit test drives the real {{reconcileChunksPerBlock}} against a real closed
> container holding only the offset-0 chunk (BCSID 1) while a mocked peer
> advertises BCSID 99 and a chunk list with a hole. Before the fix the block
> and container BCSID are advanced to 99; after the fix they stay at 1. The
> test is included in the linked pull request and serves as the regression
> guard.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]