Ritesh Shukla created HDDS-15542:
------------------------------------
Summary: Closed-container reconciliation advances BCSID past a
hole, masking missing chunks
Key: HDDS-15542
URL: https://issues.apache.org/jira/browse/HDDS-15542
Project: Apache Ozone
Issue Type: Bug
Components: Ozone Datanode
Reporter: Ritesh Shukla
Assignee: Ritesh Shukla
h2. Summary
When a closed container is reconciled against a peer,
{{KeyValueHandler.reconcileChunksPerBlock}} pulls chunks from the peer to
repair the local replica. The block commit sequence ID (BCSID) is a high-water
mark: it is meant to assert that the replica holds, durably, every committed
chunk up to that sequence id. The method's own contract says the BCSID is
advanced to the peer's value only when the entire block is read and written
successfully.
The per-chunk loop can stop early in two ways:
# A chunk read or write throws {{IOException}}. This path sets the loop's
success flag to false, and the block is committed without overwriting the
BCSID. Correct.
# The loop reaches a chunk whose preceding chunk is missing locally (a hole).
Writing it would leave a gap in the block file, which is not supported, so the
loop breaks. This path does *not* clear the success flag.
Because the success flag is initialized to true and the hole path leaves it
untouched, the block is committed as if fully repaired: the BCSID is
overwritten with the peer's higher value even though the chunks past the hole
were never ingested, and the container BCSID is advanced to match. The replica
then advertises a sequence id whose backing data it does not actually hold.
h2. Why this is reachable in normal operation
A peer can legitimately advertise a chunk list with a gap. The container
scanner deliberately omits a missing chunk from the merkle tree it builds
({{KeyValueContainerCheck}}: "Missing chunks should not be added to the merkle
tree."), so a healthy peer that is itself missing one chunk in the middle of a
block advertises exactly such a gapped list. The local replica ingests up to
the hole and then stops, with chunks past the hole still absent.
h2. Impact
SCM treats the reported BCSID as a freshness and completeness signal.
{{AbstractContainerReportHandler}} raises the container's recorded sequence id
to a healthy replica's reported value, and several SCM decisions key on that
comparison: quasi-close to force-close, and the delete-versus-resurrect choice
for DELETING/DELETED containers (a CLOSED or QUASI_CLOSED replica is deleted
when its BCSID is at or below the container's). An inflated BCSID can therefore
cause a holed replica that is missing committed data to be treated as the most
up-to-date copy, and a complete replica to be deleted as redundant.
This is a silent data-completeness defect, not a cosmetic counter drift. It is
bounded to an off-nominal but reachable schedule: the peer must be ahead in
BCSID, its chunk list must contain a hole relative to what the local replica
already holds, and a later SCM decision must act on the inflated value. It is
not always-on.
h2. Fix
Treat the hole exit as a partial result, exactly like the {{IOException}} exit:
clear the success flag before breaking out of the loop. The chunks ingested
before the hole are still committed, but the BCSID is not advanced while the
block remains incomplete. The post-reconciliation scanner rebuilds the merkle
tree from what is actually on disk, and a later reconciliation round that first
fills the missing chunk can then legitimately advance the BCSID.
h2. Reproduction
A unit test drives the real {{reconcileChunksPerBlock}} against a real closed
container holding only the offset-0 chunk (BCSID 1) while a mocked peer
advertises BCSID 99 and a chunk list with a hole. Before the fix the block and
container BCSID are advanced to 99; after the fix they stay at 1. The test is
included in the linked pull request and serves as the regression guard.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]