[
https://issues.apache.org/jira/browse/HDDS-15542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ritesh Shukla updated HDDS-15542:
---------------------------------
Description:
h2. Summary
When a closed container is reconciled against a peer,
{{KeyValueHandler.reconcileChunksPerBlock}} pulls chunks from the peer to
repair the local replica. The block commit sequence ID (BCSID) is a high-water
mark: it is meant to assert that the replica holds, durably, every committed
chunk up to that sequence id. The method's own contract says the BCSID is
advanced to the peer's value only when the entire block is read and written
successfully.
The per-chunk loop can stop early in two ways:
# A chunk read or write throws {{IOException}}. This path sets the loop's
success flag to false, and the block is committed without overwriting the
BCSID. Correct.
# The loop reaches a chunk whose preceding chunk is missing locally (a hole).
Writing it would leave a gap in the block file, which is not supported, so the
loop breaks. This path does *not* clear the success flag.
Because the success flag is initialized to true and the hole path leaves it
untouched, the block is committed as if fully repaired: the BCSID is
overwritten with the peer's higher value even though the chunks past the hole
were never ingested, and the container BCSID is advanced to match. The replica
then advertises a sequence id whose backing data it does not actually hold.
h2. How the repair list develops a hole in normal operation
The list of chunks handed to {{reconcileChunksPerBlock}} is not the peer's raw
chunk list. It is the diff produced by
{{ContainerChecksumTreeManager.compareChunkMerkleTrees}}, which adds a peer
chunk to the repair list only through {{reportChunkIfHealthy}} (it is skipped
if the peer reports it as unhealthy). A peer chunk is therefore left out of the
repair list when:
* the peer's chunk at that offset is corrupt (checksum mismatch). The scanner
still records it in the merkle tree but marked unhealthy, and the diff skips it
because reconciliation must not copy corrupt data; or
* the peer's chunk at that offset is genuinely missing. The scanner omits a
missing chunk from the merkle tree entirely ({{KeyValueContainerCheck}}:
"Missing chunks should not be added to the merkle tree."), so the diff never
sees that offset.
Either way the repair list gets an interior gap. The bug then fires when the
local replica is also missing the chunk before the gap: the loop ingests up to
the gap, hits the hole break, and stops with chunks past the gap still absent.
The precise condition is an interior offset where the local replica is missing
the chunk and the peer's chunk is corrupt or missing, while the peer has a
healthy chunk at a later offset that the local replica needs. That is a genuine
multi-replica divergence (independent corruption or loss on different chunks
across replicas), which is exactly the situation reconciliation exists to
repair. It is an off-nominal but reachable state, not a pristine-cluster
assumption, and the peer is not healthy: it has a bad chunk at the gap offset
even though the chunks being pulled are healthy.
h2. Impact
SCM treats the reported BCSID as a freshness and completeness signal.
{{AbstractContainerReportHandler}} raises the container's recorded sequence id
to a healthy replica's reported value, and several SCM decisions key on that
comparison: quasi-close to force-close, and the delete-versus-resurrect choice
for DELETING/DELETED containers (a CLOSED or QUASI_CLOSED replica is deleted
when its BCSID is at or below the container's). An inflated BCSID can therefore
cause a holed replica that is missing committed data to be treated as the most
up-to-date copy, and a complete replica to be deleted as redundant.
This is a silent data-completeness defect, not a cosmetic counter drift. It is
bounded to an off-nominal but reachable schedule: the peer must be ahead in
BCSID, the repair list must contain an interior gap (as described above), the
local replica must be missing the chunk before the gap, and a later SCM
decision must act on the inflated value. It is not always-on.
h2. Fix
Treat the hole exit as a partial result, exactly like the {{IOException}} exit:
clear the success flag before breaking out of the loop. The chunks ingested
before the hole are still committed, but the BCSID is not advanced while the
block remains incomplete. The post-reconciliation scanner rebuilds the merkle
tree from what is actually on disk, and a later reconciliation round that first
fills the missing chunk can then legitimately advance the BCSID.
h2. Reproduction
A unit test drives the real {{reconcileChunksPerBlock}} against a real closed
container holding only the offset-0 chunk (BCSID 1) while a mocked peer
advertises BCSID 99 and a repair list with a hole. Before the fix the block and
container BCSID are advanced to 99; after the fix they stay at 1. The test is
included in the linked pull request and serves as the regression guard.
was:
h2. Summary
When a closed container is reconciled against a peer,
{{KeyValueHandler.reconcileChunksPerBlock}} pulls chunks from the peer to
repair the local replica. The block commit sequence ID (BCSID) is a high-water
mark: it is meant to assert that the replica holds, durably, every committed
chunk up to that sequence id. The method's own contract says the BCSID is
advanced to the peer's value only when the entire block is read and written
successfully.
The per-chunk loop can stop early in two ways:
# A chunk read or write throws {{IOException}}. This path sets the loop's
success flag to false, and the block is committed without overwriting the
BCSID. Correct.
# The loop reaches a chunk whose preceding chunk is missing locally (a hole).
Writing it would leave a gap in the block file, which is not supported, so the
loop breaks. This path does *not* clear the success flag.
Because the success flag is initialized to true and the hole path leaves it
untouched, the block is committed as if fully repaired: the BCSID is
overwritten with the peer's higher value even though the chunks past the hole
were never ingested, and the container BCSID is advanced to match. The replica
then advertises a sequence id whose backing data it does not actually hold.
h2. Why this is reachable in normal operation
A peer can legitimately advertise a chunk list with a gap. The container
scanner deliberately omits a missing chunk from the merkle tree it builds
({{KeyValueContainerCheck}}: "Missing chunks should not be added to the merkle
tree."), so a healthy peer that is itself missing one chunk in the middle of a
block advertises exactly such a gapped list. The local replica ingests up to
the hole and then stops, with chunks past the hole still absent.
h2. Impact
SCM treats the reported BCSID as a freshness and completeness signal.
{{AbstractContainerReportHandler}} raises the container's recorded sequence id
to a healthy replica's reported value, and several SCM decisions key on that
comparison: quasi-close to force-close, and the delete-versus-resurrect choice
for DELETING/DELETED containers (a CLOSED or QUASI_CLOSED replica is deleted
when its BCSID is at or below the container's). An inflated BCSID can therefore
cause a holed replica that is missing committed data to be treated as the most
up-to-date copy, and a complete replica to be deleted as redundant.
This is a silent data-completeness defect, not a cosmetic counter drift. It is
bounded to an off-nominal but reachable schedule: the peer must be ahead in
BCSID, its chunk list must contain a hole relative to what the local replica
already holds, and a later SCM decision must act on the inflated value. It is
not always-on.
h2. Fix
Treat the hole exit as a partial result, exactly like the {{IOException}} exit:
clear the success flag before breaking out of the loop. The chunks ingested
before the hole are still committed, but the BCSID is not advanced while the
block remains incomplete. The post-reconciliation scanner rebuilds the merkle
tree from what is actually on disk, and a later reconciliation round that first
fills the missing chunk can then legitimately advance the BCSID.
h2. Reproduction
A unit test drives the real {{reconcileChunksPerBlock}} against a real closed
container holding only the offset-0 chunk (BCSID 1) while a mocked peer
advertises BCSID 99 and a chunk list with a hole. Before the fix the block and
container BCSID are advanced to 99; after the fix they stay at 1. The test is
included in the linked pull request and serves as the regression guard.
> Closed-container reconciliation advances BCSID past a hole, masking missing
> chunks
> ----------------------------------------------------------------------------------
>
> Key: HDDS-15542
> URL: https://issues.apache.org/jira/browse/HDDS-15542
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Reporter: Ritesh Shukla
> Assignee: Ritesh Shukla
> Priority: Major
> Labels: pull-request-available
>
> h2. Summary
> When a closed container is reconciled against a peer,
> {{KeyValueHandler.reconcileChunksPerBlock}} pulls chunks from the peer to
> repair the local replica. The block commit sequence ID (BCSID) is a
> high-water mark: it is meant to assert that the replica holds, durably, every
> committed chunk up to that sequence id. The method's own contract says the
> BCSID is advanced to the peer's value only when the entire block is read and
> written successfully.
> The per-chunk loop can stop early in two ways:
> # A chunk read or write throws {{IOException}}. This path sets the loop's
> success flag to false, and the block is committed without overwriting the
> BCSID. Correct.
> # The loop reaches a chunk whose preceding chunk is missing locally (a hole).
> Writing it would leave a gap in the block file, which is not supported, so
> the loop breaks. This path does *not* clear the success flag.
> Because the success flag is initialized to true and the hole path leaves it
> untouched, the block is committed as if fully repaired: the BCSID is
> overwritten with the peer's higher value even though the chunks past the hole
> were never ingested, and the container BCSID is advanced to match. The
> replica then advertises a sequence id whose backing data it does not actually
> hold.
> h2. How the repair list develops a hole in normal operation
> The list of chunks handed to {{reconcileChunksPerBlock}} is not the peer's
> raw chunk list. It is the diff produced by
> {{ContainerChecksumTreeManager.compareChunkMerkleTrees}}, which adds a peer
> chunk to the repair list only through {{reportChunkIfHealthy}} (it is skipped
> if the peer reports it as unhealthy). A peer chunk is therefore left out of
> the repair list when:
> * the peer's chunk at that offset is corrupt (checksum mismatch). The scanner
> still records it in the merkle tree but marked unhealthy, and the diff skips
> it because reconciliation must not copy corrupt data; or
> * the peer's chunk at that offset is genuinely missing. The scanner omits a
> missing chunk from the merkle tree entirely ({{KeyValueContainerCheck}}:
> "Missing chunks should not be added to the merkle tree."), so the diff never
> sees that offset.
> Either way the repair list gets an interior gap. The bug then fires when the
> local replica is also missing the chunk before the gap: the loop ingests up
> to the gap, hits the hole break, and stops with chunks past the gap still
> absent. The precise condition is an interior offset where the local replica
> is missing the chunk and the peer's chunk is corrupt or missing, while the
> peer has a healthy chunk at a later offset that the local replica needs. That
> is a genuine multi-replica divergence (independent corruption or loss on
> different chunks across replicas), which is exactly the situation
> reconciliation exists to repair. It is an off-nominal but reachable state,
> not a pristine-cluster assumption, and the peer is not healthy: it has a bad
> chunk at the gap offset even though the chunks being pulled are healthy.
> h2. Impact
> SCM treats the reported BCSID as a freshness and completeness signal.
> {{AbstractContainerReportHandler}} raises the container's recorded sequence
> id to a healthy replica's reported value, and several SCM decisions key on
> that comparison: quasi-close to force-close, and the delete-versus-resurrect
> choice for DELETING/DELETED containers (a CLOSED or QUASI_CLOSED replica is
> deleted when its BCSID is at or below the container's). An inflated BCSID can
> therefore cause a holed replica that is missing committed data to be treated
> as the most up-to-date copy, and a complete replica to be deleted as
> redundant.
> This is a silent data-completeness defect, not a cosmetic counter drift. It
> is bounded to an off-nominal but reachable schedule: the peer must be ahead
> in BCSID, the repair list must contain an interior gap (as described above),
> the local replica must be missing the chunk before the gap, and a later SCM
> decision must act on the inflated value. It is not always-on.
> h2. Fix
> Treat the hole exit as a partial result, exactly like the {{IOException}}
> exit: clear the success flag before breaking out of the loop. The chunks
> ingested before the hole are still committed, but the BCSID is not advanced
> while the block remains incomplete. The post-reconciliation scanner rebuilds
> the merkle tree from what is actually on disk, and a later reconciliation
> round that first fills the missing chunk can then legitimately advance the
> BCSID.
> h2. Reproduction
> A unit test drives the real {{reconcileChunksPerBlock}} against a real closed
> container holding only the offset-0 chunk (BCSID 1) while a mocked peer
> advertises BCSID 99 and a repair list with a hole. Before the fix the block
> and container BCSID are advanced to 99; after the fix they stay at 1. The
> test is included in the linked pull request and serves as the regression
> guard.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]