[ 
https://issues.apache.org/jira/browse/HDDS-15542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ritesh Shukla updated HDDS-15542:
---------------------------------
    Description: 
h2. Summary

When a closed container is reconciled against a peer, 
{{KeyValueHandler.reconcileChunksPerBlock}} pulls chunks from the peer to 
repair the local replica. The block commit sequence ID (BCSID) is a high-water 
mark: it is meant to assert that the replica holds, durably, every committed 
chunk up to that sequence id. The method's own contract says the BCSID is 
advanced to the peer's value only when the entire block is read and written 
successfully.

The per-chunk loop can stop early in two ways:

# A chunk read or write throws {{IOException}}. This path sets the loop's 
success flag to false, and the block is committed without overwriting the 
BCSID. Correct.
# The loop reaches a chunk whose preceding chunk is missing locally (a hole). 
Writing it would leave a gap in the block file, which is not supported, so the 
loop breaks. This path does *not* clear the success flag.

Because the success flag is initialized to true and the hole path leaves it 
untouched, the block is committed as if fully repaired: the BCSID is 
overwritten with the peer's higher value even though the chunks past the hole 
were never ingested, and the container BCSID is advanced to match. The replica 
then advertises a sequence id whose backing data it does not actually hold.

h2. How the repair list develops a hole in normal operation

The list of chunks handed to {{reconcileChunksPerBlock}} is not the peer's raw 
chunk list. It is the diff produced by 
{{ContainerChecksumTreeManager.compareChunkMerkleTrees}}, which adds a peer 
chunk to the repair list only through {{reportChunkIfHealthy}} (it is skipped 
if the peer reports it as unhealthy). A peer chunk is therefore left out of the 
repair list when:

* the peer's chunk at that offset is corrupt (checksum mismatch). The scanner 
still records it in the merkle tree but marked unhealthy, and the diff skips it 
because reconciliation must not copy corrupt data; or
* the peer's chunk at that offset is genuinely missing. The scanner omits a 
missing chunk from the merkle tree entirely ({{KeyValueContainerCheck}}: 
"Missing chunks should not be added to the merkle tree."), so the diff never 
sees that offset.

Either way the repair list gets an interior gap. The bug then fires when the 
local replica is also missing the chunk before the gap: the loop ingests up to 
the gap, hits the hole break, and stops with chunks past the gap still absent. 
The precise condition is an interior offset where the local replica is missing 
the chunk and the peer's chunk is corrupt or missing, while the peer has a 
healthy chunk at a later offset that the local replica needs. That is a genuine 
multi-replica divergence (independent corruption or loss on different chunks 
across replicas), which is exactly the situation reconciliation exists to 
repair. It is an off-nominal but reachable state, not a pristine-cluster 
assumption, and the peer is not healthy: it has a bad chunk at the gap offset 
even though the chunks being pulled are healthy.

h2. Impact

SCM treats the reported BCSID as a freshness and completeness signal. 
{{AbstractContainerReportHandler}} raises the container's recorded sequence id 
to a healthy replica's reported value, and several SCM decisions key on that 
comparison: quasi-close to force-close, and the delete-versus-resurrect choice 
for DELETING/DELETED containers (a CLOSED or QUASI_CLOSED replica is deleted 
when its BCSID is at or below the container's). An inflated BCSID can therefore 
cause a holed replica that is missing committed data to be treated as the most 
up-to-date copy, and a complete replica to be deleted as redundant.

This is a silent data-completeness defect, not a cosmetic counter drift. It is 
bounded to an off-nominal but reachable schedule: the peer must be ahead in 
BCSID, the repair list must contain an interior gap (as described above), the 
local replica must be missing the chunk before the gap, and a later SCM 
decision must act on the inflated value. It is not always-on.

h2. Fix

Treat the hole exit as a partial result, exactly like the {{IOException}} exit: 
clear the success flag before breaking out of the loop. The chunks ingested 
before the hole are still committed, but the BCSID is not advanced while the 
block remains incomplete. The post-reconciliation scanner rebuilds the merkle 
tree from what is actually on disk, and a later reconciliation round that first 
fills the missing chunk can then legitimately advance the BCSID.

h2. Reproduction

A unit test drives the real {{reconcileChunksPerBlock}} against a real closed 
container holding only the offset-0 chunk (BCSID 1) while a mocked peer 
advertises BCSID 99 and a repair list with a hole. Before the fix the block and 
container BCSID are advanced to 99; after the fix they stay at 1. The test is 
included in the linked pull request and serves as the regression guard.

  was:
h2. Summary

When a closed container is reconciled against a peer, 
{{KeyValueHandler.reconcileChunksPerBlock}} pulls chunks from the peer to 
repair the local replica. The block commit sequence ID (BCSID) is a high-water 
mark: it is meant to assert that the replica holds, durably, every committed 
chunk up to that sequence id. The method's own contract says the BCSID is 
advanced to the peer's value only when the entire block is read and written 
successfully.

The per-chunk loop can stop early in two ways:

# A chunk read or write throws {{IOException}}. This path sets the loop's 
success flag to false, and the block is committed without overwriting the 
BCSID. Correct.
# The loop reaches a chunk whose preceding chunk is missing locally (a hole). 
Writing it would leave a gap in the block file, which is not supported, so the 
loop breaks. This path does *not* clear the success flag.

Because the success flag is initialized to true and the hole path leaves it 
untouched, the block is committed as if fully repaired: the BCSID is 
overwritten with the peer's higher value even though the chunks past the hole 
were never ingested, and the container BCSID is advanced to match. The replica 
then advertises a sequence id whose backing data it does not actually hold.

h2. Why this is reachable in normal operation

A peer can legitimately advertise a chunk list with a gap. The container 
scanner deliberately omits a missing chunk from the merkle tree it builds 
({{KeyValueContainerCheck}}: "Missing chunks should not be added to the merkle 
tree."), so a healthy peer that is itself missing one chunk in the middle of a 
block advertises exactly such a gapped list. The local replica ingests up to 
the hole and then stops, with chunks past the hole still absent.

h2. Impact

SCM treats the reported BCSID as a freshness and completeness signal. 
{{AbstractContainerReportHandler}} raises the container's recorded sequence id 
to a healthy replica's reported value, and several SCM decisions key on that 
comparison: quasi-close to force-close, and the delete-versus-resurrect choice 
for DELETING/DELETED containers (a CLOSED or QUASI_CLOSED replica is deleted 
when its BCSID is at or below the container's). An inflated BCSID can therefore 
cause a holed replica that is missing committed data to be treated as the most 
up-to-date copy, and a complete replica to be deleted as redundant.

This is a silent data-completeness defect, not a cosmetic counter drift. It is 
bounded to an off-nominal but reachable schedule: the peer must be ahead in 
BCSID, its chunk list must contain a hole relative to what the local replica 
already holds, and a later SCM decision must act on the inflated value. It is 
not always-on.

h2. Fix

Treat the hole exit as a partial result, exactly like the {{IOException}} exit: 
clear the success flag before breaking out of the loop. The chunks ingested 
before the hole are still committed, but the BCSID is not advanced while the 
block remains incomplete. The post-reconciliation scanner rebuilds the merkle 
tree from what is actually on disk, and a later reconciliation round that first 
fills the missing chunk can then legitimately advance the BCSID.

h2. Reproduction

A unit test drives the real {{reconcileChunksPerBlock}} against a real closed 
container holding only the offset-0 chunk (BCSID 1) while a mocked peer 
advertises BCSID 99 and a chunk list with a hole. Before the fix the block and 
container BCSID are advanced to 99; after the fix they stay at 1. The test is 
included in the linked pull request and serves as the regression guard.


> Closed-container reconciliation advances BCSID past a hole, masking missing 
> chunks
> ----------------------------------------------------------------------------------
>
>                 Key: HDDS-15542
>                 URL: https://issues.apache.org/jira/browse/HDDS-15542
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>            Reporter: Ritesh Shukla
>            Assignee: Ritesh Shukla
>            Priority: Major
>              Labels: pull-request-available
>
> h2. Summary
> When a closed container is reconciled against a peer, 
> {{KeyValueHandler.reconcileChunksPerBlock}} pulls chunks from the peer to 
> repair the local replica. The block commit sequence ID (BCSID) is a 
> high-water mark: it is meant to assert that the replica holds, durably, every 
> committed chunk up to that sequence id. The method's own contract says the 
> BCSID is advanced to the peer's value only when the entire block is read and 
> written successfully.
> The per-chunk loop can stop early in two ways:
> # A chunk read or write throws {{IOException}}. This path sets the loop's 
> success flag to false, and the block is committed without overwriting the 
> BCSID. Correct.
> # The loop reaches a chunk whose preceding chunk is missing locally (a hole). 
> Writing it would leave a gap in the block file, which is not supported, so 
> the loop breaks. This path does *not* clear the success flag.
> Because the success flag is initialized to true and the hole path leaves it 
> untouched, the block is committed as if fully repaired: the BCSID is 
> overwritten with the peer's higher value even though the chunks past the hole 
> were never ingested, and the container BCSID is advanced to match. The 
> replica then advertises a sequence id whose backing data it does not actually 
> hold.
> h2. How the repair list develops a hole in normal operation
> The list of chunks handed to {{reconcileChunksPerBlock}} is not the peer's 
> raw chunk list. It is the diff produced by 
> {{ContainerChecksumTreeManager.compareChunkMerkleTrees}}, which adds a peer 
> chunk to the repair list only through {{reportChunkIfHealthy}} (it is skipped 
> if the peer reports it as unhealthy). A peer chunk is therefore left out of 
> the repair list when:
> * the peer's chunk at that offset is corrupt (checksum mismatch). The scanner 
> still records it in the merkle tree but marked unhealthy, and the diff skips 
> it because reconciliation must not copy corrupt data; or
> * the peer's chunk at that offset is genuinely missing. The scanner omits a 
> missing chunk from the merkle tree entirely ({{KeyValueContainerCheck}}: 
> "Missing chunks should not be added to the merkle tree."), so the diff never 
> sees that offset.
> Either way the repair list gets an interior gap. The bug then fires when the 
> local replica is also missing the chunk before the gap: the loop ingests up 
> to the gap, hits the hole break, and stops with chunks past the gap still 
> absent. The precise condition is an interior offset where the local replica 
> is missing the chunk and the peer's chunk is corrupt or missing, while the 
> peer has a healthy chunk at a later offset that the local replica needs. That 
> is a genuine multi-replica divergence (independent corruption or loss on 
> different chunks across replicas), which is exactly the situation 
> reconciliation exists to repair. It is an off-nominal but reachable state, 
> not a pristine-cluster assumption, and the peer is not healthy: it has a bad 
> chunk at the gap offset even though the chunks being pulled are healthy.
> h2. Impact
> SCM treats the reported BCSID as a freshness and completeness signal. 
> {{AbstractContainerReportHandler}} raises the container's recorded sequence 
> id to a healthy replica's reported value, and several SCM decisions key on 
> that comparison: quasi-close to force-close, and the delete-versus-resurrect 
> choice for DELETING/DELETED containers (a CLOSED or QUASI_CLOSED replica is 
> deleted when its BCSID is at or below the container's). An inflated BCSID can 
> therefore cause a holed replica that is missing committed data to be treated 
> as the most up-to-date copy, and a complete replica to be deleted as 
> redundant.
> This is a silent data-completeness defect, not a cosmetic counter drift. It 
> is bounded to an off-nominal but reachable schedule: the peer must be ahead 
> in BCSID, the repair list must contain an interior gap (as described above), 
> the local replica must be missing the chunk before the gap, and a later SCM 
> decision must act on the inflated value. It is not always-on.
> h2. Fix
> Treat the hole exit as a partial result, exactly like the {{IOException}} 
> exit: clear the success flag before breaking out of the loop. The chunks 
> ingested before the hole are still committed, but the BCSID is not advanced 
> while the block remains incomplete. The post-reconciliation scanner rebuilds 
> the merkle tree from what is actually on disk, and a later reconciliation 
> round that first fills the missing chunk can then legitimately advance the 
> BCSID.
> h2. Reproduction
> A unit test drives the real {{reconcileChunksPerBlock}} against a real closed 
> container holding only the offset-0 chunk (BCSID 1) while a mocked peer 
> advertises BCSID 99 and a repair list with a hole. Before the fix the block 
> and container BCSID are advanced to 99; after the fix they stay at 1. The 
> test is included in the linked pull request and serves as the regression 
> guard.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to