Chu Cheng Li created HDDS-15301:
-----------------------------------
Summary: Malformed PutBlock request can mark container UNHEALTHY
Key: HDDS-15301
URL: https://issues.apache.org/jira/browse/HDDS-15301
Project: Apache Ozone
Issue Type: Bug
Components: Ozone Client, Ozone Datanode
Reporter: Chu Cheng Li
Assignee: Chu Cheng Li
h2. Summary
A malformed client {{PutBlock}} request can cause the datanode to mark the
target container {{UNHEALTHY}}. The request should be rejected as a
client-side malformed request, but it is currently mapped to
{{IO_EXCEPTION}}, which {{HddsDispatcher}} treats as a container write
failure.
As a result, a buggy or misbehaving client can poison an open container and
cause its pipeline to be closed.
h2. Environment
* Ozone version: {{2.2.0-SNAPSHOT}}
* Cluster: {{MiniOzoneCluster}}
* Pipeline: single-node Ratis pipeline
* Client: custom Rust Ozone client reproducing Java client incremental chunk
list semantics
h2. Repro
Send an incremental {{PutBlock}} request where {{BlockData.size}} does not
equal the sum of the {{len}} values of the chunks included in the request.
Example from the failing request:
{code:java}
putBlock {
  blockData {
    blockID {
      containerID: 2
      localID: 117883640217600002
      blockCommitSequenceId: 0
    }
    metadata { key: "incremental" }
    chunks {
      chunkName: "117883640217600002_chunk_16"
      offset: 16777216
      len: 1048576
      metadata { key: "full" }
      checksumData { type: NONE bytesPerChecksum: 0 }
    }
    size: 17825792
  }
  eof: false
}
{code}
The request includes only one {{1 MiB}} chunk, but {{size}} is {{17 MiB}}.
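The mismatch can be checked by hand with a small, self-contained sketch. The class and method names below are hypothetical and are not Ozone APIs; only the numbers come from the failing request. Note that the declared {{size}} equals {{offset + len}} of the single chunk (16777216 + 1048576 = 17825792), which may indicate that the client sent the cumulative block length, although that is an inference and not something confirmed above.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of Ozone.
final class PutBlockSizeCheck {

    // Minimal stand-in for a chunk entry, carrying only the fields used here.
    static final class Chunk {
        final String name;
        final long offset;
        final long len;

        Chunk(String name, long offset, long len) {
            this.name = name;
            this.offset = offset;
            this.len = len;
        }
    }

    // Sum of the len fields of the chunks carried in a single request.
    static long sumOfChunkLengths(List<Chunk> chunks) {
        long sum = 0;
        for (Chunk c : chunks) {
            sum = Math.addExact(sum, c.len); // throws ArithmeticException on overflow
        }
        return sum;
    }

    // True when the declared block size matches the chunks in the request.
    static boolean sizeMatchesChunks(long declaredSize, List<Chunk> chunks) {
        return declaredSize == sumOfChunkLengths(chunks);
    }

    public static void main(String[] args) {
        // Values copied from the failing request.
        List<Chunk> chunks = List.of(
            new Chunk("117883640217600002_chunk_16", 16_777_216L, 1_048_576L));
        long declaredSize = 17_825_792L;

        System.out.println("sum of chunks = " + sumOfChunkLengths(chunks));
        System.out.println("size matches  = " + sizeMatchesChunks(declaredSize, chunks));
    }
}
```

A client could run an equivalent check before sending an incremental {{PutBlock}}, so that this kind of mismatch is caught locally.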
h2. Actual Behavior
The datanode rejects the protobuf with a {{CodecException}}:
{code:java}
Caused by: org.apache.hadoop.hdds.utils.db.CodecException:
Size mismatch: size (=17825792) != sum of chunks (=1048576){code}
That exception is caught in {{KeyValueHandler.handlePutBlock}} as an
{{IOException}} and returned as {{IO_EXCEPTION}}:
{code:java}
Operation: PutBlock, Message: Put Key failed, Result: IO_EXCEPTION{code}
Then {{HddsDispatcher}} treats the failed write as a container write failure:
{code:java}
Marked container UNHEALTHY from OPEN: KeyValueContainerData #2{code}
After that, subsequent writes fail with:
{code:java}
Container 2 in UNHEALTHY state{code}
SCM closes the pipeline, and clients may later see retry/failover noise such as:
{code:java}
not leader; suggested_leader_present=false
exhausted retry-window resend attempts {code}
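The chain above turns a request validation failure into a container state change. As a rough sketch of the expected behavior (not a proposed patch), the snippet below separates request validation failures from genuine I/O failures before deciding whether a container should be marked {{UNHEALTHY}}. Everything in it is hypothetical: the {{Result}} enum, {{RequestValidationException}}, and both methods are stand-ins written for this illustration and do not mirror the actual {{KeyValueHandler}} or {{HddsDispatcher}} code.

```java
import java.io.IOException;

// Hypothetical sketch; none of these types are Ozone classes.
final class PutBlockFailureClassification {

    // Stand-in result codes used only by this sketch.
    enum Result { SUCCESS, MALFORMED_REQUEST, IO_EXCEPTION }

    // Stand-in for a failure raised while validating the request payload,
    // such as the size mismatch reported above.
    static final class RequestValidationException extends IOException {
        RequestValidationException(String message) {
            super(message);
        }
    }

    // Map a failure to a result code: validation problems are attributed to
    // the client, everything else is treated as a real I/O failure.
    static Result classify(IOException e) {
        if (e instanceof RequestValidationException) {
            return Result.MALFORMED_REQUEST;
        }
        return Result.IO_EXCEPTION;
    }

    // Only genuine I/O failures should affect container health in this sketch.
    static boolean shouldMarkUnhealthy(Result result) {
        return result == Result.IO_EXCEPTION;
    }

    public static void main(String[] args) {
        IOException bad = new RequestValidationException(
            "Size mismatch: size (=17825792) != sum of chunks (=1048576)");
        Result result = classify(bad);
        System.out.println(result + " -> mark unhealthy: " + shouldMarkUnhealthy(result));
    }
}
```

With a split like this, the malformed request would still fail, but the failure would stay scoped to the offending client and would not change the container state.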