Chi-Hsuan Huang created HDDS-15605:
--------------------------------------
Summary: Intermittent failure in
TestFailureHandlingByClient.testContainerExclusionWithClosedContainerException
Key: HDDS-15605
URL: https://issues.apache.org/jira/browse/HDDS-15605
Project: Apache Ozone
Issue Type: Sub-task
Components: test
Reporter: Chi-Hsuan Huang
h2. Symptom
{{TestFailureHandlingByClient.testContainerExclusionWithClosedContainerException}}
fails intermittently (observed ~1/40 on CI, not reproducible locally) with
an assertion failure, not a timeout:
{code}
TestFailureHandlingByClient.testContainerExclusionWithClosedContainerException:398
java.lang.AssertionError:
Expecting empty but was:
[5efc24c5-0b87-4bf7-80b0-751fafcf3248(null/null)]
{code}
Line 398 asserts that {{keyOutputStream.getExcludeList().getDatanodes()}} is
empty: only the closed container should be excluded, not any datanode.
h2. Root cause analysis
In {{KeyOutputStream.handleException}} (around lines 386-400), excluding a
datanode and excluding the container are two independent decisions that can
both fire:
{code}
Collection failedServers = streamEntry.getFailedServers();
if (!failedServers.isEmpty()) {
  excludeList.addDatanodes(failedServers); // populates getDatanodes()
}
if (containerExclusionException) {
  excludeList.addConatinerId(...); // container (expected by the test)
} else {
  excludeList.addPipeline(pipelineId);
}
{code}
The test assumes the second write fails only with {{ClosedContainerException}},
so {{failedServers}} is empty. But the excluded datanode is printed as
{{(null/null)}}, which is what {{XceiverClientRatis.addDatanodetoReply}}
produces (it builds {{DatanodeDetails}} from the Ratis peer UUID only, with no
IP or hostname). This points to a Ratis peer write/watch failure rather than a
clean {{ClosedContainerException}}.
Sequence:
# {{TestHelper.waitForContainerClose}} closes the container, which also tears
down the Ratis pipeline on the datanodes.
# The subsequent write (or its watch-for-commit) to a Ratis peer can fail or
time out while the pipeline is closing, so that peer is recorded in
{{failedServers}}.
# {{handleException}} then adds that datanode to the exclude list in addition
to the container, so {{getDatanodes()}} is non-empty and the assertion on
line 398 fails.
This is a timing race between container close and the in-flight Ratis
write/watch, which is why it only shows up under load on CI.
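The last step of the sequence can be checked in isolation. The following is a
minimal stand-alone sketch, not the Ozone code: {{ExcludeDecisionSketch}},
{{ExcludeListModel}} and the plain string/long identifiers are hypothetical
stand-ins for {{KeyOutputStream}}, {{ExcludeList}} and the real ID types. It
only mirrors the two independent {{if}} blocks quoted above, assuming the
failed peer and the container exclusion are seen in the same
{{handleException}} call.

{code:java}
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone model; NOT the real Ozone ExcludeList/KeyOutputStream.
public class ExcludeDecisionSketch {

  /** Simplified stand-in for the client's exclude list. */
  static final class ExcludeListModel {
    final Set<String> datanodes = new LinkedHashSet<>();
    final Set<Long> containerIds = new LinkedHashSet<>();
    final Set<String> pipelineIds = new LinkedHashSet<>();
  }

  /** Mirrors the two independent decisions from the quoted snippet. */
  static ExcludeListModel handleException(Collection<String> failedServers,
      boolean containerExclusionException, long containerId,
      String pipelineId) {
    ExcludeListModel excludeList = new ExcludeListModel();
    // Decision 1: any failed peer is excluded as a datanode.
    if (!failedServers.isEmpty()) {
      excludeList.datanodes.addAll(failedServers);
    }
    // Decision 2: exclude either the container or the pipeline.
    if (containerExclusionException) {
      excludeList.containerIds.add(containerId);
    } else {
      excludeList.pipelineIds.add(pipelineId);
    }
    return excludeList;
  }

  public static void main(String[] args) {
    // Case the test expects: clean close, no failed peer.
    ExcludeListModel clean =
        handleException(List.of(), true, 1L, "pipeline-1");
    System.out.println("clean: datanodes=" + clean.datanodes
        + " containers=" + clean.containerIds);

    // Case seen on CI: a peer failed while the pipeline was closing.
    ExcludeListModel raced = handleException(
        List.of("5efc24c5-0b87-4bf7-80b0-751fafcf3248"), true, 1L,
        "pipeline-1");
    System.out.println("raced: datanodes=" + raced.datanodes
        + " containers=" + raced.containerIds);
  }
}
{code}

With an empty {{failedServers}} only the container is excluded, which is what
the test expects. With one failed peer the datanode set is non-empty as well,
which matches the CI failure.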
h2. Notes
* Distinct from HDDS-7878 (resolved), which tracked an intermittent
_timeout_ in the same method. This is an _assertion_ failure with a different
cause.
* Observed in CI:
[https://github.com/chihsuan/ozone/actions/runs/27691671664|https://github.com/chihsuan/ozone/actions/runs/27691671664]
(job: integration (client)).
* The assertion on line 398 (and likely 399 for pipelines) may be too strict
given that container close can legitimately surface a transient
datanode/pipeline failure.
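As an illustration of what a less strict check could look like, here is a
stand-alone sketch. It is not a proposed patch: {{RelaxedExcludeCheckSketch}},
{{check}} and the {{maxTransientDatanodes}} bound are hypothetical, and whether
tolerating any excluded datanode is acceptable for this test is an open
question. The idea is to keep the hard requirement (the closed container is
excluded) and bound, rather than forbid, transient datanode exclusions.

{code:java}
import java.util.Set;

// Hypothetical helper; NOT part of TestFailureHandlingByClient.
public class RelaxedExcludeCheckSketch {

  /**
   * Returns null when the exclude list is acceptable, otherwise a message
   * describing the violation.
   */
  static String check(Set<Long> excludedContainers,
      Set<String> excludedDatanodes, long closedContainerId,
      int maxTransientDatanodes) {
    // Hard requirement: the closed container must be excluded.
    if (!excludedContainers.contains(closedContainerId)) {
      return "closed container " + closedContainerId + " was not excluded";
    }
    // Soft requirement: tolerate a bounded number of transient failures.
    if (excludedDatanodes.size() > maxTransientDatanodes) {
      return "too many datanodes excluded: " + excludedDatanodes;
    }
    return null;
  }

  public static void main(String[] args) {
    // One transiently failed peer is tolerated when the bound is 1.
    String result = check(Set.of(1L),
        Set.of("5efc24c5-0b87-4bf7-80b0-751fafcf3248"), 1L, 1);
    System.out.println(result == null ? "acceptable" : result);
  }
}
{code}

Setting the bound to 0 reproduces the current strict behaviour, so the two
variants can be compared without changing the rest of the check.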