[
https://issues.apache.org/jira/browse/HDDS-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
István Fajth updated HDDS-6083:
-------------------------------
Description:
We have not seen many occurrences, but what we have seen a couple of times
already is this:
{code}
Error:
testWriteShouldSuccessIfLessThanParityNodesFail(org.apache.hadoop.ozone.client.TestOzoneECClient)
Time elapsed: 0.116 s <<< FAILURE!
java.lang.AssertionError: expected:<2> but was:<1>
{code}
Based on what I found, I guess the problem can affect more things, but we have
not seen many symptoms because we have been lucky so far.
It seems that the problem comes from [this
line|https://github.com/apache/ozone/blob/HDDS-3816-ec/hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/MultiNodePipelineBlockAllocator.java#L55].
If we are unlucky and get the same int twice, then two pseudo DNs in the
pipeline get the same MockDNStorage assigned. This means that if we declare
that node failed, we get a secondary failure during failure handling, which
the code is not prepared for as of now. We also swallow the exception in
handleOutputStreamWrite inside ECKeyOutputStream, which we use from the
handleStripeFailure method as well as during the regular write.
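To illustrate the collision, here is a minimal sketch, assuming the linked line
draws a random int to choose a mock datanode for each pipeline position. The
class and method names (DistinctNodePicker, pickWithReplacement, pickDistinct)
and the pool size are hypothetical and are not taken from
MultiNodePipelineBlockAllocator. Drawing with replacement can return the same
index twice, while shuffling the pool and taking a prefix cannot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the Ozone code base.
class DistinctNodePicker {

  // Draws `count` indices with replacement. The same index may be returned
  // more than once, which is the collision described above.
  static List<Integer> pickWithReplacement(Random random, int poolSize,
      int count) {
    List<Integer> picked = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      picked.add(random.nextInt(poolSize));
    }
    return picked;
  }

  // Draws `count` distinct indices by shuffling the whole pool and taking
  // the first `count` entries.
  static List<Integer> pickDistinct(Random random, int poolSize, int count) {
    if (count > poolSize) {
      throw new IllegalArgumentException(
          "Cannot pick " + count + " distinct nodes out of " + poolSize);
    }
    List<Integer> pool = new ArrayList<>();
    for (int i = 0; i < poolSize; i++) {
      pool.add(i);
    }
    Collections.shuffle(pool, random);
    return new ArrayList<>(pool.subList(0, count));
  }

  public static void main(String[] args) {
    Random random = new Random();
    // Assumed sizes for illustration: 10 mock nodes, 5 pipeline positions.
    List<Integer> naive = pickWithReplacement(random, 10, 5);
    List<Integer> distinct = pickDistinct(random, 10, 5);
    System.out.println("with replacement: " + naive
        + " unique=" + new HashSet<>(naive).size());
    System.out.println("distinct:         " + distinct
        + " unique=" + new HashSet<>(distinct).size());
  }
}
```

With the first method, the number of unique indices can be smaller than the
number of pipeline positions on some runs, which would match a test that only
fails occasionally.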
was:
We have not seen many occurrences, but what we have seen a couple of times
already is this:
{code}
Error:
testWriteShouldSuccessIfLessThanParityNodesFail(org.apache.hadoop.ozone.client.TestOzoneECClient)
Time elapsed: 0.116 s <<< FAILURE!
java.lang.AssertionError: expected:<2> but was:<1>
{code}
Based on what I found, I guess the problem can affect more things, but we have
not seen many symptoms because we have been lucky so far.
It seems that the problem comes from [this
line|https://github.com/apache/ozone/blob/HDDS-3816-ec/hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/MultiNodePipelineBlockAllocator.java#L55].
If we are unlucky and get the same int twice, then two pseudo DNs in the
pipeline get the same client assigned. This means that if we declare that node
failed, we get a secondary failure during failure handling, which the code is
not prepared for as of now. We also swallow the exception in
handleOutputStreamWrite inside ECKeyOutputStream, which we use from the
handleStripeFailure method as well as during the regular write.
> Fix flakiness of tests around node failures
> --------------------------------------------
>
> Key: HDDS-6083
> URL: https://issues.apache.org/jira/browse/HDDS-6083
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: István Fajth
> Assignee: István Fajth
> Priority: Major