[jira] [Updated] (HDDS-6083) Fix flakyness of tests around nodefailures

Jira Thu, 09 Dec 2021 14:25:05 -0800


     [ 
https://issues.apache.org/jira/browse/HDDS-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


István Fajth updated HDDS-6083:
-------------------------------
    Description: 
We have not seen this often, but what we have seen a couple of times already 
is this:
{code}
Error:  
testWriteShouldSuccessIfLessThanParityNodesFail(org.apache.hadoop.ozone.client.TestOzoneECClient)
  Time elapsed: 0.116 s  <<< FAILURE!
java.lang.AssertionError: expected:<2> but was:<1>
{code}

Based on what I found, I suspect the problem can affect more tests, but we have 
not seen many symptoms because we have been lucky so far.
The problem seems to come from [this 
line|https://github.com/apache/ozone/blob/HDDS-3816-ec/hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/MultiNodePipelineBlockAllocator.java#L55].
 If we are unlucky and get the same int twice, we end up with two pseudo DNs 
in the pipeline that get the same MockDNStorage assigned. That means that if 
we declare that node failed, we get a secondary failure during failure 
handling, and the code is not prepared for that as of now. We also swallow 
the exception in handleOutputStreamWrite inside ECKeyOutputStream, which we 
use from the handleStripeFailure method as well as during the regular write.
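
To illustrate the collision described above, here is a minimal sketch. It does 
not reproduce the linked line; it only assumes that each pseudo DN gets an int 
drawn independently from java.util.Random. The class name, method names, and 
the pool size of 1000 are hypothetical and do not come from the Ozone code 
base. It contrasts independent draws, which can repeat, with draws that are 
retried until all ids in the pipeline are distinct.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration only; not the actual MultiNodePipelineBlockAllocator.
final class PipelineNodeIds {

  /** Collision-prone: every id is drawn independently, so repeats can occur. */
  static List<Integer> pickIndependently(int count, int bound, Random random) {
    List<Integer> ids = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      ids.add(random.nextInt(bound));
    }
    return ids;
  }

  /** Collision-free: a candidate is redrawn until it has not been used yet. */
  static List<Integer> pickDistinct(int count, int bound, Random random) {
    if (count > bound) {
      throw new IllegalArgumentException("cannot pick " + count
          + " distinct ids from a range of size " + bound);
    }
    Set<Integer> seen = new HashSet<>();
    List<Integer> ids = new ArrayList<>();
    while (ids.size() < count) {
      int candidate = random.nextInt(bound);
      if (seen.add(candidate)) { // add() returns false for a repeated id
        ids.add(candidate);
      }
    }
    return ids;
  }

  /** True if at least two entries in the list are equal. */
  static boolean hasDuplicates(List<Integer> ids) {
    return new HashSet<>(ids).size() < ids.size();
  }

  public static void main(String[] args) {
    Random random = new Random();
    int trials = 100_000;
    int collisions = 0;
    for (int i = 0; i < trials; i++) {
      // Assumed shape: 5 pseudo DNs per pipeline, ids drawn from [0, 1000).
      if (hasDuplicates(pickIndependently(5, 1000, random))) {
        collisions++;
      }
    }
    System.out.println("pipelines with a repeated id: "
        + collisions + " of " + trials);
  }
}
```

With independent draws, a small fraction of the generated pipelines contains a 
repeated id, which would match a test that fails only occasionally. Drawing 
distinct ids removes that source of flakiness under the stated assumption.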

  was:
We have not seen this often, but what we have seen a couple of times already 
is this:
{code}
Error:  
testWriteShouldSuccessIfLessThanParityNodesFail(org.apache.hadoop.ozone.client.TestOzoneECClient)
  Time elapsed: 0.116 s  <<< FAILURE!
java.lang.AssertionError: expected:<2> but was:<1>
{code}

Based on what I found, I suspect the problem can affect more tests, but we have 
not seen many symptoms because we have been lucky so far.
The problem seems to come from [this 
line|https://github.com/apache/ozone/blob/HDDS-3816-ec/hadoop-ozone/client/src/test/java/org/apache/hadoop/ozone/client/MultiNodePipelineBlockAllocator.java#L55].
 If we are unlucky and get the same int twice, we end up with two pseudo DNs 
in the pipeline that get the same client assigned. That means that if we 
declare that node failed, we get a secondary failure during failure handling, 
and the code is not prepared for that as of now. We also swallow the 
exception in handleOutputStreamWrite inside ECKeyOutputStream, which we use 
from the handleStripeFailure method as well as during the regular write.


> Fix flakyness of tests around nodefailures
> ------------------------------------------
>
>                 Key: HDDS-6083
>                 URL: https://issues.apache.org/jira/browse/HDDS-6083
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: István Fajth
>            Assignee: István Fajth
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
