[ 
https://issues.apache.org/jira/browse/HDFS-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613026#comment-13613026
 ] 

Chris Nauroth commented on HDFS-4633:
-------------------------------------

Here are some additional details.  There is a bad interaction between the 
1-second cache expiration used by 
{{TestDFSClientExcludedNodes#testExcludedNodesForgiveness}} and the 
exclusion/retry logic within 
{{DFSOutputStream#DataStreamer#nextBlockOutputStream}}.  Here is the sequence 
of events I observed during a failed test run.  Assume 3 data nodes named dn1, 
dn2, and dn3.

# DFSOutputStream writes first block to [dn1, dn2, dn3].
# Test stops data nodes [dn1, dn2].
# DFSOutputStream attempts to write the second block to [dn1, dn2, dn3].  The 
write to dn1 fails, so the client marks dn1 excluded.
# DFSOutputStream retries and attempts to write the second block to [dn2, 
dn3].  The write to dn2 fails, so the client marks dn2 excluded.
# DFSOutputStream retries, but by now, > 1 second has elapsed since dn1 failed, 
so dn1 has been evicted from the excluded nodes cache.  The client attempts to 
write the second block to [dn1, dn3].  This fails again, so it marks dn1 
excluded again.
# DFSOutputStream retries, but by now, > 1 second has elapsed since dn2 failed, 
so dn2 has been evicted from the excluded nodes cache.  The client attempts to 
write the second block to [dn2, dn3].  This fails again, so it marks dn2 
excluded again.
# At this point, {{DFSOutputStream#DataStreamer#nextBlockOutputStream}} has 
exceeded max block write retries (3).  It aborts and throws {{IOException}} 
with "Unable to create new block.".
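
The sequence above can be reproduced with a small deterministic model.  This is 
only a sketch under stated assumptions, not the actual HDFS client code: the 
class {{ExcludedNodesSimulation}}, the method {{writeSecondBlock}}, and the 
simulated clock are all hypothetical, and each attempt is assumed to cost a 
fixed amount of time and to fail on the first dead node in the pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of the exclusion/retry interaction; not HDFS code.
class ExcludedNodesSimulation {

    /**
     * Simulates writing the second block after dn1 and dn2 have been stopped.
     *
     * @param expiryMs      how long a node stays in the excluded cache
     * @param attemptCostMs assumed time consumed by each write attempt
     * @param maxRetries    retries allowed after the initial attempt
     * @return the 1-based attempt number that succeeded, or -1 if the
     *         client ran out of retries
     */
    static int writeSecondBlock(long expiryMs, long attemptCostMs, int maxRetries) {
        String[] allNodes = {"dn1", "dn2", "dn3"};
        Set<String> stopped = new HashSet<>(Arrays.asList("dn1", "dn2"));
        Map<String, Long> excludedAt = new HashMap<>(); // node -> exclusion time
        long now = 0; // simulated clock in milliseconds

        for (int attempt = 1; attempt <= maxRetries + 1; attempt++) {
            // Forgive nodes whose exclusion has expired.
            final long t = now;
            excludedAt.entrySet().removeIf(e -> t - e.getValue() >= expiryMs);

            // Build the pipeline from every node that is not excluded.
            List<String> pipeline = new ArrayList<>();
            for (String dn : allNodes) {
                if (!excludedAt.containsKey(dn)) {
                    pipeline.add(dn);
                }
            }

            // The attempt fails on the first stopped node in the pipeline.
            String failed = null;
            for (String dn : pipeline) {
                if (stopped.contains(dn)) {
                    failed = dn;
                    break;
                }
            }

            now += attemptCostMs;
            if (failed == null) {
                return attempt; // pipeline contained only live nodes
            }
            excludedAt.put(failed, now); // mark the bad node excluded
        }
        return -1; // corresponds to "Unable to create new block."
    }
}
```

In this model, {{writeSecondBlock(1000, 1100, 3)}} follows the failing sequence 
(dn1, dn2, dn1 again, dn2 again) and gives up, while a longer expiration such as 
{{writeSecondBlock(10000, 1100, 3)}} succeeds on the third attempt with a 
pipeline of [dn3].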

                
> TestDFSClientExcludedNodes fails sporadically if excluded nodes cache expires 
> too quickly
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-4633
>                 URL: https://issues.apache.org/jira/browse/HDFS-4633
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client, test
>    Affects Versions: 3.0.0
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>
> {{TestDFSClientExcludedNodes}} simulates failures of individual data nodes in 
> the client's write pipeline and checks the client's ability to recover.  
> HDFS-4246 added support for periodic "forgiveness" by caching the list of 
> known bad data nodes with a periodic eviction.  The test uses a 1-second 
> cache expiration.  This sometimes causes failed nodes to be forgiven too 
> quickly, which violates the assumptions of the test.
