[
https://issues.apache.org/jira/browse/HDFS-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613026#comment-13613026
]
Chris Nauroth commented on HDFS-4633:
-------------------------------------
Here are some additional details. There is a bad interaction between the
1-second cache expiration used by
{{TestDFSClientExcludedNodes#testExcludedNodesForgiveness}} and the
exclusion/retry logic within
{{DFSOutputStream#DataStreamer#nextBlockOutputStream}}. Here is the sequence
of events I observed during a failed test run. Assume 3 data nodes named dn1,
dn2, and dn3.
# DFSOutputStream writes first block to [dn1, dn2, dn3].
# Test stops data nodes [dn1, dn2].
# DFSOutputStream attempts to write the second block to [dn1, dn2, dn3]. The
write to dn1 fails, so dn1 is marked excluded.
# DFSOutputStream retries and attempts to write the second block to [dn2,
dn3]. The write to dn2 fails, so dn2 is marked excluded.
# DFSOutputStream retries, but by now more than 1 second has elapsed since dn1
failed. dn1 has been evicted from the excluded nodes cache, so the client
attempts to write the second block to [dn1, dn3]. This fails again, so dn1 is
marked excluded again.
# DFSOutputStream retries, but by now more than 1 second has elapsed since dn2
failed. dn2 has been evicted from the excluded nodes cache, so the client
attempts to write the second block to [dn2, dn3]. This fails again, so dn2 is
marked excluded again.
# At this point, {{DFSOutputStream#DataStreamer#nextBlockOutputStream}} has
exceeded max block write retries (3). It aborts and throws {{IOException}}
with "Unable to create new block.".
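The sequence above can be reproduced with a small deterministic model. This is only a sketch, not the actual {{DFSOutputStream}} code: the class name, the method {{writeSecondBlock}}, and the 1100 ms cost of a failed attempt are assumptions chosen for illustration, and time is simulated with a counter rather than a real clock. It models one initial attempt plus 3 retries, with an exclusion entry evicted once its age exceeds the expiry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the retry/expiry interaction; NOT the real HDFS client code. */
final class ExcludedNodesExpirySim {
    static final List<String> ALL_NODES = Arrays.asList("dn1", "dn2", "dn3");
    static final List<String> STOPPED_NODES = Arrays.asList("dn1", "dn2");

    /**
     * Simulates writing the second block after dn1 and dn2 were stopped.
     *
     * @param expiryMs        how long a node stays in the excluded nodes cache
     * @param failedAttemptMs assumed time spent on each failed attempt (hypothetical)
     * @param maxRetries      retries allowed after the initial attempt
     * @return the 1-based attempt number that succeeded, or -1 if retries ran out
     */
    static int writeSecondBlock(long expiryMs, long failedAttemptMs, int maxRetries) {
        Map<String, Long> excludedAt = new HashMap<>(); // node -> simulated time of exclusion
        long now = 0;
        for (int attempt = 1; attempt <= maxRetries + 1; attempt++) {
            final long start = now;
            // "Forgive" nodes whose exclusion is older than the expiry.
            excludedAt.values().removeIf(t -> start - t > expiryMs);

            // The pipeline is every node that is not currently excluded.
            List<String> pipeline = new ArrayList<>(ALL_NODES);
            pipeline.removeAll(excludedAt.keySet());

            // The first stopped node in the pipeline makes the attempt fail.
            String failed = null;
            for (String node : pipeline) {
                if (STOPPED_NODES.contains(node)) {
                    failed = node;
                    break;
                }
            }
            if (failed == null) {
                return attempt; // only live nodes remain in the pipeline
            }
            now += failedAttemptMs;      // time lost detecting the failure
            excludedAt.put(failed, now); // mark the bad node excluded
        }
        return -1; // corresponds to "Unable to create new block."
    }

    public static void main(String[] args) {
        // 1-second expiry, slow failures: dn1 and dn2 keep getting forgiven.
        System.out.println(writeSecondBlock(1000, 1100, 3));  // prints -1
        // Longer (hypothetical) expiry: both bad nodes stay excluded.
        System.out.println(writeSecondBlock(10000, 1100, 3)); // prints 3
    }
}
```

In this model the outcome depends only on whether a failed attempt takes longer than the cache expiry. With a 1000 ms expiry, an assumed 600 ms failed attempt still succeeds on the third attempt, which is consistent with the failure being sporadic.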
> TestDFSClientExcludedNodes fails sporadically if excluded nodes cache expires
> too quickly
> -----------------------------------------------------------------------------------------
>
> Key: HDFS-4633
> URL: https://issues.apache.org/jira/browse/HDFS-4633
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client, test
> Affects Versions: 3.0.0
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
>
> {{TestDFSClientExcludedNodes}} simulates failures of individual data nodes in
> the client's write pipeline and checks the client's ability to recover.
> HDFS-4246 added support for periodic "forgiveness" by caching the list of
> known bad data nodes with a periodic eviction. The test uses a 1-second
> cache expiration. This sometimes causes failed nodes to be forgiven too
> quickly, which violates the assumptions of the test.