[
https://issues.apache.org/jira/browse/HDFS-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Binglin Chang updated HDFS-4273:
--------------------------------
Attachment: HDFS-4273.v7.patch
Updated patch. Changes:
1. Rebased to current trunk.
2. The local DN in deadNodes can now expire; after it expires, it is removed
from deadNodes.
3. Set the static constant LOCAL_DEADNODE_EXPIRE_MILLISECONDS to 10 minutes, so
the local DN expires from deadNodes after 10 minutes, and read operations will
then try to use the local DN again if possible. Assuming a connection attempt
to a dead local DN fails fast, the performance impact of the extra retry should
be small.
We can make LOCAL_DEADNODE_EXPIRE_MILLISECONDS configurable by adding it to
dfsclient.conf, if someone thinks it is necessary.
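To illustrate items 2 and 3, here is a minimal sketch of an expiring dead-node
set. It is not the code in HDFS-4273.v7.patch: the class ExpiringDeadNodes, its
methods, and the use of plain strings for datanodes are assumptions made for
illustration. Only the constant name and the 10 minute value come from the
description above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of dead-node tracking where only the local DN expires. */
class ExpiringDeadNodes {
    // Constant name and value taken from the patch description: 10 minutes.
    static final long LOCAL_DEADNODE_EXPIRE_MILLISECONDS = 10 * 60 * 1000L;

    private final String localNode; // assumed identifier of the local DN
    // Maps each dead node to the time (in ms) at which it was marked dead.
    private final Map<String, Long> deadNodes = new ConcurrentHashMap<>();

    ExpiringDeadNodes(String localNode) {
        this.localNode = localNode;
    }

    void addDeadNode(String node, long nowMillis) {
        deadNodes.put(node, nowMillis);
    }

    /** Returns true if the node is still considered dead at nowMillis. */
    boolean isDead(String node, long nowMillis) {
        Long markedAt = deadNodes.get(node);
        if (markedAt == null) {
            return false;
        }
        boolean expired = node.equals(localNode)
            && nowMillis - markedAt >= LOCAL_DEADNODE_EXPIRE_MILLISECONDS;
        if (expired) {
            // The local DN gets another chance once its entry has expired.
            deadNodes.remove(node);
            return false;
        }
        return true;
    }
}
```

Passing the current time in as a parameter keeps the sketch easy to test; a
real client would more likely read a clock internally.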
> Problem in DFSInputStream read retry logic may cause early failure
> ------------------------------------------------------------------
>
> Key: HDFS-4273
> URL: https://issues.apache.org/jira/browse/HDFS-4273
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.0.2-alpha
> Reporter: Binglin Chang
> Assignee: Binglin Chang
> Priority: Minor
> Attachments: HDFS-4273-v2.patch, HDFS-4273.patch, HDFS-4273.v3.patch,
> HDFS-4273.v4.patch, HDFS-4273.v5.patch, HDFS-4273.v6.patch,
> HDFS-4273.v7.patch, TestDFSInputStream.java
>
>
> Assume the following call logic
> {noformat}
> readWithStrategy()
>   -> blockSeekTo()
>   -> readBuffer()
>     -> reader.doRead()
>     -> seekToNewSource() adds currentNode to deadNodes, hoping to get a
>        different datanode
>       -> blockSeekTo()
>         -> chooseDataNode()
>           -> block missing: clears deadNodes and picks currentNode again
>       seekToNewSource() returns false
>     readBuffer() re-throws the exception and quits the loop
> readWithStrategy() gets the exception, and may fail the read call before
> MaxBlockAcquireFailures attempts have been tried.
> {noformat}
> Some issues with this logic:
> 1. The seekToNewSource() logic is broken because deadNodes may be cleared in
> the middle of it.
> 2. The variable "int retries=2" in readWithStrategy seems to conflict with
> MaxBlockAcquireFailures. Should it be removed?
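As a rough model of the behavior the description asks for, the sketch below
uses one loop in which a single failure counter is the only limit on retries,
and deadNodes is cleared only when every replica has already been tried. This
is not the DFSInputStream code or the attached patch: RetrySketch,
BlockReaderFn, and readWithRetries are hypothetical names, and datanodes are
modeled as plain strings.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of a read loop bounded only by maxBlockAcquireFailures. */
class RetrySketch {
    /** Assumed stand-in for a read attempt; returns true on success. */
    interface BlockReaderFn {
        boolean tryRead(String node);
    }

    /**
     * Tries replicas until one read succeeds or every replica has failed
     * maxBlockAcquireFailures times in a row. Each attempted node is appended
     * to attempts so callers can inspect the order of tries.
     */
    static boolean readWithRetries(List<String> replicas,
                                   int maxBlockAcquireFailures,
                                   BlockReaderFn reader,
                                   List<String> attempts) {
        Set<String> deadNodes = new HashSet<>();
        int failures = 0;
        while (true) {
            // Pick the first replica that is not marked dead.
            String chosen = null;
            for (String r : replicas) {
                if (!deadNodes.contains(r)) {
                    chosen = r;
                    break;
                }
            }
            if (chosen == null) {
                // All replicas are dead: count one acquire failure.
                failures++;
                if (failures >= maxBlockAcquireFailures) {
                    return false;
                }
                // Only now is it safe to clear deadNodes and start over.
                deadNodes.clear();
                continue;
            }
            attempts.add(chosen);
            if (reader.tryRead(chosen)) {
                return true;
            }
            deadNodes.add(chosen);
        }
    }
}
```

With a single replica and maxBlockAcquireFailures set to 3, this loop makes
three attempts before giving up, rather than failing after the first error.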