[
https://issues.apache.org/jira/browse/HDFS-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800588#action_12800588
]
Todd Lipcon commented on HDFS-378:
----------------------------------
Been thinking about this a bit tonight. It seems to me we have the following
classes of errors to deal with:
# A DN has died but the NN does not yet know about it. The client therefore fails
entirely to connect to the DN. Ideally, the client shouldn't retry this node for
quite some time.
# A DN is heavily loaded (above its max.xcievers value) and rejects the client.
We'd like to retry such a node reasonably often, and ideally we don't want to fail
a read outright even if all of a block's replicas are in this state for a short
period of time.
# A particular replica is corrupt or missing on a DN. Here, we just want to
avoid reading this particular block from this DN until the block has been
rereplicated from a healthy copy.
Case #3 above is actually handled implicitly with no long-term/inter-operation
tracking on the client, since the client will report the bad block to the NN
immediately upon discovering it. Then, on the next getBlockLocations call for
the same block, the bad replica is automatically filtered out of the LocatedBlocks
result by the NN. Once the block has been re-replicated, the new valid location
will show up in the LocatedBlocks result (whether on the same DN or a different one).
Given this, I disagree with the Description of this issue - I don't think the
client needs to track failures by block, just by datanode, as long as checksum
failures are handled differently than connection or timeout failures.
The remaining question is how to handle both case 1 and case 2 above in a
convenient manner. Here's one idea:
Whenever the client fails to read from a datanode, the timestamp of the failure
is recorded in a map keyed by node. When a block is to be read, the list of
locations is sorted by ascending timestamp of last failure, so the nodes that
have had problems least recently are retried first. Any node whose last failure
is older than some threshold (e.g. 5 minutes) is considered never to have failed
and is removed from the map. Any node that has no recorded
failure info should be prioritized above any node that does have failure info.
This should be fairly simple to implement without any protocol changes, and
also easy to reason about. The map would ideally be DFSClient-wide, so
applications that open many separate InputStreams don't pay much extra memory
and share a single view of DN health.
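Roughly what I'm picturing (the class and constant names below are made up, not
actual DFSClient code; just a sketch of the idea):
{code:java}
// Sketch only: a DFSClient-wide tracker of recent per-datanode read failures.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

class NodeFailureTracker {
  // Forget a failure after this long; the node is then treated as never-failed.
  private static final long FAILURE_EXPIRY_MS = 5 * 60 * 1000;

  // Shared across all input streams of a client: datanode id -> last failure time.
  private final ConcurrentHashMap<String, Long> lastFailure = new ConcurrentHashMap<>();

  /** Record a connect/timeout failure against a datanode (not a checksum failure). */
  void recordFailure(String datanodeId) {
    lastFailure.put(datanodeId, System.currentTimeMillis());
  }

  /** Drop entries old enough that the node should be treated as healthy again. */
  private void expireOldFailures() {
    long cutoff = System.currentTimeMillis() - FAILURE_EXPIRY_MS;
    lastFailure.values().removeIf(t -> t < cutoff);
  }

  /**
   * Order the replica locations for a block: nodes with no recorded failure
   * come first, then nodes by ascending timestamp of last failure, so the
   * least recently failed nodes are retried first.
   */
  List<String> orderLocations(List<String> locations) {
    expireOldFailures();
    List<String> ordered = new ArrayList<>(locations);
    ordered.sort(Comparator.comparingLong(
        (String node) -> lastFailure.getOrDefault(node, Long.MIN_VALUE)));
    return ordered;
  }
}
{code}
DFSInputStream would call recordFailure() on connect or timeout errors (but not
on checksum errors, per the above) and orderLocations() before picking a replica
to read from.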
One possible improvement on the above is to use datanode heartbeat times to
distinguish between case 1 and case 2. Specifically, a "relativeLastHeartbeat"
field could be added to LocatedBlocks for each datanode. The client can then use
this information to clear the failure info for any DN whose failure was recorded
before that DN's last heartbeat. Thus, the client will retry heavily loaded nodes once per
heartbeat interval, but won't retry nodes that have actually failed. The
downside is that this would require a protocol change, and it would be harder to
reason about in cases like a network partition where a DN is heartbeating fine but
some set of clients can't connect to it.
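A rough sketch of that variant, again with made-up names - this would be a method
added to the tracker sketched above, and relativeLastHeartbeatMs stands in for the
proposed (not existing) per-location field, i.e. millis since the DN's last
heartbeat as seen by the NN:
{code:java}
// Hypothetical addition to the NodeFailureTracker sketch above.
void clearIfHeartbeatedSinceFailure(String datanodeId, long relativeLastHeartbeatMs) {
  Long failedAt = lastFailure.get(datanodeId);
  if (failedAt == null) {
    return; // nothing recorded against this node
  }
  long lastHeartbeatAt = System.currentTimeMillis() - relativeLastHeartbeatMs;
  if (failedAt < lastHeartbeatAt) {
    // The DN has heartbeated to the NN since our failure, so it is probably
    // alive but was overloaded (case 2) rather than dead (case 1); retry it.
    lastFailure.remove(datanodeId, failedAt);
  }
}
{code}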
Looking forward to hearing people's thoughts.
> DFSClient should track failures by block rather than globally
> -------------------------------------------------------------
>
> Key: HDFS-378
> URL: https://issues.apache.org/jira/browse/HDFS-378
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Chris Douglas
>
> Rather than tracking the total number of times DFSInputStream failed to
> locate a datanode for a particular block, such failures and the list of
> datanodes involved should be scoped to individual blocks. In particular, the
> "deadnode" list should be a map of blocks to a list of failed nodes, the
> latter reset and the nodes retried per the existing semantics.