[ https://issues.apache.org/jira/browse/HDFS-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800588#action_12800588 ]

Todd Lipcon commented on HDFS-378:
----------------------------------

Been thinking about this a bit tonight. It seems to me we have the following 
classes of errors to deal with:

# A DN has died but the NN does not yet know about it. Thus, the client fails 
entirely to connect to the DN. Ideally, the client shouldn't retry this DN for 
quite some time.
# A DN is heavily loaded (above its max.xcievers value) and thus the client is 
rejected. But we'd like to retry it reasonably often, and ideally we don't want 
to fail a read completely even if all replicas are in this state for a short 
period of time.
# A particular replica is corrupt or missing on a DN. Here, we just want to 
avoid reading this particular block from this DN until the block has been 
rereplicated from a healthy copy.

Case #3 above is actually handled implicitly, with no long-term/inter-operation 
tracking on the client, since the client will report the bad block to the NN 
immediately upon discovering it. Then, on the next getBlockLocations call for 
the same block, it will automatically be filtered out of the LocatedBlocks 
result by the NN. Once it's been fixed up, the new valid location will end up 
in the LocatedBlocks result (whether on the same DN or a different one).
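For reference, the existing flow behind case #3 looks roughly like this. This 
is a simplified sketch, not the literal DFSInputStream code, but 
reportBadBlocks and getBlockLocations are the real ClientProtocol calls:

{code:java}
import java.io.IOException;

import org.apache.hadoop.hdfs.protocol.ClientProtocol;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlocks;

class CorruptReplicaFlowSketch {
  // On a checksum failure, tell the NN which replica was bad, then refetch
  // locations; the NN filters the corrupt replica out of the new result.
  void handleChecksumError(ClientProtocol namenode, String src,
                           LocatedBlock badBlock) throws IOException {
    namenode.reportBadBlocks(new LocatedBlock[] { badBlock });
    // Once the block has been re-replicated from a healthy copy, the new
    // valid location shows up here (on the same DN or a different one).
    LocatedBlocks refreshed = namenode.getBlockLocations(
        src, badBlock.getStartOffset(), badBlock.getBlockSize());
  }
}
{code}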

Given this, I disagree with the Description of this issue - I don't think the 
client needs to track failures by block, just by datanode, as long as checksum 
failures are handled differently than connection or timeout failures.

The remaining question is how to handle both case 1 and case 2 above in a 
convenient manner. Here's one idea:

Whenever the client fails to read from a datanode, the timestamp of the failure 
is recorded in a map keyed by node. When a block is to be read, the list of 
locations is sorted by ascending timestamp of last failure, so the nodes that 
have had problems least recently are retried first. Any node whose last failure 
is further in the past than some threshold (e.g. 5 minutes) is considered to 
have never failed and is removed from the map. Any node with no recorded 
failure info is prioritized above any node that does have failure info.

This should be fairly simple to implement without any protocol changes, and 
also easy to reason about. The map would ideally be DFSClient-wide, so that 
applications which open many separate InputStreams don't use a lot of extra 
memory and can share their view of DN health.
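Concretely, something like the following is what I have in mind. This is just 
a rough sketch: the class and method names are made up (DatanodeInfo is the 
existing type), and the 5-minute expiry is the example threshold from above.

{code:java}
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

class DatanodeFailureTracker {
  private static final long FAILURE_EXPIRY_MS = 5 * 60 * 1000;

  // Keyed by datanode; meant to be a single DFSClient-wide instance so that
  // all InputStreams share one view of DN health.
  private final Map<DatanodeInfo, Long> lastFailure =
      new HashMap<DatanodeInfo, Long>();

  synchronized void recordFailure(DatanodeInfo dn) {
    lastFailure.put(dn, System.currentTimeMillis());
  }

  // Order candidate replicas: nodes with no recorded failure come first,
  // then nodes whose last failure is furthest in the past.
  synchronized void sortByHealth(List<DatanodeInfo> locations) {
    long now = System.currentTimeMillis();
    // Entries older than the threshold are treated as "never failed".
    for (Iterator<Long> it = lastFailure.values().iterator(); it.hasNext(); ) {
      if (now - it.next() > FAILURE_EXPIRY_MS) {
        it.remove();
      }
    }
    Collections.sort(locations, new Comparator<DatanodeInfo>() {
      public int compare(DatanodeInfo a, DatanodeInfo b) {
        long ta = lastFailure.containsKey(a) ? lastFailure.get(a) : 0L;
        long tb = lastFailure.containsKey(b) ? lastFailure.get(b) : 0L;
        // Ascending last-failure timestamp: least-recently-failed first.
        return ta < tb ? -1 : (ta == tb ? 0 : 1);
      }
    });
  }
}
{code}

DFSInputStream would call recordFailure() on a connect or read error, and 
sortByHealth() on the located replicas before picking a node to read from.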

One possible improvement on the above is to use datanode heartbeat times to 
distinguish between case 1 and case 2. Specifically, a "relativeLastHeartbeat" 
field could be added to LocatedBlocks for each datanode. The client can then 
use this information to remove failure info for any DN whose failures were 
recorded before the last heartbeat. Thus, it will retry heavily loaded nodes 
once per heartbeat interval, but won't retry nodes that have actually failed. 
The downside is that this would require a protocol change, and would be harder 
to reason about in cases like network partitions, where a DN is heartbeating 
fine but some set of clients can't connect to it.
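In code, that could be one more method on the tracker sketched above. Again 
just a sketch: it assumes the new per-location relativeLastHeartbeat value (ms 
since the DN's last heartbeat as seen by the NN), which is exactly the protocol 
change mentioned.

{code:java}
  // Called when a fresh LocatedBlocks result comes back from the NN.
  synchronized void updateFromHeartbeats(List<DatanodeInfo> locations,
                                         long[] relativeLastHeartbeatMs) {
    long now = System.currentTimeMillis();
    for (int i = 0; i < locations.size(); i++) {
      DatanodeInfo dn = locations.get(i);
      Long failedAt = lastFailure.get(dn);
      long lastHeartbeatAt = now - relativeLastHeartbeatMs[i];
      // The DN has heartbeated since we last saw it fail (case 2: overloaded,
      // not dead), so forget the failure and let it sort as healthy again.
      if (failedAt != null && failedAt < lastHeartbeatAt) {
        lastFailure.remove(dn);
      }
    }
  }
{code}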

Looking forward to hearing people's thoughts.

> DFSClient should track failures by block rather than globally
> -------------------------------------------------------------
>
>                 Key: HDFS-378
>                 URL: https://issues.apache.org/jira/browse/HDFS-378
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Chris Douglas
>
> Rather than tracking the total number of times DFSInputStream failed to 
> locate a datanode for a particular block, such failures and the list of 
> datanodes involved should be scoped to individual blocks. In particular, the 
> "deadnode" list should be a map of blocks to a list of failed nodes, the 
> latter reset and the nodes retried per the existing semantics.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
