[ https://issues.apache.org/jira/browse/HDFS-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607935#comment-13607935 ]

Aaron T. Myers commented on HDFS-4246:
--------------------------------------

Patch looks pretty good to me. A few little comments:

# 30 minutes seems a little high to me for a default value. I could easily 
imagine someone performing a rolling restart of DNs in a cluster with 
long-lived clients, where the whole rolling restart takes less than 30 
minutes. In that case the client would not age off any of the excluded nodes, 
and by the end of the restart it could have excluded most or all of the DNs. 
Maybe something in the neighborhood of 5-10 minutes would make more sense? 
Thoughts?
# In the test, after restarting the two DNs, recommend adding a 
MiniDFSCluster#waitActive call to make sure that the DNs have finished 
restarting and re-registered with the NN before proceeding with the rest of 
the test.
# Recommend using ThreadUtil#sleepAtLeastIgnoreInterrupts to do the sleep in 
the test, to ensure the client has had enough time to remove the 
formerly-excluded nodes.
# You'll need to add a timeout to the test in order to pass test-patch. (A 
sketch combining points 2-4 follows this list.)
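
To make points 2-4 concrete, here is a rough sketch of the shape the test 
could take. The test name, the CACHE_EXPIRY_MS constant, and the 
DataNodeProperties handles are placeholders, not code from the patch:

{code:java}
// Rough sketch only; names and constants are placeholders.
@Test(timeout = 60000) // point 4: explicit timeout so test-patch passes
public void testExcludedNodesForgiveness() throws Exception {
  // ... start a write, fail two DNs so the client excludes them ...

  cluster.restartDataNode(dn1Props, true);
  cluster.restartDataNode(dn2Props, true);
  // Point 2: wait until the restarted DNs have re-registered with the NN.
  cluster.waitActive();

  // Point 3: sleep past the exclude-cache expiry; sleepAtLeastIgnoreInterrupts
  // guarantees the full duration even if the thread is interrupted.
  ThreadUtil.sleepAtLeastIgnoreInterrupts(CACHE_EXPIRY_MS + 1000);

  // ... keep writing and assert the formerly-excluded DNs are used again ...
}
{code}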
                
> The exclude node list should be more forgiving, for each output stream
> ----------------------------------------------------------------------
>
>                 Key: HDFS-4246
>                 URL: https://issues.apache.org/jira/browse/HDFS-4246
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 2.0.0-alpha
>            Reporter: Harsh J
>            Assignee: Harsh J
>            Priority: Minor
>         Attachments: HDFS-4246.patch, HDFS-4246.patch
>
>
> Originally observed by Inder on the mailing lists:
> {quote}
> Folks,
> I was wondering if there is any mechanism/logic to move a node back from the 
> excludedNodeList to the live nodes, to be tried for new block creation.
> In the current DFSOutputStream code I do not see this. The use-case: if the 
> write timeout is reduced and certain nodes are aggressively added to the 
> excludedNodeList, and the client caches the DFSOutputStream, then the 
> excluded nodes are never tried again in the lifetime of the application 
> caching the DFSOutputStream.
> {quote}
> This leads to a special scenario that may impact smaller clusters more than 
> larger ones:
> 1. A file is opened for continuous hflush/sync-based writes, such as an 
> HBase WAL. By design, this file is kept open for a very long time.
> 2. Over time, nodes are excluded for various errors, such as DN crashes, 
> network failures, etc.
> 3. Eventually the exclude list equals (or nearly equals) the live node list, 
> and the write suffers. Once the two are equal, the write fails outright with 
> an error about being unable to get a block allocation.
> We should perhaps make the excludeNodes list a timed-cache collection, so 
> that even if it begins filling up, older excludes are pruned away and those 
> nodes get tried again later (a sketch of this follows the quoted 
> description).
> One case we have to be careful about, though, is rack failures. Failed racks 
> sometimes do not come back for a long while, and retrying them via such an 
> eventually-forgiving list can be problematic. Perhaps we could remember 
> forgiven nodes and, if they are excluded again, double or triple their 
> forgiveness period (in time units) to counter this? It's just one idea.
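
For what it's worth, a timed cache along these lines can be built on Guava's 
CacheBuilder, which is already on the HDFS client's classpath. This is a 
minimal sketch under stated assumptions (a DatanodeInfo-keyed cache and an 
illustrative 10-minute expiry), not the committed design:

{code:java}
import java.util.Set;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Minimal sketch of a timed exclude-node cache; the 10-minute expiry is
// illustrative only (see comment 1 above on picking a default).
class ExcludedNodes {
  private static final long EXPIRY_MS = 10 * 60 * 1000L;

  private final Cache<DatanodeInfo, DatanodeInfo> cache =
      CacheBuilder.newBuilder()
          .expireAfterWrite(EXPIRY_MS, TimeUnit.MILLISECONDS)
          .build();

  void add(DatanodeInfo dn) {
    cache.put(dn, dn);
  }

  DatanodeInfo[] getSnapshot() {
    // asMap() exposes only live (unexpired) entries, so older excludes
    // fall away automatically and those DNs become eligible again.
    Set<DatanodeInfo> live = cache.asMap().keySet();
    return live.toArray(new DatanodeInfo[live.size()]);
  }
}
{code}

The rack-failure concern could be layered on top of this, e.g. by tracking a 
per-node expiry that doubles each time the same node is re-excluded, instead 
of the flat EXPIRY_MS used here.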

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
