[ https://issues.apache.org/jira/browse/HDFS-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937797#comment-17937797 ]
NaihaoFan commented on HDFS-4246:
---------------------------------

Hi [~qwertymaniac], hope all is well. I have one question about the method getExcludedNodes: why use `excludedNodes.getAllPresent(excludedNodes.asMap().keySet()).keySet().toArray(DatanodeInfo.EMPTY_ARRAY);` instead of simply `excludedNodes.asMap().keySet().toArray(DatanodeInfo.EMPTY_ARRAY);`? It seems the first (current) implementation refreshes the cached nodes; see [CachesExplained · google/guava Wiki|https://github.com/google/guava/wiki/CachesExplained#asmap]. Do we intend to refresh the cached nodes on every read? I think not refreshing them may be the right behavior, unless I am missing some background.

> The exclude node list should be more forgiving, for each output stream
> ----------------------------------------------------------------------
>
>                 Key: HDFS-4246
>                 URL: https://issues.apache.org/jira/browse/HDFS-4246
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 2.0.0-alpha
>            Reporter: Harsh J
>            Assignee: Harsh J
>            Priority: Minor
>             Fix For: 2.1.0-beta
>
>         Attachments: HDFS-4246.patch, HDFS-4246.patch, HDFS-4246.patch
>
>
> Originally observed by Inder on the mailing lists:
> {quote}
> Folks,
> I was wondering if there is any mechanism/logic to move a node back from the excludedNodeList to the live nodes so it can be tried again for new block creation. In the current DFSOutputStream code I do not see this. The use case is: if the write timeout is reduced and certain nodes get aggressively added to the excludedNodeList, and the client caches the DFSOutputStream, then the excluded nodes never get tried again in the lifetime of the application caching that DFSOutputStream.
> {quote}
> What this leads to is a special scenario that may impact smaller clusters more than larger ones:
> 1. A file is opened for continuous hflush/sync-based writes, such as an HBase WAL. This file is going to be kept open for a very long time, by design.
> 2. Over time, nodes are excluded for various errors, such as DN crashes, network failures, etc.
> 3. Eventually, the exclude list equals (or nearly equals) the live nodes list, and the write suffers. At the point of equality, the write also fails with an error that a block allocation could not be obtained.
> We should perhaps make the excludeNodes list a timed-cache collection, so that even as it fills up, the older excludes are pruned away, giving those nodes a try again later.
> One place we have to be careful about, though, is rack failures. Those sometimes do not come back fast enough, and can be problematic for retry code with such an eventually-forgiving list. Perhaps we could retain forgiven nodes and, if they are entered again, double or triple their forgiveness value (in time units) to counter this? It's just one idea.
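For reference, here is a minimal, self-contained sketch of the two read patterns the comment asks about. It is not the HDFS code: the class name is made up, plain Strings stand in for DatanodeInfo, and expireAfterAccess is chosen only so the access-time question is visible (the real excludedNodes cache may be configured differently, e.g. with expireAfterWrite).

{code:java}
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class ExcludedNodesCacheSketch {
  public static void main(String[] args) {
    // Stand-in for the excludedNodes cache; plain Strings replace DatanodeInfo
    // so the sketch is self-contained. expireAfterAccess is used here only to
    // make the access-time question observable.
    Cache<String, String> excludedNodes = CacheBuilder.newBuilder()
        .expireAfterAccess(10, TimeUnit.MINUTES)
        .build();

    excludedNodes.put("dn1:50010", "dn1:50010");
    excludedNodes.put("dn2:50010", "dn2:50010");

    // Current pattern: getAllPresent() is a bulk cache read, so per the Guava
    // docs it resets the access time of every entry it returns.
    String[] viaGetAllPresent = excludedNodes
        .getAllPresent(excludedNodes.asMap().keySet())
        .keySet()
        .toArray(new String[0]);

    // Alternative raised in the comment: reading only through the asMap()
    // collection view does not reset access times, so merely listing the
    // excluded nodes would not extend how long they stay excluded.
    String[] viaAsMap = excludedNodes.asMap().keySet().toArray(new String[0]);

    System.out.println(viaGetAllPresent.length + " vs " + viaAsMap.length);
  }
}
{code}

Per the Guava wiki linked above, cache read operations such as getAllPresent reset an entry's access time, while operations on the asMap() collection views do not; whether that difference matters for the excluded-nodes cache depends on how its expiry is configured.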