[ https://issues.apache.org/jira/browse/HDFS-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Harsh J updated HDFS-4246:
--------------------------

    Target Version/s: 3.0.0
              Status: Patch Available  (was: Open)

> The exclude node list should be more forgiving, for each output stream
> ----------------------------------------------------------------------
>
>                 Key: HDFS-4246
>                 URL: https://issues.apache.org/jira/browse/HDFS-4246
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>            Reporter: Harsh J
>            Assignee: Harsh J
>            Priority: Minor
>         Attachments: HDFS-4246.patch
>
>
> Originally observed by Inder on the mailing lists:
> {quote}
> Folks,
> i was wondering if there is any mechanism/logic to move a node back from the
> excludedNodeList to live nodes to be tried for new block creation.
> In the current DFSOutputStream code i do not see this. The use-case is if the
> write timeout is being reduced and certain nodes get aggressively added to
> the excludedNodeList and the client caches DFSOutputStream then the
> excludedNodes never get tried again in the lifetime of the application
> caching DFSOutputStream
> {quote}
> This leads to a special scenario that may impact smaller clusters more than
> larger ones:
> 1. A file is opened for continuous hflush/sync-based writes, such as an HBase
> WAL. This file is kept open for a very long time, by design.
> 2. Over time, nodes are excluded for various errors, such as DN crashes,
> network failures, etc.
> 3. Eventually, the exclude list equals, or nearly equals, the live nodes
> list, and the write suffers. Once the two lists are equal, the write fails
> with an error of not being able to get a block allocation.
> We should perhaps make the excludeNodes list a timed-cache collection, so
> that even if it begins filling up, the older excludes are pruned away, giving
> those nodes another try later.
> One place we have to be careful about, though, is rack failures.
> Those sometimes never come back fast enough, and can be problematic for
> retry code with such an eventually-forgiving list. Perhaps we can retain
> forgiven nodes and, if they are entered again, double or triple the
> forgiveness value (in time units) to counter this? It's just one idea.
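
The timed-cache idea, together with the "double the forgiveness value" suggestion, can be sketched roughly as follows. This is only an illustration under stated assumptions and is not the attached HDFS-4246.patch: the class name ForgivingExcludeList, its methods, and the base/maximum penalty parameters are all hypothetical, and the caller supplies the current time so the behaviour is easy to reason about. Forgiven nodes stay in the map, so a node that is excluded again gets a longer penalty, up to a cap.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of an exclude list whose entries expire over time.
 * A node that is excluded again has its penalty doubled, up to a cap.
 * All names here are illustrative; this is not HDFS client code.
 */
final class ForgivingExcludeList<N> {
  private static final class Entry {
    long expiresAtMillis; // node is excluded while now < expiresAtMillis
    long penaltyMillis;   // length of the most recent exclusion period
  }

  private final long basePenaltyMillis;
  private final long maxPenaltyMillis;
  // Entries are kept after they expire so repeat offenders are remembered.
  private final Map<N, Entry> entries = new HashMap<>();

  ForgivingExcludeList(long basePenaltyMillis, long maxPenaltyMillis) {
    if (basePenaltyMillis <= 0 || maxPenaltyMillis < basePenaltyMillis) {
      throw new IllegalArgumentException("need 0 < base <= max");
    }
    this.basePenaltyMillis = basePenaltyMillis;
    this.maxPenaltyMillis = maxPenaltyMillis;
  }

  /** Excludes a node starting at nowMillis; returns the penalty applied. */
  synchronized long exclude(N node, long nowMillis) {
    Entry e = entries.get(node);
    if (e == null) {
      e = new Entry();
      e.penaltyMillis = basePenaltyMillis;
      entries.put(node, e);
    } else if (e.penaltyMillis > maxPenaltyMillis / 2) {
      e.penaltyMillis = maxPenaltyMillis; // cap, and avoid overflow
    } else {
      e.penaltyMillis *= 2; // seen before: be less forgiving
    }
    e.expiresAtMillis = nowMillis + e.penaltyMillis;
    return e.penaltyMillis;
  }

  /** True while the node's current exclusion period has not yet elapsed. */
  synchronized boolean isExcluded(N node, long nowMillis) {
    Entry e = entries.get(node);
    return e != null && nowMillis < e.expiresAtMillis;
  }

  /** Snapshot of the nodes still excluded at nowMillis. */
  synchronized Set<N> excludedAt(long nowMillis) {
    Set<N> result = new HashSet<>();
    for (Map.Entry<N, Entry> me : entries.entrySet()) {
      if (nowMillis < me.getValue().expiresAtMillis) {
        result.add(me.getKey());
      }
    }
    return result;
  }
}
```

In this sketch a writer would call exclude(node, now) when a datanode fails and pass excludedAt(now) along when asking for a new block, so older excludes drop out on their own. Doubling rather than tripling, and doubling even when the node is excluded again before its previous penalty has expired, are arbitrary choices made for the example.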