[ https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978370#comment-16978370 ]
Stephen O'Donnell commented on HDFS-14997:
------------------------------------------

I think fixing this is a great idea. We have seen many cases where a client gets an error like "failed to close file as the last block has insufficient number of replicas", and in many of them it can be traced to the DN heartbeat thread being blocked for a long time at the "process commands" step. This is almost always because something else is holding the FsDatasetImpl lock, which is needed to process the commands. Pushing the processing of these commands to an async thread is a good idea, as the heartbeat thread then no longer needs to take the lock at all.

I also agree that there are many places in the datanode where the FsDatasetImpl lock is held during IO operations. I suspect there are cases where we could lock on a single volume rather than DN-wide, but I have not taken the time to dig into that. It is certainly something we can look into in further Jiras.

> BPServiceActor process command from NameNode asynchronously
> -----------------------------------------------------------
>
>                 Key: HDFS-14997
>                 URL: https://issues.apache.org/jira/browse/HDFS-14997
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>         Attachments: HDFS-14997.001.patch
>
>
> There are two core functions in the #BPServiceActor main processing flow:
> report (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. If
> #processCommand takes a long time, it blocks the report flow. #processCommand
> can take a long time (over 1000s in the worst case I have seen) when the IO
> load on the DataNode is very high. Because some IO operations run under
> #datasetLock, processing some commands (such as #DNA_INVALIDATE) has to wait
> a long time to acquire #datasetLock. In that case, #heartbeat is not sent to
> the NameNode in time, and this triggers other failures.
> I propose making #processCommand asynchronous so that it does not block
> #BPServiceActor from sending heartbeats back to the NameNode under high IO
> load.
> Notes:
> 1. Lifeline could be one effective solution; however, some old branches do
> not support this feature.
> 2. IO operations under #datasetLock are a separate issue; I think we should
> solve it in another JIRA.
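
To make the proposal concrete, here is a minimal sketch of the general pattern: the heartbeat loop only enqueues commands, and a separate thread takes the dataset lock to run them. This is not the code from HDFS-14997.001.patch. Every name in it (AsyncCommandSketch, CommandProcessor, runDemo) is hypothetical, and a plain ReentrantLock stands in for the FsDatasetImpl lock.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch; none of these names come from Hadoop or the patch. */
public class AsyncCommandSketch {

  /** Runs queued commands on its own thread, so only this thread waits on the lock. */
  static final class CommandProcessor extends Thread {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final ReentrantLock datasetLock;
    private volatile boolean running = true;

    CommandProcessor(ReentrantLock datasetLock) {
      this.datasetLock = datasetLock;
      setDaemon(true);
      setName("command-processor");
    }

    /** Called by the heartbeat thread; never blocks on the dataset lock. */
    void enqueue(Runnable command) {
      queue.add(command);
    }

    void shutdown() {
      running = false;
      interrupt();
    }

    @Override
    public void run() {
      while (running) {
        try {
          Runnable command = queue.take();
          datasetLock.lock(); // may wait a long time under heavy IO
          try {
            command.run();
          } finally {
            datasetLock.unlock();
          }
        } catch (InterruptedException e) {
          return; // shutdown requested
        }
      }
    }
  }

  /**
   * Simulates `rounds` heartbeats while another party (here, the caller) holds
   * the dataset lock. Returns {heartbeats sent, commands processed while the
   * lock was held, commands processed after release (-1 on timeout)}.
   */
  static int[] runDemo(int rounds) throws InterruptedException {
    ReentrantLock datasetLock = new ReentrantLock(); // stand-in for the FsDatasetImpl lock
    CommandProcessor processor = new CommandProcessor(datasetLock);
    processor.start();

    AtomicInteger processed = new AtomicInteger();
    CountDownLatch done = new CountDownLatch(rounds);
    int heartbeats = 0;
    int processedWhileLocked;

    datasetLock.lock(); // simulate a slow IO operation holding the lock
    try {
      for (int i = 0; i < rounds; i++) {
        heartbeats++; // "send heartbeat": does not touch the lock
        processor.enqueue(() -> {
          processed.incrementAndGet();
          done.countDown();
        });
      }
      processedWhileLocked = processed.get();
    } finally {
      datasetLock.unlock();
    }

    boolean drained = done.await(5, TimeUnit.SECONDS);
    processor.shutdown();
    return new int[] {heartbeats, processedWhileLocked, drained ? processed.get() : -1};
  }

  public static void main(String[] args) throws InterruptedException {
    int[] r = runDemo(3);
    System.out.println("heartbeats=" + r[0]
        + " processedWhileLocked=" + r[1]
        + " processedAfterRelease=" + r[2]);
  }
}
```

In this sketch the heartbeat counter advances even while the lock is held, and the queued commands run only once the lock is released. A single consumer thread keeps commands in arrival order. Whether the real change needs that ordering, or a bound on the queue size, is left open here.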