[ https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979453#comment-16979453 ]
Íñigo Goiri commented on HDFS-14997:
------------------------------------

We should add this to the metrics to be able to track if we are queueing too many commands, etc.
Other minor comments:
* We should add a javadoc to the methods in CommandProcessingThread, especially to processQueue(), even though it is private.
* For processQueue(), I prefer while instead of do/while.
* processQueue() should use {{numProcessCommands++}}. Let's also make it {{numProcessedCommands}}.
* Do we need to do take and then poll?
* In the interrupted case, we should log with debug in the other cases (use also the logger {} format). If it is interrupted, shouldn't shouldRun() return false so no need to break?
* We should extend CommandProcessingThread#enqueue() to support {{List<DatanodeCommand>}} and {{DatanodeCommand}} as arguments so we don't need to do the transformations in the part where we add it.
* {{processCommand(DatanodeCommand[] cmds)}} is kind of repeated now. Should we merge the new and the old together?

> BPServiceActor process command from NameNode asynchronously
> -----------------------------------------------------------
>
>                 Key: HDFS-14997
>                 URL: https://issues.apache.org/jira/browse/HDFS-14997
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>         Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch
>
>
> There are two core functions in the #BPServiceActor main process flow:
> report (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand.
> If processCommand takes a long time, it blocks the report flow. Meanwhile,
> processCommand can take a long time (over 1000s in the worst case I have
> seen) when the IO load of the DataNode is very high. Since some IO operations
> run under #datasetLock, processing some commands (such as #DNA_INVALIDATE)
> has to wait a long time to acquire #datasetLock. In such cases, #heartbeat
> is not sent to the NameNode in time, which triggers other disasters.
> I propose making #processCommand asynchronous so that it does not block
> #BPServiceActor from sending heartbeats back to the NameNode under high IO
> load.
> Notes:
> 1. Lifeline could be one effective solution; however, some old branches do
> not support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should
> solve it in another JIRA.



--
This message was sent by Atlassian Jira (v8.3.4#803005)
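
For illustration only, the loop shape suggested in the review comments of this message could look roughly like the standalone sketch below. It is not taken from the HDFS-14997 patches: the class name CommandQueueSketch, the generic command type, the handler callback, and the awaitProcessed() helper are all hypothetical, and the real CommandProcessingThread may differ. The sketch assumes a single blocking take() per iteration, a plain while loop whose condition ends the loop after an interrupt, a numProcessedCommands counter, a queue-length getter that a metric could read, and enqueue() overloads for a single command and a list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names come from the actual patch. */
class CommandQueueSketch<C> extends Thread {
  private final BlockingQueue<List<C>> queue = new LinkedBlockingQueue<>();
  private final Consumer<C> handler;
  private final AtomicLong numProcessedCommands = new AtomicLong();
  private volatile boolean running = true;

  CommandQueueSketch(Consumer<C> handler) {
    super("CommandQueueSketch");
    this.handler = handler;
    setDaemon(true);
  }

  /** Enqueues one command; the caller (e.g. a heartbeat loop) never blocks. */
  void enqueue(C cmd) {
    enqueue(Collections.singletonList(cmd));
  }

  /** Enqueues a batch of commands; the list is copied defensively. */
  void enqueue(List<C> cmds) {
    queue.add(new ArrayList<>(cmds));
  }

  /** Drains the queue until shutdown() is called or the thread is interrupted. */
  @Override
  public void run() {
    while (running) {
      try {
        // One blocking take() per iteration; no follow-up poll().
        List<C> batch = queue.take();
        for (C cmd : batch) {
          handler.accept(cmd);
          numProcessedCommands.incrementAndGet();
        }
      } catch (InterruptedException e) {
        // The loop condition ends the loop, so no explicit break is needed.
        running = false;
        Thread.currentThread().interrupt();
      } catch (RuntimeException e) {
        // A failing command should not kill the processing thread.
        System.err.println("Failed to process command batch: " + e);
      }
    }
  }

  /** Number of commands handled so far. */
  long getNumProcessedCommands() {
    return numProcessedCommands.get();
  }

  /** Number of batches still waiting; a metric could expose this value. */
  int getQueueLength() {
    return queue.size();
  }

  /** Asks the thread to stop and wakes it up if it is blocked in take(). */
  void shutdown() {
    running = false;
    interrupt();
  }

  /** Helper for demos and tests: waits until the counter reaches expected. */
  boolean awaitProcessed(long expected, long timeoutMillis) {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (numProcessedCommands.get() < expected) {
      if (System.currentTimeMillis() > deadline) {
        return false;
      }
      try {
        Thread.sleep(5);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<String> seen = Collections.synchronizedList(new ArrayList<>());
    CommandQueueSketch<String> t = new CommandQueueSketch<>(seen::add);
    t.start();
    t.enqueue("cmd-1");
    t.enqueue(java.util.Arrays.asList("cmd-2", "cmd-3"));
    t.awaitProcessed(3, 2000);
    System.out.println(seen + " processed=" + t.getNumProcessedCommands());
    t.shutdown();
  }
}
```

In this sketch the producer only appends to the queue, so a slow handler (for example one waiting on a lock) delays later commands but not the caller of enqueue().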