[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978530#comment-16978530
 ] 

Stephen O'Donnell commented on HDFS-14997:
------------------------------------------

Yes, we often bump dfs.client.block.write.locateFollowingBlock.retries to 10, 
but in extreme circumstances even that is not enough, and it causes clients to 
hang for a long time (several hundred seconds with a setting of 10, I think). 
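
A rough sketch of where a figure like "several hundred seconds" could come 
from. It assumes the client sleeps between locateFollowingBlock retries with a 
delay that starts at 400 ms and doubles after each retry. Both the 400 ms 
starting value and the doubling are assumptions made for illustration, and 
newer clients may cap the delay, so check the behaviour of your version.

```java
public class LocateFollowingBlockBackoff {

    // Total time slept across `retries` retries when the delay starts at
    // initialDelayMs and doubles after every retry (assumed backoff policy).
    static long totalWaitMs(int retries, long initialDelayMs) {
        long total = 0;
        long delay = initialDelayMs;
        for (int i = 0; i < retries; i++) {
            total += delay;
            delay *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        long initialDelayMs = 400; // assumed initial delay, not taken from this thread
        for (int retries : new int[] {5, 10}) {
            long totalMs = totalWaitMs(retries, initialDelayMs);
            System.out.println("retries=" + retries + " -> " + totalMs + " ms");
        }
    }
}
```

Under those assumptions, 5 retries add up to 12.4 seconds of sleeping and 10 
retries to about 409 seconds.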

Most of the commands are actually processed asynchronously already. I had a 
quick look and I believe Transfer, Invalidate, Finalize and Cache, at least, 
ultimately submit the command to their own ExecutorService. The issue is that 
getting each command onto its own queue needs that global lock.
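
To illustrate the shape of the problem, here is a minimal standalone sketch, 
not the DataNode code. The names datasetLock, invalidateExecutor and dispatch 
are hypothetical stand-ins. The work itself runs on its own executor, but the 
caller still has to take the shared lock before it can hand the work over, so 
the caller stalls for as long as another thread holds that lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

public class CommandDispatchSketch {
    // Hypothetical stand-in for the DataNode-wide lock.
    static final ReentrantLock datasetLock = new ReentrantLock();

    // Hypothetical per-command executor; daemon thread so the JVM can exit.
    static final ExecutorService invalidateExecutor =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "invalidate");
            t.setDaemon(true);
            return t;
        });

    // Queues the work asynchronously, but only after taking the global lock.
    // Returns the milliseconds the caller spent waiting for the lock.
    static long dispatch(Runnable work) {
        long start = System.nanoTime();
        datasetLock.lock();
        try {
            long waitedMs = (System.nanoTime() - start) / 1_000_000;
            invalidateExecutor.submit(work);
            return waitedMs;
        } finally {
            datasetLock.unlock();
        }
    }

    // Simulates an IO-heavy thread holding the lock for holdMs while a
    // command is being dispatched from the calling thread.
    static long dispatchUnderContention(long holdMs) {
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread ioThread = new Thread(() -> {
            datasetLock.lock();
            try {
                lockHeld.countDown();
                Thread.sleep(holdMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                datasetLock.unlock();
            }
        }, "slow-io");
        ioThread.start();
        try {
            lockHeld.await();
            long waitedMs = dispatch(() -> { });
            ioThread.join();
            return waitedMs;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("uncontended wait ms: " + dispatch(() -> { }));
        System.out.println("contended wait ms:   " + dispatchUnderContention(300));
    }
}
```

In the contended case the dispatching thread waits roughly as long as the 
lock is held, even though the work it submits is trivial.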

I still think we should improve this area, but if a DN is under such high load 
that it is showing this problem, I do wonder if the commands will start to back 
up in the new thread, as they ultimately still need that lock. Therefore the 
work Aiphago is doing to decrease lock contention at that level would go 
nicely with this change.
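
A small sketch of that concern, under stated assumptions: it is a toy model 
with hypothetical names (datasetLock, simulate, a "command-processor" thread), 
not the patch attached to this issue. The heartbeat loop only enqueues commands 
and never blocks, while a separate thread processes them under the shared 
lock. While the lock is held elsewhere, heartbeats keep flowing but the queue 
grows.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class AsyncCommandSketch {
    // Hypothetical stand-in for the DataNode-wide lock.
    static final ReentrantLock datasetLock = new ReentrantLock();

    // Returns {heartbeats sent, commands queued while the lock was held,
    // commands processed after the lock was released}.
    static int[] simulate(int heartbeats) {
        BlockingQueue<Runnable> commands = new LinkedBlockingQueue<>();
        CountDownLatch processed = new CountDownLatch(heartbeats);

        // Command-processing thread: every command still needs the lock.
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    Runnable command = commands.take();
                    datasetLock.lock();
                    try {
                        command.run();
                    } finally {
                        datasetLock.unlock();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "command-processor");
        worker.setDaemon(true);

        int sent = 0;
        int backlog;
        datasetLock.lock(); // this thread plays the role of heavy IO holding the lock
        try {
            worker.start();
            for (int i = 0; i < heartbeats; i++) {
                sent++;                             // heartbeat goes out without waiting
                commands.add(processed::countDown); // command returned with the heartbeat
            }
            backlog = commands.size();              // commands stuck behind the lock
        } finally {
            datasetLock.unlock();
        }

        try {
            processed.await(5, TimeUnit.SECONDS);   // let the worker drain the queue
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        worker.interrupt();
        int done = heartbeats - (int) processed.getCount();
        return new int[] {sent, backlog, done};
    }

    public static void main(String[] args) {
        int[] result = simulate(5);
        System.out.println("heartbeats sent:        " + result[0]);
        System.out.println("backlog while locked:   " + result[1]);
        System.out.println("processed after unlock: " + result[2]);
    }
}
```

The heartbeats are no longer delayed, but the backlog only drains once the 
lock becomes available, which is why reducing contention on the lock itself 
still matters.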

> BPServiceActor process command from NameNode asynchronously
> -----------------------------------------------------------
>
>                 Key: HDFS-14997
>                 URL: https://issues.apache.org/jira/browse/HDFS-14997
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>         Attachments: HDFS-14997.001.patch
>
>
> There are two core functions in the #BPServiceActor main process flow: 
> reporting (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. 
> If #processCommand takes a long time, it blocks the reporting flow. Meanwhile, 
> #processCommand can take a long time (over 1000s in the worst case I have 
> seen) when the IO load on the DataNode is very high. Since some IO operations 
> are performed under #datasetLock, it can take a long time to acquire 
> #datasetLock when processing some commands (such as #DNA_INVALIDATE). In such 
> cases, the #heartbeat is not sent to the NameNode in time, which triggers 
> other disasters.
> I propose making #processCommand asynchronous so that it does not block 
> #BPServiceActor from sending heartbeats back to the NameNode under high IO 
> load.
> Notes:
> 1. Lifeline could be one effective solution; however, some old branches do 
> not support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should 
> solve it in another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
