[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003471#comment-17003471
 ] 

Ayush Saxena edited comment on HDFS-14997 at 12/26/19 5:31 AM:
---------------------------------------------------------------

Hi [~hexiaoqiao] [~elgoiri]

It seems a couple of tests are failing due to a heap issue. According to the 
hprof file, 60 percent of the memory is occupied by the DataNode object. 
Could you take a look? I suspect it is related.

Ref: https://builds.apache.org/job/PreCommit-HDFS-Build/28549/testReport/org.apache.hadoop.hdfs/TestFileChecksum/testStripedFileChecksum3/

There was a similar failure in the Jenkins result here too.


> BPServiceActor processes commands from NameNode asynchronously
> --------------------------------------------------------------
>
>                 Key: HDFS-14997
>                 URL: https://issues.apache.org/jira/browse/HDFS-14997
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch
>
>
> There are two core functions in the #BPServiceActor main process flow: 
> report (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. If 
> #processCommand takes a long time, it blocks the report flow. Meanwhile, 
> #processCommand can take a long time (over 1000s in the worst case I have 
> seen) when the IO load of the DataNode is very high. Since some IO operations 
> run under #datasetLock, processing some commands (such as #DNA_INVALIDATE) 
> has to wait a long time to acquire #datasetLock. In such cases, #heartbeat 
> is not sent to the NameNode in time, which triggers other disasters.
> I propose making #processCommand asynchronous so that it does not block 
> #BPServiceActor from sending heartbeats back to the NameNode under high IO 
> load.
> Notes:
> 1. Lifeline could be one effective solution; however, some old branches do 
> not support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should 
> solve it in another JIRA.
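
The proposal in the description can be illustrated with a minimal sketch: the heartbeat loop hands each command to a queue and returns immediately, while a separate worker thread drains the queue. This is only an illustration under stated assumptions (one worker thread, an unbounded queue); the names AsyncCommandSketch, Command, CommandProcessor and heartbeatsWhileCommandPending are hypothetical and are not taken from the HDFS-14997 patches.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of asynchronous command processing; not the actual patch. */
class AsyncCommandSketch {

  /** Hypothetical stand-in for a NameNode command such as DNA_INVALIDATE. */
  interface Command {
    void run() throws InterruptedException;
  }

  /** Drains queued commands on its own thread so the caller never waits on them. */
  static final class CommandProcessor implements Runnable {
    private final BlockingQueue<Command> queue = new LinkedBlockingQueue<>();
    private final Thread worker = new Thread(this, "command-processor");
    private volatile boolean running = true;

    void start() {
      worker.setDaemon(true);
      worker.start();
    }

    /** Called from the heartbeat loop; returns without running the command. */
    void enqueue(Command command) {
      queue.add(command);
    }

    void stop() throws InterruptedException {
      running = false;
      worker.interrupt();
      worker.join();
    }

    @Override
    public void run() {
      while (running) {
        try {
          queue.take().run();
        } catch (InterruptedException e) {
          // stop() interrupts the worker; the loop condition decides whether to exit.
        }
      }
    }
  }

  /**
   * Enqueues one command that stays blocked (as if waiting on a contended lock)
   * and counts how many heartbeats the caller can still send in the meantime.
   */
  static int heartbeatsWhileCommandPending(int rounds) throws InterruptedException {
    CommandProcessor processor = new CommandProcessor();
    processor.start();
    CountDownLatch release = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(1);
    processor.enqueue(() -> {
      release.await();   // simulates a long wait, e.g. for a dataset lock
      done.countDown();
    });

    int sent = 0;
    for (int i = 0; i < rounds; i++) {
      sent++;            // stands in for sending a heartbeat to the NameNode
    }
    boolean finishedEarly = done.getCount() == 0;

    release.countDown(); // let the slow command finish
    boolean finished = done.await(5, TimeUnit.SECONDS);
    processor.stop();
    if (finishedEarly || !finished) {
      throw new IllegalStateException("command did not behave as expected");
    }
    return sent;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("heartbeats sent while command pending: "
        + heartbeatsWhileCommandPending(3));
  }
}
```

In this sketch the heartbeat counter advances while the command is still blocked, which is the behaviour the description asks for. A real implementation would also need to consider command ordering, queue bounds, and shutdown, none of which are addressed here.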



