[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181563#comment-17181563
 ] 

zy.jordan commented on HDFS-14997:
----------------------------------

Hi [~hexiaoqiao],

From this code, it looks like commands are processed asynchronously instead of 
in the heartbeat thread, which looks perfect.

But looking closer, I found that the heartbeat thread calls 
bpos.updateActorStatesFromHeartbeat(), and updateActorStatesFromHeartbeat() 
holds the write lock:
{code:java}
private void offerService() throws Exception {
  ...
  while (shouldRun()) {
    ...
    bpos.updateActorStatesFromHeartbeat(this, resp.getNameNodeHaState());
    ...
  }
  ...
}{code}
 
{code:java}
void updateActorStatesFromHeartbeat(BPServiceActor actor,
    NNHAStatusHeartbeat nnHaState) {
  writeLock();
  try {
    ...
  } finally {
    writeUnlock();
  }
  ...
}
{code}
And in the async command-processing thread, processCommandFromActor() also holds the same lock:
{code:java}
boolean processCommandFromActor(DatanodeCommand cmd, BPServiceActor actor)
    throws IOException {
  ...
  writeLock();
  try {
    ...
  } finally {
    writeUnlock();
  }
}
{code}
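In my view, command processing therefore does not look truly asynchronous: the 
heartbeat thread can still block on this lock while a command is being processed.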
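
To illustrate the concern, here is a minimal standalone sketch, not the actual DataNode code. It assumes, as the snippets above suggest, that both methods hold the same write lock for their whole body. The class SharedLockDemo and its methods are hypothetical stand-ins for BPOfferService, processCommandFromActor() and updateActorStatesFromHeartbeat(), and the slow command is simulated with a sleep.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SharedLockDemo {
  // Hypothetical stand-in for the lock shared by both code paths.
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Models processCommandFromActor(): holds the write lock for the whole command.
  void processCommand(long workMillis, CountDownLatch lockHeld)
      throws InterruptedException {
    lock.writeLock().lock();
    try {
      lockHeld.countDown();      // signal that the lock is now held
      Thread.sleep(workMillis);  // simulated slow command (e.g. heavy IO)
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Models updateActorStatesFromHeartbeat(): returns the milliseconds spent
  // waiting to acquire the write lock.
  long updateStateFromHeartbeat() {
    long start = System.nanoTime();
    lock.writeLock().lock();
    try {
      return (System.nanoTime() - start) / 1_000_000;
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Runs one slow command on its own thread, then performs a heartbeat state
  // update on the calling thread and reports how long that update was blocked.
  static long heartbeatWaitMillis(long commandMillis) throws InterruptedException {
    SharedLockDemo demo = new SharedLockDemo();
    CountDownLatch lockHeld = new CountDownLatch(1);
    Thread commandThread = new Thread(() -> {
      try {
        demo.processCommand(commandMillis, lockHeld);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    commandThread.start();
    lockHeld.await();  // the command thread now owns the write lock
    long waited = demo.updateStateFromHeartbeat();
    commandThread.join();
    return waited;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("heartbeat waited ~" + heartbeatWaitMillis(300) + " ms");
  }
}
```

In this sketch the heartbeat-side update waits for roughly as long as the command holds the lock, even though the command runs on a separate thread. Whether the real code behaves this way depends on how much work is actually done inside the locked sections elided with "..." above.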

 

> BPServiceActor processes commands from NameNode asynchronously
> --------------------------------------------------------------
>
>                 Key: HDFS-14997
>                 URL: https://issues.apache.org/jira/browse/HDFS-14997
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch, 
> HDFS-14997.addendum.patch, image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions in the #BPServiceActor main process flow: 
> report (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. If 
> processCommand takes a long time, it blocks the report flow. processCommand 
> can take a long time (over 1000s in the worst case I have seen) when the IO 
> load of the DataNode is very high. Since some IO operations run under 
> #datasetLock, processing some commands (such as #DNA_INVALIDATE) has to wait 
> a long time to acquire #datasetLock. In such cases, the #heartbeat is not 
> sent to the NameNode in time, which triggers other disasters.
> I propose making #processCommand asynchronous so that it does not block 
> #BPServiceActor from sending heartbeats back to the NameNode under high IO load.
> Notes:
> 1. Lifeline could be one effective solution; however, some old branches do 
> not support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should solve 
> it in another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
