[ 
https://issues.apache.org/jira/browse/HDFS-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683637#comment-17683637
 ] 

ASF GitHub Bot commented on HDFS-16898:
---------------------------------------

hfutatzhanghb commented on PR #5330:
URL: https://github.com/apache/hadoop/pull/5330#issuecomment-1414734286

   > Thanks for the PR @hfutatzhanghb. Curious if you have any thread dumps or 
logs collected (before coming to this conclusion) and would like to share reg 
the issue.
   
   Hi @virajjasani, thanks for your reply. Here are some of the logs we collected.
   First, we added some logging in 
`BPServiceActor.CommandProcessingThread#processCommand`:
   
   
![image](https://user-images.githubusercontent.com/25115709/216499334-66fb3f87-05c8-4baa-b2c1-5f8bba58e7b4.png)
   
   Then we grepped the logs and got the following:
   
   
![image](https://user-images.githubusercontent.com/25115709/216498739-db2b23c4-765d-4d54-b23f-428947454914.png)
   
   From these logs we can conclude that the processCommandFromActor method can 
take a very long time to run, in some cases more than 119 seconds. The 
processCommandFromActor method holds the same write lock that the 
updateActorStatesFromHeartbeat method acquires. Because 
updateActorStatesFromHeartbeat is called from the offerService method, a slow 
command can block the heartbeat thread.
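
   As a rough illustration of this contention, here is a stand-alone sketch. It is not Hadoop code: the class `LockContentionDemo` and its methods are hypothetical stand-ins, with `processCommand` playing the role of processCommandFromActor and `updateStatesFromHeartbeat` playing the role of updateActorStatesFromHeartbeat. It only assumes that both paths take the same write lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the contention described above; not Hadoop code.
class LockContentionDemo {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Stands in for processCommandFromActor: the whole command runs under the write lock.
  void processCommand(long workMillis, CountDownLatch lockHeld) {
    lock.writeLock().lock();
    try {
      lockHeld.countDown();      // signal that the write lock is now held
      Thread.sleep(workMillis);  // simulated slow command
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Stands in for updateActorStatesFromHeartbeat: returns how long it waited for the lock (ms).
  long updateStatesFromHeartbeat() {
    long start = System.nanoTime();
    lock.writeLock().lock();
    try {
      return (System.nanoTime() - start) / 1_000_000;
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Runs one slow command on a separate thread and one heartbeat update on the caller's thread.
  static long measureHeartbeatWait(long workMillis) {
    LockContentionDemo demo = new LockContentionDemo();
    CountDownLatch lockHeld = new CountDownLatch(1);
    Thread commandThread = new Thread(() -> demo.processCommand(workMillis, lockHeld));
    commandThread.start();
    try {
      lockHeld.await();                                // command thread holds the lock
      long waited = demo.updateStatesFromHeartbeat();  // blocks until the command finishes
      commandThread.join();
      return waited;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

   In this sketch, `LockContentionDemo.measureHeartbeatWait(300)` returns a wait close to the full 300 ms, because the heartbeat path cannot take the lock until the slow command releases it.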
   
   
![image](https://user-images.githubusercontent.com/25115709/216500941-268a7eab-8988-4ebc-b455-481f6fa850b8.png)
   
   We have been running this change in our production cluster, and it works well. 
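
   For readers who want a concrete picture of what narrowing the lock could look like, below is a minimal sketch under stated assumptions. It is not necessarily what PR #5330 does, and it is not Hadoop code: `NarrowLockSketch`, `setActive`, `processCommand`, and `isLocked` are hypothetical names. The idea shown is to hold the lock only while checking which actor is active, and to run the potentially slow command processing after the lock is released.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a narrower lock scope; not the actual BPOfferService code.
class NarrowLockSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private Object activeActor; // guarded by lock

  // Stands in for the state update done on the heartbeat path.
  void setActive(Object actor) {
    lock.writeLock().lock();
    try {
      activeActor = actor;
    } finally {
      lock.writeLock().unlock();
    }
  }

  // The lock covers only the "is this actor active?" check.
  // The command itself (possibly slow) runs with no lock held.
  boolean processCommand(Object actor, BooleanSupplier fromActive, BooleanSupplier fromStandby) {
    boolean isActive;
    lock.readLock().lock();
    try {
      isActive = (actor == activeActor);
    } finally {
      lock.readLock().unlock();
    }
    return isActive ? fromActive.getAsBoolean() : fromStandby.getAsBoolean();
  }

  // Helper for observing whether any thread currently holds the lock.
  boolean isLocked() {
    return lock.isWriteLocked() || lock.getReadLockCount() > 0;
  }
}
```

   The trade-off in this sketch is that the active actor may change between the check and the command processing, so whether this is safe depends on what the command handlers assume about that state.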
   
   




> Make write lock fine-grain in processCommandFromActor method
> ------------------------------------------------------------
>
>                 Key: HDFS-16898
>                 URL: https://issues.apache.org/jira/browse/HDFS-16898
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.3.4
>            Reporter: ZhangHB
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, the processCommandFromActor method contains the following code:
>  
> {code:java}
> writeLock();
> try {
>   if (actor == bpServiceToActive) {
>     return processCommandFromActive(cmd, actor);
>   } else {
>     return processCommandFromStandby(cmd, actor);
>   }
> } finally {
>   writeUnlock();
> }
> {code}
> If the processCommandFromActive method takes a long time, the write lock is 
> not released until it returns.
>  
> This may block the updateActorStatesFromHeartbeat method in offerService. 
> As a result, the last contact time of the DataNode can become very high, and 
> the DataNode can even be considered dead once last contact exceeds 600s.
> {code:java}
> bpos.updateActorStatesFromHeartbeat(
>     this, resp.getNameNodeHaState());
> {code}
> Here we can make the write lock in the processCommandFromActor method 
> finer-grained to address this problem.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
