[
https://issues.apache.org/jira/browse/HDFS-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683637#comment-17683637
]
ASF GitHub Bot commented on HDFS-16898:
---------------------------------------
hfutatzhanghb commented on PR #5330:
URL: https://github.com/apache/hadoop/pull/5330#issuecomment-1414734286
> Thanks for the PR @hfutatzhanghb. Curious if you have any thread dumps or
> logs collected (before coming to this conclusion) and would like to share reg
> the issue.
Hi @virajjasani, thanks for your reply. Some logs are shown below.
First, we added some logging in
`BPServiceActor.CommandProcessingThread#processCommand`:

Then we grepped the logs and got the following:

From these logs we can conclude that the processCommandFromActor method can
take a very long time to execute, in some cases more than 119 seconds.
processCommandFromActor holds the same write lock that the
updateActorStatesFromHeartbeat method needs. Because
updateActorStatesFromHeartbeat is called from the offerService method, a slow
command can block the heartbeat thread.
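
To illustrate the contention described above, here is a minimal standalone sketch. It is not Hadoop code: the class `LockScopeSketch`, its methods, and the 500 ms command duration are all hypothetical stand-ins. It only shows how the scope of a shared write lock changes the time a second thread (standing in for the heartbeat thread) waits for that lock. The finer-grained variant is one possible way to narrow the lock and is not necessarily what the PR does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; none of these names exist in Hadoop.
class LockScopeSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Stand-in for bpServiceToActive.
    final Object activeActor = new Object();
    // Signals that the slow command has started running.
    private final CountDownLatch commandStarted = new CountDownLatch(1);

    // Stand-in for a slow processCommandFromActive call.
    private void slowCommand(long millis) {
        commandStarted.countDown();
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Coarse-grained: the write lock is held for the whole command.
    void processCoarse(Object actor, long millis) {
        lock.writeLock().lock();
        try {
            if (actor == activeActor) {
                slowCommand(millis);
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Finer-grained: the lock only covers the read of shared state.
    void processFine(Object actor, long millis) {
        boolean fromActive;
        lock.writeLock().lock();
        try {
            fromActive = (actor == activeActor);
        } finally {
            lock.writeLock().unlock();
        }
        if (fromActive) {
            slowCommand(millis);
        }
    }

    // Stand-in for updateActorStatesFromHeartbeat: returns how many
    // milliseconds the caller waited to acquire the write lock.
    long heartbeatWaitMillis() {
        long start = System.nanoTime();
        lock.writeLock().lock();
        try {
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Runs one command on a separate thread and measures the heartbeat wait.
    static long measure(boolean coarse, long commandMillis) {
        LockScopeSketch s = new LockScopeSketch();
        Thread cmd = new Thread(() -> {
            if (coarse) {
                s.processCoarse(s.activeActor, commandMillis);
            } else {
                s.processFine(s.activeActor, commandMillis);
            }
        });
        cmd.start();
        try {
            s.commandStarted.await();   // the slow command is now running
            long waited = s.heartbeatWaitMillis();
            cmd.join();
            return waited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("coarse lock, heartbeat waited ms: " + measure(true, 500));
        System.out.println("fine lock, heartbeat waited ms: " + measure(false, 500));
    }
}
```

With the coarse lock, the heartbeat stand-in waits for roughly the remaining duration of the command. With the narrower lock, it acquires the lock almost immediately, because the slow work runs outside the critical section.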

We have been running this change in our production cluster, and it works well.
> Make write lock fine-grain in processCommandFromActor method
> ------------------------------------------------------------
>
> Key: HDFS-16898
> URL: https://issues.apache.org/jira/browse/HDFS-16898
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 3.3.4
> Reporter: ZhangHB
> Priority: Major
> Labels: pull-request-available
>
> Now in method processCommandFromActor, we have code like below:
>
> {code:java}
> writeLock();
> try {
>   if (actor == bpServiceToActive) {
>     return processCommandFromActive(cmd, actor);
>   } else {
>     return processCommandFromStandby(cmd, actor);
>   }
> } finally {
>   writeUnlock();
> }
> {code}
> If the processCommandFromActive method takes a long time, the write lock is
> not released until it finishes.
>
> This may block the updateActorStatesFromHeartbeat method in offerService.
> As a result, the datanode's lastcontact can grow very high, and the datanode
> can even be considered dead once lastcontact exceeds 600s.
> {code:java}
> bpos.updateActorStatesFromHeartbeat(
>     this, resp.getNameNodeHaState());
> {code}
> We can make the write lock in the processCommandFromActor method more
> fine-grained to address this problem.
>