[
https://issues.apache.org/jira/browse/HDFS-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17681706#comment-17681706
]
姚凡 edited comment on HDFS-12639 at 1/29/23 9:39 AM:
----------------------------------------------------
I had the same problem: the processCommandFromActor method holds the write lock
for the whole time it processes a command.
    writeLock();
    try {
      if (actor == bpServiceToActive) {
        return processCommandFromActive(cmd, actor);
      } else {
        return processCommandFromStandby(cmd, actor);
      }
    } finally {
      writeUnlock();
    }
When a large number of files are deleted, deleting a batch of blocks can be
slow due to the OS or other factors. Because the write lock is held for the
whole deletion, this blocks updateActorStatesFromHeartbeat and IBR.
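The stall can be reproduced in isolation. The following is a minimal sketch, not HDFS code: the class name LockStallDemo, the method heartbeatWaitMillis, and the use of Thread.sleep as a stand-in for slow block deletion are all assumptions made for illustration. It only shows that a second thread asking for the same write lock waits for roughly as long as the first thread's slow work takes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; it does not exist in Hadoop.
class LockStallDemo {

    /**
     * Returns how long (in ms) the calling thread waited for the write lock
     * while another thread held it for about commandMillis.
     */
    static long heartbeatWaitMillis(long commandMillis) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch lockHeld = new CountDownLatch(1);

        // Plays the role of the command processor: slow work under the write lock.
        Thread command = new Thread(() -> {
            lock.writeLock().lock();
            try {
                lockHeld.countDown();
                Thread.sleep(commandMillis); // stand-in for slow block deletion
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.writeLock().unlock();
            }
        });
        command.start();
        lockHeld.await(); // make sure the command thread owns the lock first

        // Plays the role of the heartbeat thread asking for the same write lock.
        long start = System.nanoTime();
        lock.writeLock().lock();
        long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        lock.writeLock().unlock();

        command.join();
        return waited;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("heartbeat waited about " + heartbeatWaitMillis(300) + " ms");
    }
}
```

With a 300 ms simulated command, the reported wait should be close to 300 ms. In the logs here the commands take several minutes, and the heartbeat thread is parked for that long.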
14:11:59,213 | INFO | Command processor | Took 545492 ms to process 2 commands from NN | BPServiceActor.java:1409
14:22:57,558 | INFO | Command processor | Took 658345 ms to process 2 commands from NN | BPServiceActor.java:1409
14:31:15,231 | INFO | Command processor | Took 497659 ms to process 1 commands from NN | BPServiceActor.java:1409
14:40:38,864 | INFO | Command processor | Took 563633 ms to process 2 commands from NN | BPServiceActor.java:1409
14:46:05,669 | INFO | Command processor | Took 325944 ms to process 2 commands from NN | BPServiceActor.java:1409
"BP-571631763-172.16.5.5-1479839737541 heartbeating to xxxxxxxx" #82 daemon prio=5 os_prio=98 tid=0x00007fdb3459c800 nid=0x28143f waiting on condition [0x60007fdb389ed000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <9x9999699641647fad6> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:838)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:872)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1201)
    at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.writeLock(BPOfferService.java:120)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.updateActorStatesFromHeartbeat(BPOfferService.java:580)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:676)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:862)
    at java.lang.Thread.run(Thread.java:748)
Is it necessary for processCommandFromActor to hold the lock?
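One way to explore that question is to hold the lock only while deciding whether the command comes from the active actor, and to run the command after the lock is released. The sketch below is hypothetical and is not the HDFS implementation or a proposed patch: NarrowLockSketch, setActive, processCommand, holdsLock, and the handler parameter are invented for illustration, and plain Object values stand in for the actors.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

// Hypothetical sketch; none of these names exist in Hadoop.
class NarrowLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private Object activeActor; // guarded by lock; stand-in for bpServiceToActive

    // Updating the active actor still requires the write lock.
    void setActive(Object actor) {
        lock.writeLock().lock();
        try {
            activeActor = actor;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // True if the calling thread currently holds the read or write lock.
    boolean holdsLock() {
        return lock.isWriteLockedByCurrentThread() || lock.getReadHoldCount() > 0;
    }

    // Decides active vs. standby under the read lock, then runs the
    // (possibly slow) handler with no lock held.
    boolean processCommand(Object actor, Function<Boolean, Boolean> handler) {
        boolean fromActive;
        lock.readLock().lock();
        try {
            fromActive = (actor == activeActor);
        } finally {
            lock.readLock().unlock();
        }
        // The lock is released here, so heartbeat updates are not blocked
        // while the handler runs.
        return handler.apply(fromActive);
    }
}
```

The trade-off is that the active actor may change between the check and the end of the handler. Whether that window is acceptable for commands such as block deletion or re-registration is exactly the open question, and this sketch does not answer it.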
was (Author: yaofan):
I had the same problem: the processCommandFromActor method holds the write lock.
!image-2023-01-29-17-18-21-014.png|width=532,height=256!
When a large number of files are deleted, deleting a batch of blocks can be
slow due to the OS or other factors, and this blocks
updateActorStatesFromHeartbeat and IBR.
!image-2023-01-29-17-27-15-076.png|width=790,height=72!
!image-2023-01-29-17-23-42-979.png|width=786,height=181!
!image-2023-01-29-17-20-59-350.png|width=492,height=308!
Is it necessary for processCommandFromActor to hold the lock?
> BPOfferService lock may stall all service actors
> ------------------------------------------------
>
> Key: HDFS-12639
> URL: https://issues.apache.org/jira/browse/HDFS-12639
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.8.0
> Reporter: Daryn Sharp
> Assignee: Hanisha Koneru
> Priority: Major
>
> {{BPOfferService}} manages {{BPServiceActor}} instances for the active and
> standby. It uses a RW lock to primarily protect registration information
> while determining the active/standby from heartbeats.
> Unfortunately the write lock is held during command processing. If an actor
> is experiencing high latency processing commands, the other actor will
> neither be able to register (blocked in createRegistration, setNamespaceInfo,
> verifyAndSetNamespaceInfo) nor process heartbeats (blocked in
> updateActorStatesFromHeartbeat).
> The worst case scenario for processing commands while holding the lock is
> re-registration. The actor will loop, catching and logging exceptions,
> leaving the other actor blocked for a non-deterministic (possibly infinite)
> amount of time.
> The lock must not be held during command processing.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)