[jira] [Comment Edited] (HDFS-12639) BPOfferService lock may stall all service actors

Jira Sun, 29 Jan 2023 01:40:05 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17681706#comment-17681706
 ]


姚凡 edited comment on HDFS-12639 at 1/29/23 9:39 AM:
----------------------------------------------------

I hit the same problem: the processCommandFromActor method holds the write lock while it processes a command.
    writeLock();
    try {
      if (actor == bpServiceToActive) {
        return processCommandFromActive(cmd, actor);
      } else {
        return processCommandFromStandby(cmd, actor);
      }
    } finally {
      writeUnlock();
    }

When a large number of files are deleted, deleting a batch of blocks can be 
slow because of the OS or other factors. While that batch is being processed, 
the held lock blocks updateActorStatesFromHeartbeat and incremental block 
reports (IBRs).
 14:11:59,213 | INFO | Command processor | Took 545492 ms to process 2 commands from NN | BPServiceActor.java:1409
 14:22:57,558 | INFO | Command processor | Took 658345 ms to process 2 commands from NN | BPServiceActor.java:1409
 14:31:15,231 | INFO | Command processor | Took 497659 ms to process 1 commands from NN | BPServiceActor.java:1409
 14:40:38,864 | INFO | Command processor | Took 563633 ms to process 2 commands from NN | BPServiceActor.java:1409
 14:46:05,669 | INFO | Command processor | Took 325944 ms to process 2 commands from NN | BPServiceActor.java:1409

 "BP-571631763-172.16.5.5-1479839737541 heartbeating to xxxxxxxx" #82 daemon prio=5 os_prio=98 tid=0x00007fdb3459c800 nid=0x28143f waiting on condition [0x00007fdb389ed000]
    java.lang.Thread.State: WAITING (parking)
     at sun.misc.Unsafe.park(Native Method)
     - parking to wait for <9x9999699641647fad6> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:838)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:872)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1201)
     at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
     at org.apache.hadoop.hdfs.server.datanode.BPOfferService.writeLock(BPOfferService.java:120)
     at org.apache.hadoop.hdfs.server.datanode.BPOfferService.updateActorStatesFromHeartbeat(BPOfferService.java:580)
     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:676)
     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:862)
     at java.lang.Thread.run(Thread.java:748)
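
The stall seen in the logs and thread dump above can be reproduced in isolation. The following is a minimal sketch, not DataNode code: the class name LockStallDemo and the method measureHeartbeatWaitMillis are made up for illustration, and Thread.sleep stands in for a slow block deletion. One thread holds the write lock of a ReentrantReadWriteLock while it "processes a command", and the caller, playing the heartbeat thread, measures how long it waits for the same write lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; it only mimics the locking pattern, not BPOfferService itself.
public class LockStallDemo {

    /** Returns roughly how many milliseconds the "heartbeat" caller waited for the write lock. */
    public static long measureHeartbeatWaitMillis(long commandMillis) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch commandHoldsLock = new CountDownLatch(1);

        Thread commandProcessor = new Thread(() -> {
            lock.writeLock().lock();
            try {
                commandHoldsLock.countDown();   // signal that the lock is now held
                Thread.sleep(commandMillis);    // stand-in for a slow block deletion
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.writeLock().unlock();
            }
        }, "Command processor");
        commandProcessor.start();

        commandHoldsLock.await();               // make sure the command thread got the lock first
        long start = System.nanoTime();
        lock.writeLock().lock();                // the heartbeat path also starts by taking the write lock
        try {
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            lock.writeLock().unlock();
            commandProcessor.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("heartbeat waited about " + measureHeartbeatWaitMillis(300) + " ms");
    }
}
```

The measured wait tracks the command duration, which matches the pattern above: the heartbeat thread is parked in writeLock() for as long as the command processor keeps the lock.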

 

Is it necessary for processCommandFromActor to hold the lock?
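
One possible shape for a narrower lock scope, offered only as a sketch and not as the actual fix: take the lock just long enough to decide whether the command came from the active or the standby, then release it before doing the slow work. The class NarrowLockSketch, its String-returning methods, and the locksHeldWhileProcessing probe are all hypothetical. It also assumes that only the active/standby decision needs the lock, which may not hold for the real processCommandFromActive, and it accepts that the active actor could change between the check and the processing.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch; names mirror the snippet above but this is not Hadoop code.
public class NarrowLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private Object bpServiceToActive;        // guarded by lock
    private int locksHeldWhileProcessing;    // probe used only to check the sketch

    public void setActive(Object actor) {
        lock.writeLock().lock();
        try {
            bpServiceToActive = actor;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public String processCommandFromActor(String cmd, Object actor) {
        boolean fromActive;
        lock.readLock().lock();              // held only for the role check
        try {
            fromActive = (actor == bpServiceToActive);
        } finally {
            lock.readLock().unlock();
        }
        // The lock is released here, so heartbeats and registration are not blocked
        // while the (possibly slow) command runs.
        return fromActive ? processCommandFromActive(cmd) : processCommandFromStandby(cmd);
    }

    private String processCommandFromActive(String cmd) {
        recordHeldLocks();
        return "active:" + cmd;
    }

    private String processCommandFromStandby(String cmd) {
        recordHeldLocks();
        return "standby:" + cmd;
    }

    // Counts any read or write holds the current thread still has during processing.
    private void recordHeldLocks() {
        locksHeldWhileProcessing += lock.getReadHoldCount() + lock.getWriteHoldCount();
    }

    public int locksHeldWhileProcessing() {
        return locksHeldWhileProcessing;
    }

    public static void main(String[] args) {
        NarrowLockSketch service = new NarrowLockSketch();
        Object nn1 = new Object();
        Object nn2 = new Object();
        service.setActive(nn1);
        System.out.println(service.processCommandFromActor("invalidate", nn1));
        System.out.println(service.processCommandFromActor("invalidate", nn2));
    }
}
```

Whether this is safe depends on what the command handlers read and write, so the question above still needs an answer from someone who knows which state the lock is meant to protect.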

 

 


was (Author: yaofan):
I hit the same problem: the processCommandFromActor method holds the write lock while it processes a command.

!image-2023-01-29-17-18-21-014.png|width=532,height=256!

When a large number of files are deleted, deleting a batch of blocks can be 
slow because of the OS or other factors. While that batch is being processed, 
the held lock blocks updateActorStatesFromHeartbeat and incremental block 
reports (IBRs).

!image-2023-01-29-17-27-15-076.png|width=790,height=72!

!image-2023-01-29-17-23-42-979.png|width=786,height=181!

!image-2023-01-29-17-20-59-350.png|width=492,height=308!

 

Is it necessary for processCommandFromActor to hold the lock?

 

 

> BPOfferService lock may stall all service actors
> ------------------------------------------------
>
>                 Key: HDFS-12639
>                 URL: https://issues.apache.org/jira/browse/HDFS-12639
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Assignee: Hanisha Koneru
>            Priority: Major
>
> {{BPOfferService}} manages {{BPServiceActor}} instances for the active and 
> standby.  It uses a RW lock to primarily protect registration information 
> while determining the active/standby from heartbeats.
> Unfortunately the write lock is held during command processing.  If an actor 
> is experiencing high latency processing commands, the other actor will 
> neither be able to register (blocked in createRegistration, setNamespaceInfo, 
> verifyAndSetNamespaceInfo) nor process heartbeats (blocked in 
> updateActorStatesFromHeartbeat).
> The worst case scenario for processing commands while holding the lock is 
> re-registration.  The actor will loop, catching and logging exceptions, 
> leaving the other actor blocked for a non-deterministic (possibly infinite) 
> amount of time.
> The lock must not be held during command processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

