[ 
https://issues.apache.org/jira/browse/HDFS-16600?focusedWorklogId=780461&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-780461
 ]

ASF GitHub Bot logged work on HDFS-16600:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 11/Jun/22 01:23
            Start Date: 11/Jun/22 01:23
    Worklog Time Spent: 10m 
      Work Description: slfan1989 commented on PR #4367:
URL: https://github.com/apache/hadoop/pull/4367#issuecomment-1152828770

   @Hexiaoqiao @ZanderXu @tomscut 
   
   I still have some doubts about this.
   
   1. I still hope @ZanderXu can provide the exception stack trace for the
   deadlock; in the meantime I will keep trying to reproduce the problem on
   my side.
   
   2. I read the code of testSynchronousEviction carefully. It uses the
   special storage policy LAZY_PERSIST, which asynchronously flushes memory
   blocks to disk; LazyWriter takes care of this work.
   Part of the code is as follows:
   ```
   private boolean saveNextReplica() {
     RamDiskReplica block = null;
     FsVolumeReference targetReference;
     FsVolumeImpl targetVolume;
     ReplicaInfo replicaInfo;
     boolean succeeded = false;

     try {
       block = ramDiskReplicaTracker.dequeueNextReplicaToPersist();
       if (block != null) {
         try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl,
             block.getBlockPoolId())) {
           replicaInfo = volumeMap.get(block.getBlockPoolId(),
               block.getBlockId());
     .....
   ```
   If ZanderXu's judgment is correct, will this code also deadlock?
   
   3. I have always had a question: why do we acquire the block pool read
   lock first and then the volume write lock? How was this lock ordering
   derived? (A generic sketch of this ordering follows these questions.)
   
   4. I checked lockManager.writeLock(LockLevel.BLOCK_POOl,
   block.getBlockPoolId()) and found that the BLOCK_POOl write lock is also
   taken when adding a volume, so would that path deadlock as well?
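   
   To make the lock-ordering question in points 3 and 4 concrete, here is a
   minimal, generic sketch using plain java.util.concurrent locks. The class
   and field names are my own illustration, not FsDatasetImpl code: as long
   as every thread takes the coarse (block-pool-level) lock before the fine
   (volume-level) lock, the ordering by itself cannot form a wait cycle; a
   deadlock would require some thread to take the locks in the opposite
   order.
   ```
   import java.util.concurrent.locks.ReentrantReadWriteLock;

   // Illustrative only: a two-level, coarse-then-fine lock acquisition order.
   // Names are hypothetical and do not come from the Hadoop code base.
   public class TwoLevelLockOrder {
     // Analogous to the block-pool-level lock (coarse).
     private final ReentrantReadWriteLock blockPoolLock =
         new ReentrantReadWriteLock();
     // Analogous to a volume-level lock (fine).
     private final ReentrantReadWriteLock volumeLock =
         new ReentrantReadWriteLock();

     void lockedPath() {
       blockPoolLock.readLock().lock();   // coarse lock first, in read mode
       try {
         volumeLock.writeLock().lock();   // then the finer lock, in write mode
         try {
           // ... per-volume work while holding both locks ...
         } finally {
           volumeLock.writeLock().unlock();
         }
       } finally {
         blockPoolLock.readLock().unlock();
       }
     }
   }
   ```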
   
   > In conclusion
   
   I don't think this is a deadlock. Isn't it rather that createRbw held the
   read lock, so evictBlocks had to wait a long time for the write lock, the
   wait exceeded the timeout of the JUnit test, and the test eventually
   failed?
   
   I think that to solve this problem completely we also need to look at the
   processing logic of LazyWriter; it should not be enough to modify only
   evictBlocks.
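   
   The distinction matters because a long wait and a deadlock look the same
   from a test timeout. Below is a minimal, self-contained sketch
   (hypothetical names, plain java.util.concurrent locks, not the actual
   test) of the pattern I am describing: one thread holds the read lock for
   a long time, another thread gives up on the write lock after a bounded
   wait, and no wait cycle exists, so this is contention rather than a
   deadlock.
   ```
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.locks.ReentrantReadWriteLock;

   // Illustrative only: read/write lock contention that looks like a hang.
   public class ContentionNotDeadlock {
     public static void main(String[] args) throws InterruptedException {
       ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

       // "createRbw-like" thread: holds the read lock longer than the writer
       // (or a JUnit-style timeout) is willing to wait.
       Thread reader = new Thread(() -> {
         lock.readLock().lock();
         try {
           Thread.sleep(5_000); // long critical section under the read lock
         } catch (InterruptedException ignored) {
         } finally {
           lock.readLock().unlock();
         }
       });
       reader.start();
       Thread.sleep(100); // let the reader acquire the read lock first

       // "evictBlocks-like" caller: gives up after a bounded wait, just as a
       // test timeout fires while the write lock is still unavailable.
       if (lock.writeLock().tryLock(1, TimeUnit.SECONDS)) {
         lock.writeLock().unlock();
       } else {
         System.out.println("Write lock not acquired in time: contention, not deadlock.");
       }
       reader.join();
     }
   }
   ```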
   




Issue Time Tracking
-------------------

    Worklog Id:     (was: 780461)
    Time Spent: 3h 40m  (was: 3.5h)

> Deadlock on DataNode
> --------------------
>
>                 Key: HDFS-16600
>                 URL: https://issues.apache.org/jira/browse/HDFS-16600
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> The UT
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.testSynchronousEviction
> failed because of a deadlock, which was introduced by
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534].
> DeadLock:
> {code:java}
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.createRbw line 1588 needs a read lock
> try (AutoCloseableLock lock = lockManager.readLock(LockLevel.BLOCK_POOl,
>         b.getBlockPoolId()))
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.evictBlocks line 3526 needs a write lock
> try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl,
>         bpid))
> {code}


