[ 
https://issues.apache.org/jira/browse/HDFS-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696044#comment-13696044
 ] 

Uma Maheswara Rao G commented on HDFS-4851:
-------------------------------------------

Thanks for working on this, Andrew.
This can happen wherever we call stopWriter while holding the fsdataset lock 
and the old writer needs to acquire that same lock at that moment, right?
Since this only happens on recovery calls, your patch looks like a simple 
approach: fail the current recovery if the older DX cannot proceed because the 
current thread already holds the fsdataset lock. Another option: how about 
moving the stopWriter call into a separate method, where we only fetch the rbw 
under the lock and then interrupt the old writer without holding the lock? 
Only after this step would we call recoverRBW. (recoverRBW would then no longer 
need to stop the old writer, since that logic has moved to the separate call.)
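
A rough Java sketch of that second option, only to illustrate the ordering. 
Everything here is a hypothetical stand-in, not the real FsDatasetImpl or 
ReplicaInPipeline code: datasetLock plays the fsdataset lock, the writer field 
plays the replica's writer thread, and the method names are made up.

```java
public class StopWriterOutsideLock {
    // Hypothetical stand-in for the fsdataset lock.
    private final Object datasetLock = new Object();
    // Hypothetical stand-in for the writer thread recorded on the rbw replica.
    private volatile Thread writer;

    // Steps 1 and 2: read the writer under the lock, then interrupt and join
    // it WITHOUT holding the lock, so the old writer can still take the lock.
    void stopWriterOutsideLock() throws InterruptedException {
        Thread old;
        synchronized (datasetLock) {
            old = writer;
        }
        if (old != null && old != Thread.currentThread()) {
            old.interrupt();
            old.join();
        }
    }

    // Step 3: the recovery itself, which no longer stops the old writer.
    boolean recoverRbw() {
        synchronized (datasetLock) {
            writer = Thread.currentThread();
            return true;
        }
    }

    // Small demo: an "old writer" that keeps taking the lock until interrupted.
    static boolean demo() {
        StopWriterOutsideLock ds = new StopWriterOutsideLock();
        Thread old = new Thread(() -> {
            try {
                while (true) {
                    synchronized (ds.datasetLock) {
                        // e.g. work that needs the dataset lock
                    }
                    Thread.sleep(1); // throws once the interrupt flag is set
                }
            } catch (InterruptedException e) {
                // interrupted by the recovering thread; exit
            }
        });
        ds.writer = old;
        old.start();
        try {
            ds.stopWriterOutsideLock();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return ds.recoverRbw() && !old.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

One thing a real change would have to handle is that the replica state can 
change between the unlocked join and the later recoverRBW call, so recoverRBW 
would likely need to re-validate the replica once it holds the lock.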
                
> Deadlock in pipeline recovery
> -----------------------------
>
>                 Key: HDFS-4851
>                 URL: https://issues.apache.org/jira/browse/HDFS-4851
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0, 2.0.4-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-4851-1.patch
>
>
> Here's a deadlock scenario that cropped up during pipeline recovery, debugged 
> through jstacks. Todd tipped me off to this one.
> # Pipeline fails, client initiates recovery. We have the old leftover 
> DataXceiver, and a new one doing recovery.
> # New DataXceiver does {{recoverRbw}}, grabbing the {{FsDatasetImpl}} lock.
> # Old DataXceiver is in {{BlockReceiver#computePartialChunkCrc}}, calls 
> {{FsDatasetImpl#getTmpInputStreams}} and blocks on the {{FsDatasetImpl}} lock.
> # New DataXceiver calls {{ReplicaInPipeline#stopWriter}}, interrupting the 
> old DataXceiver and then joining on it.
> # Boom, deadlock. New DX holds the {{FsDatasetImpl}} lock and is joining on 
> the old DX, which is in turn waiting on the {{FsDatasetImpl}} lock.
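
The steps in the quoted description can be reproduced in miniature with the 
Java sketch below. It is an illustration under assumptions, not the datanode 
code: datasetLock is a hypothetical stand-in for the FsDatasetImpl lock, the 
two threads stand in for the old and new DataXceiver, and the join is timed 
so that the demo terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;

public class PipelineRecoveryDeadlockSketch {
    // Hypothetical stand-in for the FsDatasetImpl lock.
    private static final Object datasetLock = new Object();

    // Returns true if the old thread is still alive after being interrupted
    // and joined on (with a timeout) while the caller holds the lock.
    static boolean oldWriterStuck() {
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread oldDx = new Thread(() -> {
            try {
                lockHeld.await(); // wait until the "new DX" owns the lock
            } catch (InterruptedException e) {
                return;
            }
            // Entering a synchronized block does not respond to interrupts,
            // so the thread stays blocked here until the lock is released.
            synchronized (datasetLock) {
                // e.g. work that needs the dataset lock
            }
        });
        oldDx.setDaemon(true);
        oldDx.start();

        boolean stuck;
        try {
            synchronized (datasetLock) {   // "new DX" grabs the lock
                lockHeld.countDown();
                // Wait until the old thread is blocked on the monitor.
                while (oldDx.getState() != Thread.State.BLOCKED) {
                    Thread.sleep(1);
                }
                oldDx.interrupt();         // stopWriter: interrupt ...
                oldDx.join(200);           // ... then join, here with a timeout
                stuck = oldDx.isAlive();   // an untimed join would never return
            }
            oldDx.join();                  // finishes once the lock is released
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return stuck;
    }

    public static void main(String[] args) {
        System.out.println(oldWriterStuck());
    }
}
```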

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
