[
https://issues.apache.org/jira/browse/HDFS-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Wang updated HDFS-4851:
------------------------------
Attachment: hdfs-4851-1.patch
I realized that HDFS-3655 is actually addressing the same issue and tried to
revive that approach, but it ended up being super complicated to check and
recheck the preconditions on lock acquisition.
Attached here instead is a simpler strategy: abort recovery if we end up
waiting too long on this lock. While not optimal, it should be safe since the
client can retry recovery.
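To illustrate the idea only (this is not the attached patch), here is a minimal Java sketch of a bounded wait on the writer thread. The class name {{StopWriterSketch}}, the {{timeoutMs}} parameter, and the error messages are assumptions made for this example; the real change lives in the datanode code and may differ.

```java
import java.io.IOException;

/** Hypothetical sketch: stop a writer thread, but give up after a timeout. */
public class StopWriterSketch {

    /**
     * Interrupts the writer and waits at most timeoutMs for it to exit.
     * If the writer is still alive afterwards, throws IOException so the
     * caller can abort recovery instead of blocking forever.
     * timeoutMs must be positive: Thread.join(0) waits indefinitely.
     */
    static void stopWriter(Thread writer, long timeoutMs) throws IOException {
        if (timeoutMs <= 0) {
            throw new IllegalArgumentException("timeoutMs must be positive");
        }
        if (writer == null || writer == Thread.currentThread() || !writer.isAlive()) {
            return; // nothing to stop
        }
        writer.interrupt();
        try {
            writer.join(timeoutMs); // bounded wait instead of an unbounded join()
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IOException("Interrupted while waiting for writer to stop", e);
        }
        if (writer.isAlive()) {
            // The writer may be blocked on a lock that the caller holds.
            throw new IOException("Writer did not stop within " + timeoutMs
                + " ms; aborting recovery so the client can retry");
        }
    }
}
```

The caller is expected to release whatever lock it holds when the exception propagates, which lets the blocked writer make progress before the client retries.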
Since this is very hard to unit test, I tested by adding a loop that grabs the
lock repeatedly in {{receivePacket}}, verified that this caused the deadlock,
and then applied the patch and verified that the error message was printed.
HDFS-3655 can be where we properly fix this issue, or more broadly re-examine
finer-grained locking during recovery.
> Deadlock in pipeline recovery
> -----------------------------
>
> Key: HDFS-4851
> URL: https://issues.apache.org/jira/browse/HDFS-4851
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 3.0.0, 2.0.4-alpha
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Attachments: hdfs-4851-1.patch
>
>
> Here's a deadlock scenario that cropped up during pipeline recovery, debugged
> through jstacks. Todd tipped me off to this one.
> # Pipeline fails, client initiates recovery. We have the old leftover
> DataXceiver, and a new one doing recovery.
> # New DataXceiver does {{recoverRbw}}, grabbing the {{FsDatasetImpl}} lock.
> # Old DataXceiver is in {{BlockReceiver#computePartialChunkCrc}}, calls
> {{FsDatasetImpl#getTmpInputStreams}} and blocks on the {{FsDatasetImpl}} lock.
> # New DataXceiver calls {{ReplicaInPipeline#stopWriter}}, interrupting the old
> DataXceiver and then joining on it.
> # Boom, deadlock. New DX holds the {{FsDatasetImpl}} lock and is joining on
> the old DX, which is in turn waiting on the {{FsDatasetImpl}} lock.
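The shape of the scenario quoted above can be reproduced outside HDFS with a small Java sketch. Everything here is a stand-in: {{datasetLock}} plays the role of the {{FsDatasetImpl}} lock, the thread named "old-DataXceiver" plays the old writer, and the class and method names are hypothetical. The join is timed only so that the demonstration terminates; with an untimed {{join()}} the calling thread would hang.

```java
/** Hypothetical sketch of the lock-then-join pattern from the issue description. */
public class PipelineDeadlockSketch {

    /**
     * Holds a lock, starts a thread that needs the same lock, interrupts it,
     * and waits up to waitMs for it to exit. Returns the state of that thread
     * as observed while the lock is still held.
     */
    static Thread.State observeOldWriterState(long waitMs) {
        final Object datasetLock = new Object(); // stand-in for the FsDatasetImpl lock

        Thread oldXceiver = new Thread(() -> {
            // Stand-in for getTmpInputStreams: needs the dataset lock.
            synchronized (datasetLock) {
                // nothing to do once the lock is finally acquired
            }
        }, "old-DataXceiver");
        oldXceiver.setDaemon(true);

        synchronized (datasetLock) {        // stand-in for recoverRbw taking the lock
            oldXceiver.start();
            oldXceiver.interrupt();         // an interrupt does not wake a thread
                                            // waiting to enter a synchronized block
            try {
                oldXceiver.join(waitMs);    // Thread.join does not release datasetLock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return oldXceiver.getState();   // sampled before the lock is released
        }
    }

    public static void main(String[] args) {
        System.out.println("old writer state: " + observeOldWriterState(500));
    }
}
```

Given enough time for the second thread to reach the synchronized block, the observed state should be {{BLOCKED}}: interrupting the thread does not help because it is waiting for a monitor, and the only thread that could release that monitor is the one waiting in the join.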