Andrew Wang created HDFS-4851:
---------------------------------

             Summary: Deadlock in pipeline recovery
                 Key: HDFS-4851
                 URL: https://issues.apache.org/jira/browse/HDFS-4851
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.0.4-alpha, 3.0.0
            Reporter: Andrew Wang
            Assignee: Andrew Wang

Here's a deadlock scenario that cropped up during pipeline recovery, debugged through jstacks. Todd tipped me off to this one.

# Pipeline fails, client initiates recovery. We have the old leftover DataXceiver, and a new one doing recovery.
# New DataXceiver does {{recoverRbw}}, grabbing the {{FsDatasetImpl}} lock.
# Old DataXceiver is in {{BlockReceiver#computePartialChunkCrc}}, calls {{FsDatasetImpl#getTmpInputStreams}} and blocks on the {{FsDatasetImpl}} lock.
# New DataXceiver calls {{ReplicaInPipeline#stopWriter}}, interrupting the old DataXceiver and then joining on it.
# Boom, deadlock. New DX holds the {{FsDatasetImpl}} lock and is joining on the old DX, which is in turn waiting on the {{FsDatasetImpl}} lock.
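
The lock-then-join pattern in the steps above can be reproduced in isolation. What follows is a minimal sketch and not the Hadoop code: {{DATASET_LOCK}}, {{oldWriterStaysBlocked}} and the thread name are hypothetical stand-ins, and it assumes the {{FsDatasetImpl}} lock behaves like a Java intrinsic monitor, where {{Thread#interrupt}} does not wake a thread that is blocked waiting to enter a {{synchronized}} block. The join is bounded here only so that the demo terminates; an unbounded join while holding the lock is what would hang.

```java
import java.util.concurrent.CountDownLatch;

class PipelineRecoveryDeadlockSketch {
    // Hypothetical stand-in for the FsDatasetImpl lock (assumed to be a plain monitor).
    private static final Object DATASET_LOCK = new Object();

    /** Returns true if the old writer is still alive after interrupt + bounded join. */
    static boolean oldWriterStaysBlocked(long joinTimeoutMs) {
        CountDownLatch lockHeld = new CountDownLatch(1);

        // Plays the old DataXceiver: it needs the dataset lock to make progress.
        Thread oldWriter = new Thread(() -> {
            try {
                lockHeld.await(); // wait until the recovery thread owns the lock
            } catch (InterruptedException e) {
                return;
            }
            synchronized (DATASET_LOCK) {
                // Stand-in for the work done under the lock; nothing to do here.
            }
        }, "old-DataXceiver");
        oldWriter.setDaemon(true);
        oldWriter.start();

        try {
            boolean stuck;
            // Plays the new DataXceiver: take the lock, then stop the old writer.
            synchronized (DATASET_LOCK) {
                lockHeld.countDown();
                // Wait until the old writer is blocked on the monitor we hold.
                while (oldWriter.getState() != Thread.State.BLOCKED) {
                    Thread.sleep(1);
                }
                oldWriter.interrupt();          // does not unblock monitor entry
                oldWriter.join(joinTimeoutMs);  // bounded only so the demo ends
                stuck = oldWriter.isAlive();
            }
            // Once the lock is released the old writer can finish.
            oldWriter.join();
            return stuck;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo thread interrupted", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("old writer still blocked after interrupt + join: "
                + oldWriterStaysBlocked(200));
    }
}
```

In this sketch the old writer only exits after the lock is released, which suggests two possible directions for a fix: avoid joining on the writer while holding the lock, or keep the writer from needing the lock on that code path. Which one suits the datanode is not something this sketch decides.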