[
https://issues.apache.org/jira/browse/HDFS-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714831#comment-13714831
]
Suresh Srinivas commented on HDFS-5016:
---------------------------------------
Based on the thread dump, the following code path causes the issue (this code
corresponds to current branch-2.1.0-beta):
# Block is being recovered, which interrupts the current writer thread
(receiving the block) at FsDatasetImpl.recoverRbw(FsDatasetImpl.java:738).
#* This holds the FsDatasetImpl lock and calls writer.join() at
ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:157).
# Writer thread is interrupted. It in turn interrupts the responder thread and
calls join on the responder at
BlockReceiver.receiveBlock(BlockReceiver.java:709)
# Responder thread is stuck doing flush on the socket to write response to the
node that has been firewalled.
#* Flush cannot be interrupted.
#* We cannot enable socket write timeouts (in Java, only socket read timeouts
can be set).
To summarize: the responder thread is stuck in the flush call, the writer
thread is stuck in join() on the responder thread, and FsDatasetImpl.recoverRbw
holds the FsDatasetImpl lock while stuck in join() on the writer thread. Since
that lock is crucial for the datanode, the heartbeat thread and the data
transceiver threads are all blocked waiting on it.
Here is a simple patch that adds timeouts to the join call. Devaraj, can you
see if this fixes the issue you are seeing?
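To illustrate the idea of a bounded join (this is a minimal sketch, not the
contents of HDFS-5016.patch), the snippet below interrupts a thread and waits
for it with a timeout while holding a lock. The names BoundedJoinSketch,
DATASET_LOCK, stopWriter and stuckThread are hypothetical, and the stuck
thread is only a stand-in for a responder blocked in an uninterruptible flush.

```java
import java.util.concurrent.CountDownLatch;

public class BoundedJoinSketch {
    // Hypothetical stand-in for the dataset lock discussed above.
    private static final Object DATASET_LOCK = new Object();

    /**
     * Interrupts the writer and waits at most timeoutMs for it to exit.
     * Returns true if the thread terminated within the timeout.
     * timeoutMs must be positive: Thread.join(0) waits forever.
     */
    static boolean stopWriter(Thread writer, long timeoutMs) throws InterruptedException {
        synchronized (DATASET_LOCK) {
            writer.interrupt();
            writer.join(timeoutMs); // bounded wait, so the lock is released eventually
            return !writer.isAlive();
        }
    }

    /** A thread that ignores interrupts until 'release' is counted down. */
    static Thread stuckThread(CountDownLatch release) {
        Thread t = new Thread(() -> {
            while (true) {
                try {
                    release.await();
                    return;
                } catch (InterruptedException e) {
                    // Swallow the interrupt and keep waiting, imitating a
                    // call that cannot be interrupted.
                }
            }
        });
        t.setDaemon(true);
        return t;
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch release = new CountDownLatch(1);
        Thread writer = stuckThread(release);
        writer.start();

        boolean stopped = stopWriter(writer, 200);
        System.out.println("writer stopped within timeout: " + stopped);

        release.countDown(); // let the stuck thread finish
        writer.join();
    }
}
```

With an unbounded join() the caller would hold the lock for as long as the
thread stays stuck; with the timeout the caller gets control back and has to
decide what to do about a thread that is still alive.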
> Heartbeating thread blocks under some failure conditions leading to loss of
> datanodes
> -------------------------------------------------------------------------------------
>
> Key: HDFS-5016
> URL: https://issues.apache.org/jira/browse/HDFS-5016
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Devaraj Das
> Assignee: Suresh Srinivas
> Priority: Blocker
> Fix For: 2.1.0-beta
>
> Attachments: HDFS-5016.patch, jstack1.txt
>
>
> In the testing of some failure scenarios for HBase MTTR, we have been
> simulating node failures via firewalling of nodes (where all communication
> ports would be firewalled except ssh's port). We have noticed that when a
> (data)node is firewalled, we lose certain other datanodes - those that were
> involved in some communication with the firewalled node before the latter was
> firewalled. Will attach jstack output from one of the lost datanodes. The
> heartbeating thread seems to be locked up.
> This jira is to track a fix for the problem.