[ https://issues.apache.org/jira/browse/HDFS-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747738#comment-13747738 ]
Daryn Sharp commented on HDFS-5124:
-----------------------------------
I think blocking the clients on the FSN (FSNamesystem) lock during the standby
transition, instead of throwing an exception immediately, may cause you more problems.
The RPC server only has a limited number of socket readers (the default, I think,
is 5). Once those 5 jam up waiting on the FSN lock, the listener thread is going
to keep accepting sockets as fast as it can, at least until it hits an OOM, at which
point it tries to forcibly close sockets and then sleeps for 1 minute. Hopefully the
OOM doesn't cripple some other subsystem in the meantime.
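A toy Python model of the failure mode described above may help. It is a sketch under loud assumptions, not Hadoop's implementation: a `threading.Lock` stands in for the FSN lock held through a long transition, a `ThreadPoolExecutor` with 5 workers stands in for the socket readers, and its unbounded work queue plays the role of the listener that keeps accepting connections. `handle_call` and `simulate` are hypothetical names, and the `"rejected"` return value stands in for throwing an exception immediately.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def handle_call(fsn_lock, fail_fast):
    """Hypothetical handler for a call that needs the namesystem lock."""
    if fail_fast:
        # Fail fast: if the lock is busy, reject at once so the reader
        # thread is freed and the client can retry elsewhere.
        if not fsn_lock.acquire(blocking=False):
            return "rejected"
    else:
        # Block: the reader thread parks here and serves nobody else.
        fsn_lock.acquire()
    try:
        return "ok"
    finally:
        fsn_lock.release()


def simulate(fail_fast, readers=5, calls=20, hold_secs=0.5):
    """Return (calls answered while the lock was held, final results)."""
    fsn_lock = threading.Lock()
    fsn_lock.acquire()  # a long transition holds the lock
    # Fixed reader pool; its internal queue is unbounded, like the
    # listener that keeps accepting sockets.
    pool = ThreadPoolExecutor(max_workers=readers)
    futures = [pool.submit(handle_call, fsn_lock, fail_fast)
               for _ in range(calls)]
    wait(futures, timeout=hold_secs)  # returns early if every call finishes
    answered_while_held = sum(f.done() for f in futures)
    fsn_lock.release()  # transition finishes
    wait(futures)
    pool.shutdown()
    return answered_while_held, [f.result() for f in futures]
```

With `fail_fast=False`, no call is answered while the lock is held: the 5 readers are parked on the lock and the remaining calls just pile up in the queue. With `fail_fast=True`, every call gets an immediate answer and the readers stay free.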
Meanwhile, clients will have their connections accepted but get no response from
the server. One saving grace is that the clients don't create their ping stream
until after SASL auth has completed, so they are going to abort the connection
after a read timeout while waiting for a response to the connection header.
Normally the ping stream would keep the connection alive.
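The client side can be sketched the same way. This hedged Python sketch assumes nothing about Hadoop's actual client code: a listening socket that never calls `accept()` or replies plays the stalled NameNode (the kernel still completes the TCP handshake, so the connection looks accepted), the bytes sent are a placeholder rather than the real connection header, and `send_header_and_wait` and `demo` are hypothetical names. Because no ping stream exists yet, nothing keeps the connection alive, and the client gives up after its read timeout.

```python
import socket


def send_header_and_wait(addr, read_timeout=0.3):
    """Connect, send a placeholder header, and wait for any reply."""
    with socket.create_connection(addr, timeout=read_timeout) as conn:
        conn.sendall(b"placeholder-connection-header")
        try:
            reply = conn.recv(1)
        except socket.timeout:
            # No response to the header and no ping stream to keep the
            # connection alive: abort after the read timeout.
            return "aborted after read timeout"
        return "got reply" if reply else "closed by server"


def demo():
    # A server whose readers never get to this connection: it listens
    # (so the handshake completes) but never accepts or answers.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))
        server.listen()
        return send_header_and_wait(server.getsockname())
```

Calling `demo()` should return `"aborted after read timeout"` after roughly the 0.3 second read timeout.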
I've tested this patch and I no longer have deadlocks.
> Namenode in secure cluster deadlocks
> ------------------------------------
>
> Key: HDFS-5124
> URL: https://issues.apache.org/jira/browse/HDFS-5124
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.1.1-beta
> Environment: Secure Hadoop 2 cluster
> Reporter: Deepesh Khandelwal
> Assignee: Jing Zhao
> Priority: Blocker
> Attachments: HADOOP-5124.patch, HDFS-5124.001.patch,
> HDFS-5124.002.patch, nn_jstack.out
>
>
> Namenode deadlocks after a while in use.