[
https://issues.apache.org/jira/browse/HDFS-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189767#comment-14189767
]
Marc Heide commented on HDFS-4176:
----------------------------------
So what became of this error?
I am pretty sure that we have observed exactly this problem on one of our test
clusters using the Cloudera 4.5 (hadoop 2.0.0-cdh4.5.0) release in Quorum-based
HA mode. For a test we intentionally destroyed one of the active NameNode's
disks using the Linux dd command (yeah, it's ugly, but so is life). The poor
thing got stuck in an IO operation trying to close a file. The blocked thread
held locks which in turn blocked a lot of other threads (e.g. threads for
incoming RPC calls). That had a fatal impact on the whole cluster, since
everything stopped working at once. HBase, HDFS and all commands did not work
and either came back with a timeout or simply hung forever. Unfortunately the
liveness checks from the ZKFC seemed to work just fine, so the ZKFC did not
detect any failure and hence did not trigger a failover.
So we tried to stop it manually. After doing a kill -2 and then a kill -9 on
the NameNode process, the ZKFC finally detected the error and tried to activate
the standby NameNode on another machine. But this got stuck too. I have
attached the pstack of this NameNode process as it tried to become active but
never made it. As far as I can see, it is not able to stop the
EditLogTailerThread.
The root cause is probably that the formerly active NameNode was not really
dead. After searching around for some time we found that it had left a zombie
(defunct) process running, which still held port 8020 open! You cannot kill
such zombies in Linux without a reboot. So this is exactly the situation
described here: the former NN was frozen but not really dead, and the standby
could not go active.
Another sad story is that even restarting this standby NameNode did not help.
It became active, that's fine. But as long as the zombie was still running and
kept its port 8020 open, all clients got stuck, so HBase did not start
properly, nor could we access HDFS with the dfs client commands. Only once we
rebooted the former NN's machine did the cluster start up properly. But this is
probably not part of this Jira. So working with interruptible RPC calls and
using a timeout everywhere seems to be vital.
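To illustrate the general pattern (this is only a sketch, not the actual
EditLogTailer code; the proxy interface and the method name rollEditLog() here
are placeholders), a blocking RPC can be bounded by running it on a separate
thread and waiting with a timeout:
{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedRpcSketch {

  /** Placeholder for the proxy the tailer would use to reach the active NN. */
  interface ActiveNamenodeProxy {
    void rollEditLog() throws Exception;
  }

  private static final ExecutorService RPC_EXECUTOR =
      Executors.newSingleThreadExecutor();

  /**
   * Run the potentially hanging RPC on a worker thread and wait at most
   * timeoutMs for it to return. If the active NN is frozen, the caller gets
   * a TimeoutException instead of blocking forever and can carry on with
   * the failover.
   */
  static void rollEditsWithTimeout(ActiveNamenodeProxy proxy, long timeoutMs)
      throws Exception {
    Future<Void> call = RPC_EXECUTOR.submit((Callable<Void>) () -> {
      proxy.rollEditLog();
      return null;
    });
    try {
      call.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException te) {
      call.cancel(true); // best effort: interrupt the stuck call
      throw te;
    }
  }
}
{code}
The other half, as the issue description notes, is making the RPC itself
interruptible (HADOOP-6762), so that cancel(true) actually unblocks the worker
thread rather than leaving it stuck as well.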
> EditLogTailer should call rollEdits with a timeout
> --------------------------------------------------
>
> Key: HDFS-4176
> URL: https://issues.apache.org/jira/browse/HDFS-4176
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha, namenode
> Affects Versions: 3.0.0, 2.0.2-alpha
> Reporter: Todd Lipcon
>
> When the EditLogTailer thread calls rollEdits() on the active NN via RPC, it
> currently does so without a timeout. So, if the active NN has frozen (but not
> actually crashed), this call can hang forever. This can then potentially
> prevent the standby from becoming active.
> This may actually be considered a side effect of HADOOP-6762 -- if the RPC were
> interruptible, that would also fix the issue.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)