[ 
https://issues.apache.org/jira/browse/HDFS-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189767#comment-14189767
 ] 

Marc Heide commented on HDFS-4176:
----------------------------------

So what became of this error?

I am pretty sure that we have observed exactly this problem on one of our test 
clusters using the Cloudera 4.5 (hadoop 2.0.0-cdh4.5.0) release in Quorum Based 
HA mode. For a test we intentionally destroyed one of the active NameNode's 
disks using the Linux dd command (yeah, it's ugly, but so is life). The poor 
thing got stuck in an IO operation trying to close a file. The blocked thread 
held locks which in turn blocked a lot of other threads (e.g. threads handling 
incoming RPC calls). That had a fatal impact on the whole cluster, since 
everything stopped working at once: HBase, HDFS and all commands either came 
back with a timeout or simply hung forever. Unfortunately, the liveness checks 
from the ZKFC seemed to work just fine, so the ZKFC did not detect a failure 
and hence did not trigger a failover.

So we tried to stop it manually. After a kill -2 and then a kill -9 on the 
NameNode process, the ZKFC finally detected the error and tried to activate the 
standby NameNode on another machine. But this got stuck too. I have attached 
the pstack of that NameNode process as it tried to become active but never made 
it. As far as I can see, it was not able to stop the EditLogTailerThread.

The root cause is probably that the formerly active NameNode was not really 
dead. After searching around for some time we found that it had left a zombie 
(defunct process) running, which kept port 8020 open! You cannot kill such 
zombies in Linux without a reboot. So this is exactly the situation described 
here: the former NN was frozen but not really dead, and the standby could not 
go active.

Another sad story is that even restarting this standby NameNode did not fully 
solve things. It became active, that's fine. But as long as the zombie was 
running and kept its port 8020 open, all clients got stuck, so neither did 
HBase start properly, nor could we access HDFS with the dfs client commands. 
Only once we rebooted the former NN's machine did the cluster start up 
properly. But this is probably not part of this Jira. So working with 
interruptible RPC calls and using a timeout everywhere seems to be vital; a 
rough illustration of that pattern is sketched below.
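To make the point concrete, here is a minimal Java sketch of the general 
"bounded RPC" pattern, not the actual HDFS fix: the class name, 
callWithTimeout helper, the frozenCall stand-in and the 2-second timeout are 
all made up for illustration. The idea is simply to run the potentially 
blocking remote call on another thread and bound the wait, so a frozen peer 
cannot hang the caller forever.

{code:java}
import java.util.concurrent.*;

/**
 * Sketch only: run a potentially blocking remote call in a separate thread
 * and bound the wait with a timeout, so a frozen remote peer cannot hang
 * the caller forever.
 */
public class BoundedRpcCall {

  private static final ExecutorService RPC_EXECUTOR =
      Executors.newSingleThreadExecutor();

  /** Invokes the given call, giving up after timeoutMs milliseconds. */
  public static <T> T callWithTimeout(Callable<T> call, long timeoutMs)
      throws Exception {
    Future<T> future = RPC_EXECUTOR.submit(call);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Interrupt the worker thread; whether the underlying RPC actually
      // unblocks depends on it being interruptible (cf. HADOOP-6762).
      future.cancel(true);
      throw e;
    }
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical stand-in for a call like activeNodeProxy.rollEdits():
    Callable<String> frozenCall = () -> {
      Thread.sleep(60_000); // simulate a peer that never responds
      return "rolled";
    };
    try {
      callWithTimeout(frozenCall, 2_000);
    } catch (TimeoutException e) {
      System.out.println("rollEdits-style call timed out; caller can move on");
    }
    RPC_EXECUTOR.shutdownNow();
  }
}
{code}

With something like this around the tailer's rollEdits call (and around other 
cross-node calls), a frozen-but-not-dead NN would cost the standby a bounded 
amount of time instead of blocking it indefinitely.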

> EditLogTailer should call rollEdits with a timeout
> --------------------------------------------------
>
>                 Key: HDFS-4176
>                 URL: https://issues.apache.org/jira/browse/HDFS-4176
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha, namenode
>    Affects Versions: 3.0.0, 2.0.2-alpha
>            Reporter: Todd Lipcon
>
> When the EditLogTailer thread calls rollEdits() on the active NN via RPC, it 
> currently does so without a timeout. So, if the active NN has frozen (but not 
> actually crashed), this call can hang forever. This can then potentially 
> prevent the standby from becoming active.
> This may actually be considered a side effect of HADOOP-6762 -- if the RPC 
> were interruptible, that would also fix the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
