Tianyin Xu created HADOOP-11328:
-----------------------------------
Summary: ZKFailoverController.java does not log Exception and
causes latent problems during failover
Key: HADOOP-11328
URL: https://issues.apache.org/jira/browse/HADOOP-11328
Project: Hadoop Common
Issue Type: Bug
Components: ha
Affects Versions: 2.5.1
Reporter: Tianyin Xu
In _ZKFailoverController.java_, the _Exception_ caught by the _run()_ method
is never logged. This causes latent problems that only manifest during
failover.
h5. The problem we encountered
An _Exception_ is thrown from the _doRun()_ method during _initHM()_ (caused by
a configuration error). To reproduce, set
"_ha.health-monitor.connect-retry-interval.ms_" to any nonsensical value.
{code:title=ZKFailoverController.java|borderStyle=solid}
private int doRun(String[] args)
  ...
  initRPC();
  initHM();
  startRPC();
  ....
}
{code}
The Exception is caught in the _run()_ method, as follows:
{code:title=ZKFailoverController.java|borderStyle=solid}
public int run(final String[] args) throws Exception {
  ...
  try {
    ...
      @Override
      public Integer run() {
        try {
          return doRun(args);
        } catch (Exception t) {
          throw new RuntimeException(t);
        } finally {
          if (elector != null) {
            elector.terminateConnection();
          }
        }
      }
    });
  } catch (RuntimeException rte) {
    throw (Exception)rte.getCause();
  }
}
{code}
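The wrap-and-unwrap pattern above can be reproduced outside Hadoop. The following is only a minimal sketch, not the ZKFC code: the class name _WrapUnwrapSketch_, the helper _runAction()_ (standing in for the elided call that takes the anonymous class), and the failing _doRun()_ body are all hypothetical. It shows that the original exception reaches the caller intact, but nothing in _run()_ itself records it.

```java
import java.util.function.Supplier;

public class WrapUnwrapSketch {
    // Hypothetical stand-in for doRun(): fails the way a bad configuration value might.
    static int doRun(String[] args) throws Exception {
        throw new IllegalArgumentException("bad config value");
    }

    // Hypothetical stand-in for the helper elided in the issue; it just runs the action.
    static int runAction(Supplier<Integer> action) {
        return action.get();
    }

    static int run(final String[] args) throws Exception {
        try {
            return runAction(() -> {
                try {
                    return doRun(args);
                } catch (Exception e) {
                    // Checked exceptions cannot cross the Supplier boundary, so wrap them.
                    throw new RuntimeException(e);
                }
            });
        } catch (RuntimeException rte) {
            // Unwrap and rethrow the original exception; note that nothing is logged here.
            throw (Exception) rte.getCause();
        }
    }

    public static void main(String[] args) {
        try {
            run(args);
        } catch (Exception e) {
            // The exception becomes visible only if the caller chooses to report it.
            System.out.println("caller saw: " + e);
        }
    }
}
```

In this sketch the failure is visible only because _main()_ prints it. If the caller does not report it either, the failure leaves no trace in the log.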
Unfortunately, the Exception (causing the shutdown of the process) is *not
logged at all*. This causes latent errors that only manifest during
failover (because ZKFC is dead). The tricky thing here is that everything looks
perfectly fine: the _jps_ command shows a running DFSZKFailoverController
process and the two NameNodes (active and standby) work fine.
h5. Patch
We strongly suggest adding an error log for the caught exception, for example:
--- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (revision 1641307)
+++ hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (working copy)
{code:title=@@ -178,6 +178,7 @@|borderStyle=solid}
       }
     });
   } catch (RuntimeException rte) {
+    LOG.fatal("The failover controller encounters runtime error: " + rte);
     throw (Exception)rte.getCause();
   }
 }
{code}
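A possible variation on the patch above is to pass the throwable to the logger as a separate argument, so the stack trace is recorded along with the message. The sketch below is self-contained and only illustrates the idea under stated assumptions: it uses _java.util.logging_ instead of Hadoop's logging facade, and the class name _LoggedRethrowSketch_ and the failing _doRun()_ body are hypothetical. With commons-logging, the corresponding call would be the two-argument _LOG.fatal(Object, Throwable)_.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggedRethrowSketch {
    static final Logger LOG = Logger.getLogger("LoggedRethrowSketch");

    // Hypothetical stand-in for doRun() failing on a bad configuration value.
    static int doRun(String[] args) throws Exception {
        throw new IllegalArgumentException("bad config value");
    }

    static int run(final String[] args) throws Exception {
        Supplier<Integer> action = () -> {
            try {
                return doRun(args);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        };
        try {
            return action.get();
        } catch (RuntimeException rte) {
            // Fall back to the wrapper itself if it carries no cause.
            Throwable cause = (rte.getCause() != null) ? rte.getCause() : rte;
            // Log before rethrowing; passing the throwable keeps its stack trace.
            LOG.log(Level.SEVERE, "The failover controller encountered a runtime error", cause);
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw rte;
        }
    }

    public static void main(String[] args) {
        try {
            run(args);
        } catch (Exception e) {
            System.out.println("caller saw: " + e);
        }
    }
}
```

The null check on _getCause()_ is an extra precaution in this sketch, not part of the proposed patch. It avoids rethrowing _null_ if a _RuntimeException_ without a cause ever reaches the catch block.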
Thanks!