Liang Xie created HDFS-6617:
-------------------------------
Summary: Flake
TestDFSZKFailoverController.testManualFailoverWithDFSHAAdmin due to a long edit
log sync op
Key: HDFS-6617
URL: https://issues.apache.org/jira/browse/HDFS-6617
Project: Hadoop HDFS
Issue Type: Test
Components: auto-failover, test
Affects Versions: 3.0.0
Reporter: Liang Xie
Assignee: Liang Xie
Priority: Minor
Hit a false alarm testing while working at HDFS-6614, see
https://builds.apache.org/job/PreCommit-HDFS-Build/7259//testReport/org.apache.hadoop.hdfs.server.namenode.ha/TestDFSZKFailoverController/testManualFailoverWithDFSHAAdmin/
After a deep looking at the logging, shows failure came from a timeout at
ZKFailoverController.doCedeActive():
localTarget.getProxy(conf, timeout).transitionToStandby(createReqInfo());
While stopping active service, see FSNamesystem.stopActiveServices():
void stopActiveServices() {
LOG.info("Stopping services started for active state");
....
this is corelated with the log:
"2014-07-01 08:12:50,615 INFO namenode.FSNamesystem
(FSNamesystem.java:stopActiveServices(1167)) - Stopping services started for
active state"
then stopActiveServices will call editLog.close(), which goes to
endCurrentLogSegment(), see log:
2014-07-01 08:12:50,616 INFO namenode.FSEditLog
(FSEditLog.java:endCurrentLogSegment(1216)) - Ending log segment 1
but this operation did not done in 5 seconds, then triggered timeout:
2014-07-01 08:12:55,624 WARN ha.ZKFailoverController
(ZKFailoverController.java:doCedeActive(577)) - Unable to transition local node
to standby: Call From asf001.sp2.ygridcore.net/67.195.138.31 to localhost:10021
failed on socket timeout exception: java.net.SocketTimeoutException: 5000
millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/127.0.0.1:53965
remote=localhost/127.0.0.1:10021]; For more details see:
http://wiki.apache.org/hadoop/SocketTimeout
the logEdit/logSync finally done with printStatistics(true):
2014-07-01 08:13:05,243 INFO namenode.FSEditLog
(FSEditLog.java:printStatistics(675)) - Number of transactions: 2 Total time
for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of
syncs: 3 SyncTimes(ms): 14667 74 105
so obviously, this long sync contributed the timeout, maybe the QA box is slow
at that moment, so one possible fix here is set the default fence timeout to a
bigger value.
--
This message was sent by Atlassian JIRA
(v6.2#6252)