[ https://issues.apache.org/jira/browse/HDFS-6617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang Xie updated HDFS-6617:
----------------------------

    Description: 
Just hit a false test alarm while working on HDFS-6614, see 
https://builds.apache.org/job/PreCommit-HDFS-Build/7259//testReport/org.apache.hadoop.hdfs.server.namenode.ha/TestDFSZKFailoverController/testManualFailoverWithDFSHAAdmin/

After looking at the log, the failure appears to have come from a timeout in 
ZKFailoverController.doCedeActive():
localTarget.getProxy(conf, timeout).transitionToStandby(createReqInfo());
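
For context, a rough sketch of what surrounds that call in doCedeActive() (the 
getGracefulFenceTimeout() lookup is my assumption, but its 5000 ms default 
matches the exception further below):
  // ZKFailoverController.doCedeActive(), roughly:
  int timeout = FailoverController.getGracefulFenceTimeout(conf); // assumed lookup, default 5000 ms
  try {
    // this RPC blocks until the local NN finishes transitioning to standby,
    // so a slow edit log sync on the NN side pushes it past the timeout
    localTarget.getProxy(conf, timeout).transitionToStandby(createReqInfo());
  } catch (IOException e) {
    LOG.warn("Unable to transition local node to standby: " + e);
  }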

While stopping the active services, see FSNamesystem.stopActiveServices():
  void stopActiveServices() {
    LOG.info("Stopping services started for active state");
    ....
This correlates with the log:
"2014-07-01 08:12:50,615 INFO  namenode.FSNamesystem 
(FSNamesystem.java:stopActiveServices(1167)) - Stopping services started for 
active state"

Then stopActiveServices() calls editLog.close(), which goes into 
endCurrentLogSegment(); see the log:
2014-07-01 08:12:50,616 INFO  namenode.FSEditLog 
(FSEditLog.java:endCurrentLogSegment(1216)) - Ending log segment 1
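
Pieced together from the log lines in this report, the tail of 
endCurrentLogSegment() is roughly the following (a sketch, not the exact 
source; endSegmentOp stands in for the real end-of-segment op):
  LOG.info("Ending log segment " + curSegmentTxId);  // 08:12:50,616
  logEdit(endSegmentOp);   // write the end-of-segment transaction
  logSync();               // force it out to the journals; this is where the ~14.7 s sync below is spent
  printStatistics(true);   // 08:13:05,243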

But this operation did not finish within 5 seconds, which triggered the timeout:

2014-07-01 08:12:55,624 WARN  ha.ZKFailoverController 
(ZKFailoverController.java:doCedeActive(577)) - Unable to transition local node 
to standby: Call From asf001.sp2.ygridcore.net/67.195.138.31 to localhost:10021 
failed on socket timeout exception: java.net.SocketTimeoutException: 5000 
millis timeout while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/127.0.0.1:53965 
remote=localhost/127.0.0.1:10021]; For more details see:  
http://wiki.apache.org/hadoop/SocketTimeout

The logEdit/logSync finally completed, followed by printStatistics(true):
2014-07-01 08:13:05,243 INFO  namenode.FSEditLog 
(FSEditLog.java:printStatistics(675)) - Number of transactions: 2 Total time 
for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of 
syncs: 3 SyncTimes(ms): 14667 74 105 

So obviously, this long sync contributed to the timeout: the first SyncTimes 
entry above shows a single sync of 14667 ms, nearly three times the 5000 ms 
RPC timeout. Maybe the QA box was just very slow at that moment, so one 
possible fix here is setting the default fence timeout to a bigger value.
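
For illustration only, assuming the relevant knob is the graceful-fence RPC 
timeout (its 5000 ms default matches the exception above), a test-side bump 
could look like the following; the constant and the 30 second value are just 
an example, not necessarily what the attached patch does:
  // e.g. in the TestDFSZKFailoverController setup:
  conf.setInt(CommonConfigurationKeys.HA_FC_GRACEFUL_FENCE_TIMEOUT_KEY, 30 * 1000);
  // i.e. ha.failover-controller.graceful-fence.rpc-timeout.ms = 30000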


> Flake TestDFSZKFailoverController.testManualFailoverWithDFSHAAdmin due to a 
> long edit log sync op
> -------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6617
>                 URL: https://issues.apache.org/jira/browse/HDFS-6617
>             Project: Hadoop HDFS
>          Issue Type: Test
>          Components: auto-failover, test
>    Affects Versions: 3.0.0
>            Reporter: Liang Xie
>            Assignee: Liang Xie
>            Priority: Minor
>         Attachments: HDFS-6617.txt
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
