[jira] [Commented] (HADOOP-15684) triggerActiveLogRoll stuck on dead name node, when ConnectTimeoutException happens.

Rong Tang (JIRA) Tue, 11 Sep 2018 11:58:20 -0700


    [ 
https://issues.apache.org/jira/browse/HADOOP-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611087#comment-16611087
 ]


Rong Tang commented on HADOOP-15684:
------------------------------------

[~elgoiri]   This is a more common and simpler way to build MiniDFSCluster, 
many test classes use the later way.

see (search basePort) : 
[https://github.com/apache/hadoop/blob/c5bf43a8e8aec595d1a8133cb0656778b252de89/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestFailureToReadEdits.java]

[https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLogAutoroll.java]

[https://github.com/apache/hadoop/blob/c5bf43a8e8aec595d1a8133cb0656778b252de89/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/metrics/TestNameNodeMetrics.java]

 

It doesn't matter to use assigned ports ore serial ports, but the code with 
latter can be simpler.

 

> triggerActiveLogRoll stuck on dead name node, when ConnectTimeoutException 
> happens. 
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-15684
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15684
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Rong Tang
>            Assignee: Rong Tang
>            Priority: Critical
>         Attachments: 
> 0001-RollEditLog-try-next-NN-when-exception-happens.patch, 
> HADOOP-15684.000.patch, HADOOP-15684.001.patch, HADOOP-15684.002.patch, 
> HADOOP-15684.003.patch, HADOOP-15684.004.patch, 
> hadoop--rollingUpgrade-SourceMachine001.log
>
>
> When name node call triggerActiveLogRoll, and the cachedActiveProxy is a dead 
> name node, it will throws a ConnectTimeoutException, expected behavior is to 
> try next NN, but current logic doesn't do so, instead, it keeps trying the 
> dead, mistakenly take it as active.
>  
> 2018-08-17 10:02:12,001 WARN [Edit log tailer] 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a 
> roll of the active NN
> org.apache.hadoop.net.ConnectTimeoutException: Call From 
> SourceMachine001/SourceIP to001 TargetMachine001.ap.gbl:8020 failed on socket 
> timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 
> millis timeout 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:298)
>  
> C:\Users\rotang>ping TargetMachine001
> Pinging TargetMachine001[TargetIP001] with 32 bytes of data:
>  Request timed out.
>  Request timed out.
>  Request timed out.
>  Request timed out.
>  Attachment is a log file saying how it repeatedly retries a dead name node, 
> and a fix patch.
>  I replaced the actual machine name/ip as SourceMachine001/SourceIP001 and 
> TargetMachine001/TargetIP001.
>  
> How to Repro:
> In a good running NNs, take down the active NN (don't let it come back during 
> test), and then the stand by NNs will keep trying dead (old active) NN, 
> because it is the cached one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HADOOP-15684) triggerActiveLogRoll stuck on dead name node, when ConnectTimeoutException happens.

Reply via email to