[ 
https://issues.apache.org/jira/browse/HADOOP-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rong Tang updated HADOOP-15684:
-------------------------------
    Description: 
When a name node calls triggerActiveLogRoll and the cachedActiveProxy points 
to a dead name node, the call throws a ConnectTimeoutException. The expected 
behavior is to try the next NN, but the current logic does not do so. Instead, 
it keeps retrying the dead name node, mistakenly treating it as active.

 

2018-08-17 10:02:12,001 WARN [Edit log tailer] 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a 
roll of the active NN

org.apache.hadoop.net.ConnectTimeoutException: Call From 
SourceMachine001/SourceIP001 to TargetMachine001.ap.gbl:8020 failed on socket 
timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis 
timeout 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:298)

 

C:\Users\rotang>ping TargetMachine001

Pinging TargetMachine001[TargetIP001] with 32 bytes of data:
 Request timed out.
 Request timed out.
 Request timed out.
 Request timed out.

 The attachments are a log file showing how the dead name node is retried 
repeatedly, and a patch with a fix.

 I replaced the actual machine names/IPs with SourceMachine001/SourceIP001 and 
TargetMachine001/TargetIP001.

 

How to Repro:

With a set of healthy, running NNs, take down the active NN (do not let it come 
back during the test). The standby NNs will then keep trying the dead (old 
active) NN, because it is the cached one.
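
For illustration, the expected behavior described above (drop the cached proxy 
when a call fails and move on to the next NN) could look roughly like the 
sketch below. It is a minimal, self-contained example, not the actual 
EditLogTailer code and not the attached patch. The class RollTriggerSketch, 
the NameNodeProxy interface, and the maxRetries parameter are hypothetical 
names introduced only for this example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "try the next NN when the cached one fails". */
final class RollTriggerSketch {

    /** Stand-in for an RPC proxy to a single name node (assumed interface). */
    public interface NameNodeProxy {
        String address();
        void rollEditLog() throws IOException;
    }

    private final List<NameNodeProxy> candidates;
    private final int maxRetries;
    private int nextIndex = 0;                 // next candidate in round-robin order
    private NameNodeProxy cachedActiveProxy;   // null means "no known-good active NN"

    RollTriggerSketch(List<NameNodeProxy> candidates, int maxRetries) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("at least one name node is required");
        }
        this.candidates = new ArrayList<>(candidates);
        this.maxRetries = maxRetries;
    }

    /**
     * Asks the active NN to roll its edit log.
     * Returns the address of the NN that accepted the call,
     * or null if every attempt failed.
     */
    String triggerActiveLogRoll() {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (cachedActiveProxy == null) {
                // No usable cached proxy: pick the next candidate.
                cachedActiveProxy = candidates.get(nextIndex);
                nextIndex = (nextIndex + 1) % candidates.size();
            }
            try {
                cachedActiveProxy.rollEditLog();
                return cachedActiveProxy.address();
            } catch (IOException e) {
                // Covers connect timeouts too, since timeout exceptions
                // are IOExceptions. Forget the cached proxy so the next
                // attempt targets a different NN instead of the dead one.
                cachedActiveProxy = null;
            }
        }
        return null;
    }
}
```

The important step in this sketch is resetting cachedActiveProxy in the catch 
block. If the cached proxy is kept after a failure, every retry goes back to 
the same dead NN, which matches the behavior in the log above.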


> triggerActiveLogRoll stuck on dead name node, when ConnectTimeoutException 
> happens. 
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-15684
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15684
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Rong Tang
>            Priority: Critical
>         Attachments: 
> 0001-RollEditLog-try-next-NN-when-exception-happens.patch, 
> hadoop--rollingUpgrade-SourceMachine001.log
>
>
> When a name node calls triggerActiveLogRoll and the cachedActiveProxy points 
> to a dead name node, the call throws a ConnectTimeoutException. The expected 
> behavior is to try the next NN, but the current logic does not do so. 
> Instead, it keeps retrying the dead name node, mistakenly treating it as 
> active.
>  
> 2018-08-17 10:02:12,001 WARN [Edit log tailer] 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a 
> roll of the active NN
> org.apache.hadoop.net.ConnectTimeoutException: Call From 
> SourceMachine001/SourceIP001 to TargetMachine001.ap.gbl:8020 failed on socket 
> timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 
> millis timeout 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:298)
>  
> C:\Users\rotang>ping TargetMachine001
> Pinging TargetMachine001[TargetIP001] with 32 bytes of data:
>  Request timed out.
>  Request timed out.
>  Request timed out.
>  Request timed out.
>  The attachments are a log file showing how the dead name node is retried 
> repeatedly, and a patch with a fix.
>  I replaced the actual machine names/IPs with SourceMachine001/SourceIP001 
> and TargetMachine001/TargetIP001.
>  
> How to Repro:
> With a set of healthy, running NNs, take down the active NN (do not let it 
> come back during the test). The standby NNs will then keep trying the dead 
> (old active) NN, because it is the cached one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
