[ 
https://issues.apache.org/jira/browse/HADOOP-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rong Tang updated HADOOP-15684:
-------------------------------
    Description: 
When a name node calls triggerActiveLogRoll and the cachedActiveProxy points to a 
dead name node, the call throws a ConnectTimeoutException. The expected behavior 
is to try the next NN, but the current logic does not do so; instead, it keeps 
retrying the dead node, mistakenly treating it as active.
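
The expected behavior can be sketched roughly as follows. This is a minimal, standalone illustration and not the attached patch or the actual EditLogTailer code: the class RollFailoverSketch and the NameNodeProxy interface are hypothetical, and java.net.SocketTimeoutException stands in for Hadoop's ConnectTimeoutException so that the sketch needs no Hadoop dependency.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.Arrays;
import java.util.List;

public class RollFailoverSketch {

    /** Hypothetical stand-in for the RPC proxy to one name node. */
    interface NameNodeProxy {
        void rollEditLog() throws IOException;
    }

    private final List<NameNodeProxy> nameNodes;
    private int currentIndex = 0; // position of the cached "active" proxy

    RollFailoverSketch(List<NameNodeProxy> nameNodes) {
        this.nameNodes = nameNodes;
    }

    /**
     * Tries each name node at most once, starting with the cached one.
     * Returns the index of the node that accepted the roll, or -1 if all failed.
     */
    int triggerActiveLogRoll() {
        for (int attempt = 0; attempt < nameNodes.size(); attempt++) {
            try {
                nameNodes.get(currentIndex).rollEditLog();
                return currentIndex; // keep this proxy cached for the next call
            } catch (IOException e) {
                // Includes connect timeouts: drop the cached proxy and move on.
                currentIndex = (currentIndex + 1) % nameNodes.size();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Simulated dead node: every call times out.
        NameNodeProxy dead = () -> { throw new SocketTimeoutException("20000 millis timeout"); };
        // Simulated live node: the roll succeeds.
        NameNodeProxy live = () -> { };
        RollFailoverSketch tailer = new RollFailoverSketch(Arrays.asList(dead, live));
        System.out.println("roll accepted by NN index " + tailer.triggerActiveLogRoll());
    }
}
```

The point of the sketch is only that a failure on the cached proxy advances to the next candidate instead of leaving the dead node cached; how the real code selects and caches proxies may differ.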

 

2018-08-17 10:02:12,001 WARN [Edit log tailer] 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a 
roll of the active NN

org.apache.hadoop.net.ConnectTimeoutException: Call From 
SourceMachine001/SourceIP001 to TargetMachine001.ap.gbl:8020 failed on socket 
timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis 
timeout 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:298)

 

C:\Users\rotang>ping TargetMachine001

Pinging TargetMachine001[TargetIP001] with 32 bytes of data:
 Request timed out.
 Request timed out.
 Request timed out.
 Request timed out.

 

Attached are a log file showing how it repeatedly retries the dead name node, 
and a patch with a fix.

I replaced the actual machine names/IPs with SourceMachine001/SourceIP001 and 
TargetMachine001/TargetIP001.

 

  was:
When a name node calls triggerActiveLogRoll and the cachedActiveProxy points to a 
dead name node, the call throws a ConnectTimeoutException. The expected behavior 
is to try the next NN, but the current logic does not do so; instead, it keeps 
retrying the dead node, mistakenly treating it as active.

 

2018-08-17 10:02:12,001 WARN [Edit log tailer] 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a 
roll of the active NN

org.apache.hadoop.net.ConnectTimeoutException: Call From SourceMachine/SourceIP 
to TargetMachine.ap.gbl:8020 failed on socket timeout exception: 
org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:298)

 

C:\Users\rotang>ping TargetMachine

Pinging TargetMachine[TargetIP] with 32 bytes of data:
 Request timed out.
 Request timed out.
 Request timed out.
 Request timed out.

 

Attached are a log file showing how it repeatedly retries the dead name node, 
and a patch with a fix.

 

 


> triggerActiveLogRoll stuck on dead name node, when ConnectTimeoutException 
> happens. 
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-15684
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15684
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Rong Tang
>            Priority: Critical
>         Attachments: 
> 0001-RollEditLog-try-next-NN-when-exception-happens.patch, 
> hadoop--rollingUpgrade-SourceMachine001.log
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
