[jira] [Comment Edited] (HADOOP-15684) triggerActiveLogRoll stuck on dead name node, when ConnectTimeoutException happens.

Rong Tang (JIRA) Wed, 22 Aug 2018 15:06:15 -0700


    [ 
https://issues.apache.org/jira/browse/HADOOP-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589398#comment-16589398
 ]


Rong Tang edited comment on HADOOP-15684 at 8/22/18 10:05 PM:
--------------------------------------------------------------

[~elgoiri] , thanks for your comments.
 * Why do we need to fix the ports? Not a big fan of the retry approach. 
 ** _IPC ports are used to determine whether to enable log rolling._
 ** _Removed the retry._
 * Add a couple high level comments on the invokedtimes approach.
 ** _Added_
 * You should wrap the whole thing in a try finally shutdown with a null check.
 ** _Moved the "cluster" related code into try-finally._
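
For reference, the try-finally shutdown with a null check from the last item could look roughly like the sketch below. It is not the code from the patch: FakeCluster, ClusterFactory, TestBody and runWithCluster are hypothetical names used only for illustration, with FakeCluster standing in for whatever test cluster the real test starts.

```java
/** Hypothetical sketch of the try/finally-with-null-check test pattern. */
class ClusterCleanupSketch {

    /** Hypothetical stand-in for the test cluster; not a Hadoop class. */
    static class FakeCluster {
        boolean running = true;

        void shutdown() {
            running = false;
        }
    }

    /** Hypothetical factory; it may throw before any cluster exists. */
    interface ClusterFactory {
        FakeCluster start() throws Exception;
    }

    /** Hypothetical test body that runs against the started cluster. */
    interface TestBody {
        void run(FakeCluster cluster) throws Exception;
    }

    static void runWithCluster(ClusterFactory factory, TestBody body) throws Exception {
        FakeCluster cluster = null;
        try {
            // Startup happens inside the try block so a failure here
            // still reaches the finally block.
            cluster = factory.start();
            body.run(cluster);
        } finally {
            // cluster is still null if startup failed, hence the null check.
            if (cluster != null) {
                cluster.shutdown();
            }
        }
    }
}
```

With this shape the cluster is shut down whether the test body passes or throws, and a startup failure does not turn into a NullPointerException in the cleanup path.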

 

Uploaded a new patch. 

Do you know if there is a more convenient way to do code review than using 
plain text?

 


> triggerActiveLogRoll stuck on dead name node, when ConnectTimeoutException 
> happens. 
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-15684
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15684
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Rong Tang
>            Assignee: Rong Tang
>            Priority: Critical
>         Attachments: 
> 0001-RollEditLog-try-next-NN-when-exception-happens.patch, 
> HADOOP-15684.000.patch, HADOOP-15684.001.patch, 
> hadoop--rollingUpgrade-SourceMachine001.log
>
>
> When a name node calls triggerActiveLogRoll and the cachedActiveProxy points to a dead 
> name node, the call throws a ConnectTimeoutException. The expected behavior is to 
> try the next NN, but the current logic does not do so; instead, it keeps retrying the 
> dead node, mistakenly treating it as the active one.
>  
> 2018-08-17 10:02:12,001 WARN [Edit log tailer] 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a 
> roll of the active NN
> org.apache.hadoop.net.ConnectTimeoutException: Call From 
> SourceMachine001/SourceIP001 to TargetMachine001.ap.gbl:8020 failed on socket 
> timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 
> millis timeout 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:298)
>  
> C:\Users\rotang>ping TargetMachine001
> Pinging TargetMachine001[TargetIP001] with 32 bytes of data:
>  Request timed out.
>  Request timed out.
>  Request timed out.
>  Request timed out.
>  The attachments are a log file showing how it repeatedly retries a dead name node, 
> and a patch with the fix.
>  I replaced the actual machine names/IPs with SourceMachine001/SourceIP001 and 
> TargetMachine001/TargetIP001.
>  
> How to Repro:
> With a set of healthy, running NNs, take down the active NN (don't let it come back during 
> the test). The standby NNs will then keep trying the dead (old active) NN, 
> because it is the cached one.
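
The expected behavior described in the quoted issue (move on to the next NN when the cached one times out) can be sketched as follows. This is a minimal illustration under stated assumptions, not the attached patch or the actual EditLogTailer code: NameNodeProxy, LogRollFailoverSketch and cachedIndex are hypothetical names, and java.net.SocketTimeoutException is used as a stand-in for Hadoop's ConnectTimeoutException so that the example has no Hadoop dependency.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch; not the actual EditLogTailer implementation. */
class LogRollFailoverSketch {

    /** Hypothetical stand-in for an RPC proxy to a single NameNode. */
    interface NameNodeProxy {
        void rollEditLog() throws IOException;
    }

    private final List<NameNodeProxy> nameNodes;

    // Index of the proxy currently believed to be active (the "cached" one).
    private int cachedIndex = 0;

    LogRollFailoverSketch(List<NameNodeProxy> nameNodes) {
        if (nameNodes.isEmpty()) {
            throw new IllegalArgumentException("at least one NameNode is required");
        }
        this.nameNodes = nameNodes;
    }

    int cachedIndex() {
        return cachedIndex;
    }

    /**
     * Tries each NameNode at most once, starting from the cached one.
     * Returns true if some NameNode accepted the roll request.
     */
    boolean triggerActiveLogRoll() {
        for (int attempt = 0; attempt < nameNodes.size(); attempt++) {
            try {
                nameNodes.get(cachedIndex).rollEditLog();
                return true;
            } catch (IOException e) {
                // Covers timeouts such as SocketTimeoutException: stop treating
                // this node as active and advance to the next candidate.
                cachedIndex = (cachedIndex + 1) % nameNodes.size();
            }
        }
        return false;
    }
}
```

The point of the sketch is only that a connect timeout advances the cached choice instead of leaving it on the dead node; how the real code picks or validates the next NN is not shown here.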



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
