[
https://issues.apache.org/jira/browse/HADOOP-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589398#comment-16589398
]
Rong Tang edited comment on HADOOP-15684 at 8/22/18 10:05 PM:
--------------------------------------------------------------
[~elgoiri] , thanks for your comments.
* Why do we need to fix the ports? Not a big fan of the retry approach.
** _The IPC ports are used to determine whether to enable log rolling._
** _Removed the retry._
* Add a couple of high-level comments on the invokedtimes approach.
** _Added_
* You should wrap the whole thing in a try-finally that calls shutdown, with a null check.
** _Moved the "cluster"-related code into a try-finally block._
Uploaded a new patch.
Do you know if there is a more convenient way to do code review than
plain-text comments?
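
For reference, the try-finally change discussed above follows roughly the pattern below. This is only a self-contained sketch, not the code in the patch: FakeCluster, startCluster, and runTest are hypothetical stand-ins for the real test cluster and test body.

```java
/** Sketch only: every name here is hypothetical and not taken from the patch. */
class ClusterCleanupSketch {

    /** Hypothetical stand-in for a test cluster that holds ports and threads. */
    static class FakeCluster {
        boolean running = true;

        void shutdown() {
            running = false;
        }
    }

    /** Exposed only so the outcome can be inspected after a run. */
    static FakeCluster lastCluster;

    /** Hypothetical factory that can fail, as building a real cluster might. */
    static FakeCluster startCluster(boolean failToStart) {
        if (failToStart) {
            throw new IllegalStateException("could not start cluster");
        }
        return new FakeCluster();
    }

    static void runTest(boolean failToStart, boolean failInBody) {
        lastCluster = null;
        FakeCluster cluster = null;
        try {
            // All cluster-related code lives inside the try block.
            cluster = startCluster(failToStart);
            lastCluster = cluster;
            if (failInBody) {
                throw new IllegalStateException("test body failed");
            }
        } finally {
            // The null check covers the case where startup itself threw.
            if (cluster != null) {
                cluster.shutdown();
            }
        }
    }

    public static void main(String[] args) {
        runTest(false, false);
        System.out.println("still running after normal run: " + lastCluster.running);
        try {
            runTest(false, true);
        } catch (IllegalStateException e) {
            System.out.println("still running after failed run: " + lastCluster.running);
        }
    }
}
```

The cluster is shut down whether the body succeeds or throws, and a failed startup does not cause a NullPointerException in the cleanup path.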
was (Author: trjianjianjiao):
[~elgoiri] , thanks for your comments.
* Why do we need to fix the ports? Not a big fan of the retry approach.
** _IPC ports are used to determine whether enabling roll log or not._
** _Removed the retry._
* Add a couple high level comments on the invokedtimes approach.
** _Added_
* You should wrap the whole thing in a try finally shutdown with a null check.
** _Moved the "cluster" related code into try-finally._
Uploaded a new patch.
Do you know if there is more convenient way for code review instead of giving
comments in texts?
> triggerActiveLogRoll stuck on dead name node, when ConnectTimeoutException
> happens.
> ------------------------------------------------------------------------------------
>
> Key: HADOOP-15684
> URL: https://issues.apache.org/jira/browse/HADOOP-15684
> Project: Hadoop Common
> Issue Type: Bug
> Components: ha
> Affects Versions: 3.0.0-alpha1
> Reporter: Rong Tang
> Assignee: Rong Tang
> Priority: Critical
> Attachments:
> 0001-RollEditLog-try-next-NN-when-exception-happens.patch,
> HADOOP-15684.000.patch, HADOOP-15684.001.patch,
> hadoop--rollingUpgrade-SourceMachine001.log
>
>
> When a name node calls triggerActiveLogRoll and the cachedActiveProxy points
> to a dead name node, the call throws a ConnectTimeoutException. The expected
> behavior is to try the next NN, but the current logic doesn't do so; instead,
> it keeps retrying the dead node, mistakenly treating it as the active one.
>
> 2018-08-17 10:02:12,001 WARN [Edit log tailer]
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a
> roll of the active NN
> org.apache.hadoop.net.ConnectTimeoutException: Call From
> SourceMachine001/SourceIP001 to TargetMachine001.ap.gbl:8020 failed on socket
> timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000
> millis timeout
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:298)
>
> C:\Users\rotang>ping TargetMachine001
> Pinging TargetMachine001[TargetIP001] with 32 bytes of data:
> Request timed out.
> Request timed out.
> Request timed out.
> Request timed out.
> Attached are a log file showing how it repeatedly retries the dead name node,
> and a patch with the fix.
> I replaced the actual machine names/IPs with SourceMachine001/SourceIP001 and
> TargetMachine001/TargetIP001.
>
> How to Repro:
> In a healthy set of running NNs, take down the active NN (don't let it come
> back during the test). The standby NNs will then keep trying the dead (old
> active) NN, because it is the cached one.
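
The expected behavior in the description above (drop the cached proxy on a connect timeout and try the next NN) can be sketched roughly as follows. This is a simplified, self-contained illustration and not the attached patch: NameNodeProxy, FakeNode, and LogRollFailoverSketch are hypothetical names, and java.net.SocketTimeoutException stands in for Hadoop's ConnectTimeoutException.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.Arrays;
import java.util.List;

/** Sketch only: none of these names are taken from the Hadoop source or the patch. */
class LogRollFailoverSketch {

    /** Hypothetical stand-in for an RPC proxy to one name node. */
    interface NameNodeProxy {
        String address();

        void rollEditLog() throws IOException;
    }

    /** Fake node for the demo: a dead node always times out. */
    static class FakeNode implements NameNodeProxy {
        final String address;
        final boolean alive;
        int attempts = 0;

        FakeNode(String address, boolean alive) {
            this.address = address;
            this.alive = alive;
        }

        public String address() {
            return address;
        }

        public void rollEditLog() throws IOException {
            attempts++;
            if (!alive) {
                throw new SocketTimeoutException("20000 millis timeout");
            }
        }
    }

    private final List<? extends NameNodeProxy> candidates;
    private NameNodeProxy cachedActiveProxy; // null means "no known active NN"
    private int nextIndex = 0;

    LogRollFailoverSketch(List<? extends NameNodeProxy> candidates) {
        this.candidates = candidates;
    }

    /**
     * Tries each candidate at most once per call. Returns the address of the
     * node that accepted the roll, or null if every candidate failed.
     */
    String triggerActiveLogRoll() {
        for (int attempt = 0; attempt < candidates.size(); attempt++) {
            if (cachedActiveProxy == null) {
                cachedActiveProxy = candidates.get(nextIndex);
                nextIndex = (nextIndex + 1) % candidates.size();
            }
            try {
                cachedActiveProxy.rollEditLog();
                return cachedActiveProxy.address();
            } catch (IOException e) {
                // Forget the failed proxy so the next iteration picks another NN
                // instead of retrying the same dead one.
                cachedActiveProxy = null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FakeNode dead = new FakeNode("nn1:8020", false);
        FakeNode live = new FakeNode("nn2:8020", true);
        LogRollFailoverSketch tailer = new LogRollFailoverSketch(Arrays.asList(dead, live));
        System.out.println("first roll handled by: " + tailer.triggerActiveLogRoll());
        System.out.println("second roll handled by: " + tailer.triggerActiveLogRoll());
        System.out.println("attempts against dead node: " + dead.attempts);
    }
}
```

The key point is that the cached proxy is cleared on failure, so the dead node is contacted once and later rolls go straight to the node that last succeeded.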