[
https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583796#comment-17583796
]
ASF GitHub Bot commented on HDFS-16689:
---------------------------------------
abhishekkarigar commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1224559583
@ZanderXu
Hi, one more question.
I am setting up an HA NameNode on Kubernetes.
On the standby NameNode, I ran bootstrapStandby:
$ hdfs namenode -bootstrapStandby
and I got the following error:
}
{"name":"org.apache.hadoop.hdfs.server.namenode.NameNode","time":1661279164420,"date":"2022-08-23 18:26:04,420","level":"INFO","thread":"main","message":"registered UNIX signal handlers for [TERM, HUP, INT]"}
{"name":"org.apache.hadoop.hdfs.server.namenode.NameNode","time":1661279164532,"date":"2022-08-23 18:26:04,532","level":"INFO","thread":"main","message":"createNameNode [-bootstrapStandby]"}
{"name":"org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby","time":1661279164846,"date":"2022-08-23 18:26:04,846","level":"INFO","thread":"main","message":"Found nn: apache-hadoop-namenode-0.apache-hadoop-namenode.nom-backend.svc.cluster.local, ipc: hdfs:8020"}
{"name":"org.apache.hadoop.hdfs.server.namenode.NameNode","time":1661279164847,"date":"2022-08-23 18:26:04,847","level":"ERROR","thread":"main","message":"Failed to start namenode.","exceptionclass":"java.io.IOException","stack":["java.io.IOException: java.lang.NullPointerException","\tat org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:549)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1741)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1834)","Caused by: java.lang.NullPointerException","\tat org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.parseConfAndFindOtherNN(BootstrapStandby.java:435)","\tat org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:114)","\tat org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)","\tat org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:95)","\tat org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:544)","\t... 2 more"]}
{"name":"org.apache.hadoop.util.ExitUtil","time":1661279164850,"date":"2022-08-23 18:26:04,850","level":"INFO","thread":"main","message":"Exiting with status 1: java.io.IOException: java.lang.NullPointerException"}
{"name":"org.apache.hadoop.hdfs.server.namenode.NameNode","time":1661279164852,"date":"2022-08-23 18:26:04,852","level":"INFO","thread":"shutdown-hook-0","message":"SHUTDOWN_MSG: \n/************************************************************\nSHUTDOWN_MSG: Shutting down NameNode at apache-hadoop-namenode-1.apache-hadoop-namenode.nom-backend.svc.cluster.local/10.129.2.45\n
> Standby NameNode crashes when transitioning to Active with in-progress tailer
> -----------------------------------------------------------------------------
>
> Key: HDFS-16689
> URL: https://issues.apache.org/jira/browse/HDFS-16689
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Standby NameNode crashes when transitioning to Active with an in-progress
> tailer. The error message looks like this:
> {code:java}
> Caused by: java.lang.IllegalStateException: Cannot start writing at txid X when there is a stream available for read: ByteStringEditLog[X, Y], ByteStringEditLog[X, 0]
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
>     ... 36 more
> {code}
> After tracing, I found a critical bug in
> *EditLogTailer#catchupDuringFailover()* when
> *DFS_HA_TAILEDITS_INPROGRESS_KEY* is true: *catchupDuringFailover()*
> tries to replay all missed edits from the JournalNodes with
> *onlyDurableTxns=true*, so it may be unable to replay any edits when some
> JournalNodes are abnormal.
> To reproduce, suppose:
> - There are 2 NameNodes, NN0 and NN1, in the Active and Standby states
> respectively, and 3 JournalNodes, JN0, JN1 and JN2.
> - NN0 tries to sync 3 edits to the JNs starting at txid 3, but only
> successfully syncs them to JN1 and JN2. JN0 is abnormal at that point (for
> example a GC pause, a bad network, or a restart).
> - NN1's lastAppliedTxId is 2, and at that moment a failover of the active
> role from NN0 to NN1 is triggered.
> - NN1 gets only two responses, from JN0 and JN1, when it tries to select
> input streams with *fromTxnId=3* and *onlyDurableTxns=true*, and the txid
> counts in the responses are 0 and 3 respectively. JN2 is now abnormal (for
> example a GC pause, a bad network, or a restart).
> - NN1 cannot replay any edits with *fromTxnId=3* from the JournalNodes
> because *maxAllowedTxns* is 0.
> So I think the Standby NameNode should run *catchupDuringFailover()* with
> *onlyDurableTxns=false*, so that it can replay all missed edits from the
> JournalNodes.
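
The arithmetic behind the reproduce steps quoted above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the actual Hadoop code: the class DurableTxnSketch and the method maxAllowedTxns are hypothetical names, and the rule assumed here is that with onlyDurableTxns=true a reader may only replay as many transactions as a majority of all JournalNodes is known to hold, while with onlyDurableTxns=false it may replay up to the highest count that any responder reported.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; not taken from the Hadoop source tree.
public class DurableTxnSketch {

    /**
     * Returns how many transactions may be replayed, given the txn counts
     * reported by the JournalNodes that responded.
     *
     * @param responseCounts    txn count reported by each responding JournalNode
     * @param totalJournalNodes number of JournalNodes in the whole quorum
     * @param onlyDurableTxns   if true, count only txns known to be on a majority
     */
    public static int maxAllowedTxns(List<Integer> responseCounts,
                                     int totalJournalNodes,
                                     boolean onlyDurableTxns) {
        int majority = totalJournalNodes / 2 + 1;
        if (responseCounts.size() < majority) {
            throw new IllegalStateException("not enough responses for a quorum");
        }
        List<Integer> sorted = new ArrayList<>(responseCounts);
        Collections.sort(sorted); // ascending order
        if (!onlyDurableTxns) {
            // Trust the responder that reported the most transactions.
            return sorted.get(sorted.size() - 1);
        }
        // Largest count that at least `majority` responders can all serve.
        return sorted.get(sorted.size() - majority);
    }

    public static void main(String[] args) {
        // Scenario from the description: JN0 reports 0 txns, JN1 reports 3,
        // and JN2 does not respond.
        List<Integer> counts = Arrays.asList(0, 3);
        System.out.println(maxAllowedTxns(counts, 3, true));  // prints 0
        System.out.println(maxAllowedTxns(counts, 3, false)); // prints 3
    }
}
```

With the two responses from the scenario, the durable-only rule yields 0, so nothing starting at txid 3 can be replayed, whereas the non-durable rule yields 3, which covers the edits that NN1 missed.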