[jira] [Updated] (HDFS-3908) In HA mode, when there is a ledger in BK, which is generated after the last checkpoint, missing, NN can't recover it.

Han Xiao (JIRA) Mon, 10 Sep 2012 03:47:12 -0700

     [ https://issues.apache.org/jira/browse/HDFS-3908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Han Xiao updated HDFS-3908:
---------------------------

    Description: 
In non-HA mode, when more than one edits.dir is configured, a missing editlog 
file in one dir causes no problem, because the other dir holds a replica. 
In HA mode (using BK as shared storage), however, a missing ledger is not 
restored during NN startup, even if the corresponding editlog file exists in 
the local dir.
The ledger stays missing while the NN is in standby state. When the NN enters 
active state, it reads the local editlog file that corresponds to the missing 
ledger. Unfortunately, the ledgers that follow the missing one in BK cannot be 
read at that phase (because of the gap).
Therefore, in the following situation the editlog cannot be restored, even 
though each missing segment still exists either in BK or in the local dir:

1. fsimage file: fsimage_0000000000000005946.md5
2. Ledgers in ZK:
        [zk: localhost:2181(CONNECTED) 0] ls /hdfsEdit/ledgers/edits_00000000000000594
        edits_000000000000005941_000000000000005942
        edits_000000000000005943_000000000000005944
        edits_000000000000005945_000000000000005946
        edits_000000000000005949_000000000000005949
        (missing: edits_000000000000005947_000000000000005948)
3. Editlog files in the local editlog dir:
        -rw-r--r-- 1 root root      30 Sep  8 03:24 edits_0000000000000005947-0000000000000005948
        -rw-r--r-- 1 root root 1048576 Sep  8 03:35 edits_0000000000000005950-0000000000000005950
        -rw-r--r-- 1 root root 1048576 Sep  8 04:42 edits_0000000000000005951-0000000000000005951
        (missing: edits_0000000000000005949-0000000000000005949)
4. seen_txid:
        vm2:/tmp/hadoop-root/dfs/name/current # cat seen_txid
        5949
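
As a rough cross-check of the listings above, the following Python sketch 
tabulates the segment ranges per source and computes which txids between the 
image and seen_txid each source lacks. The variable and function names are 
made up for illustration; they are not HDFS or BookKeeper identifiers.

```python
# Illustrative model only; these names are not HDFS or BookKeeper APIs.
IMAGE_TXID = 5946   # last txid covered by the fsimage file
SEEN_TXID = 5949    # content of seen_txid

# (first_txid, last_txid) of each segment, copied from the listings above.
bk_segments = [(5941, 5942), (5943, 5944), (5945, 5946), (5949, 5949)]
local_segments = [(5947, 5948), (5950, 5950), (5951, 5951)]


def covered(segments):
    """Return the set of txids contained in the given segments."""
    return {txid for first, last in segments for txid in range(first, last + 1)}


# Transactions that must be replayed on top of the image.
needed = set(range(IMAGE_TXID + 1, SEEN_TXID + 1))

missing_in_bk = sorted(needed - covered(bk_segments))
missing_in_local = sorted(needed - covered(local_segments))
missing_in_union = sorted(needed - covered(bk_segments + local_segments))

print("missing in BK:    ", missing_in_bk)      # [5947, 5948]
print("missing in local: ", missing_in_local)   # [5949]
print("missing in union: ", missing_in_union)   # []
```

Neither source is complete on its own, but the two together cover every 
needed txid, which is why a recovery that combined them could succeed.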

Here, we want to restore the editlog from txid 5946 (image) to txid 
5949 (seen_txid). Segment 5947-5948 is missing in BK, and segment 5949-5949 is 
missing in the local dir.
When the NN is started and transitions to active, the following exception is 
thrown:

2012-09-08 06:26:10,031 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Error encountered requiring NN shutdown. Shutting down immediately.
java.io.IOException: There appears to be a gap in the edit log.  We expected txid 5949, but got txid 5950.
        at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:163)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:93)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:692)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:223)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.catchupDuringFailover(EditLogTailer.java:182)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:599)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1325)
        at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
        at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1233)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:990)
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
        at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:924)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
2012-09-08 06:26:10,036 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at vm2/160.161.0.155
************************************************************/
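
The failure described above can be reproduced with a small toy model. It is a 
sketch under two assumptions, not the actual HDFS code path: (a) when asked for 
edits starting at a txid, the BK journal offers nothing once it finds a hole at 
that position, as the report describes; (b) the local journal offers every 
later segment, which is inferred from the "got txid 5950" in the log, since 
5950 exists only locally. All function names are hypothetical.

```python
# Toy model of the failure; none of these names are HDFS or BookKeeper APIs.
BK = [(5941, 5942), (5943, 5944), (5945, 5946), (5949, 5949)]
LOCAL = [(5947, 5948), (5950, 5950), (5951, 5951)]


def select_from(segments, from_txid, stop_at_gap):
    """Return the segments a journal offers when asked for edits from from_txid.

    stop_at_gap=True:  the journal refuses to skip over a hole (assumed BK case).
    stop_at_gap=False: the journal offers every later segment (assumed local case).
    """
    chosen, expected = [], from_txid
    for first, last in sorted(segments):
        if last < from_txid:
            continue            # entirely before the requested txid
        if stop_at_gap and first > expected:
            break               # hole in this journal: offer nothing further
        chosen.append((first, last))
        expected = last + 1
    return chosen


def replay(segments, from_txid):
    """Apply segments in txid order and fail on the first gap."""
    expected = from_txid
    for first, last in sorted(segments):
        if first != expected:
            raise IOError("There appears to be a gap in the edit log.  "
                          f"We expected txid {expected}, but got txid {first}.")
        expected = last + 1
    return expected - 1         # last txid applied


def try_replay(segments, from_txid):
    """Return the last applied txid, or the error message if replay fails."""
    try:
        return replay(segments, from_txid)
    except IOError as err:
        return str(err)


# Reported behaviour: BK contributes nothing from 5947 on, local has no 5949.
offered = select_from(BK, 5947, True) + select_from(LOCAL, 5947, False)
observed = try_replay(offered, 5947)

# If segments from both journals were merged, the sequence would be complete.
merged = select_from(BK, 5947, False) + select_from(LOCAL, 5947, False)
recovered = try_replay(merged, 5947)

print(observed)
print(recovered)
```

Under these assumptions the first replay fails with the same message as in the 
log, while the merged replay runs through txid 5951. The model ignores 
overlapping copies of the same segment, which a real merge would have to 
deduplicate.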


    
> In HA mode, when there is a ledger in BK, which is generated after the last 
> checkpoint, missing, NN can't recover it.
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3908
>                 URL: https://issues.apache.org/jira/browse/HDFS-3908
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 2.0.1-alpha
>            Reporter: Han Xiao
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
