[ https://issues.apache.org/jira/browse/HDFS-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-5291:
----------------------------

    Description: 
In our test, we saw the NN go into safe mode immediately after transitioning to the 
active state. This can cause HBase region servers to time out and kill themselves. We 
should allow clients to retry when HA is enabled and the ANN is in SafeMode.
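
The fix proposed here is client-side: rather than failing fast on the safe-mode error, the client should keep retrying until the newly-active NN leaves safe mode. A minimal, hypothetical sketch of that behavior (the class and exception names below are illustrative, not Hadoop's actual retry-policy API):

```java
import java.util.concurrent.Callable;

public class SafeModeRetry {

    /** Stand-in for the remote "NameNode is in safe mode" error. */
    public static class RemoteSafeModeException extends Exception {}

    /**
     * Invokes the RPC, retrying with a fixed backoff while the server
     * reports safe mode, instead of failing the caller immediately.
     */
    public static <T> T callWithRetry(Callable<T> rpc, int maxRetries, long backoffMs)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return rpc.call();
            } catch (RemoteSafeModeException e) {
                if (attempt >= maxRetries) {
                    throw e;  // give up after the configured number of retries
                }
                Thread.sleep(backoffMs);  // wait for safe mode to lift, then retry
            }
        }
    }
}
```

With behavior like this, an HBase region server would ride out the short safe-mode window during a failover instead of timing out and killing itself.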

============================================
Some log snippets:

standby state to active transition
{code}
2013-10-02 00:13:49,482 INFO  ipc.Server (Server.java:run(2068)) - IPC Server 
handler 69 on 8020, call 
org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from IP:33911 
Call#1483 Retry#1: error: org.apache.hadoop.ipc.StandbyException: Operation 
category WRITE is not supported in state standby
2013-10-02 00:13:49,689 INFO  ipc.Server (Server.java:saslProcess(1342)) - Auth 
successful for nn/[email protected] (auth:SIMPLE)
2013-10-02 00:13:49,696 INFO  authorize.ServiceAuthorizationManager 
(ServiceAuthorizationManager.java:authorize(111)) - Authorization successful 
for nn/[email protected] (auth:KERBEROS) for protocol=interface 
org.apache.hadoop.ha.HAServiceProtocol
2013-10-02 00:13:49,700 INFO  namenode.FSNamesystem 
(FSNamesystem.java:stopStandbyServices(1013)) - Stopping services started for 
standby state
2013-10-02 00:13:49,701 WARN  ha.EditLogTailer (EditLogTailer.java:doWork(336)) 
- Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:334)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:356)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1463)
        at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:454)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTail
2013-10-02 00:13:49,704 INFO  namenode.FSNamesystem 
(FSNamesystem.java:startActiveServices(885)) - Starting services required for 
active state
2013-10-02 00:13:49,719 INFO  client.QuorumJournalManager 
(QuorumJournalManager.java:recoverUnfinalizedSegments(419)) - Starting recovery 
process for unclosed journal segments...
2013-10-02 00:13:49,755 INFO  ipc.Server (Server.java:saslProcess(1342)) - Auth 
successful for hbase/[email protected] (auth:SIMPLE)
2013-10-02 00:13:49,761 INFO  authorize.ServiceAuthorizationManager 
(ServiceAuthorizationManager.java:authorize(111)) - Authorization successful 
for hbase/[email protected] (auth:KERBEROS) for protocol=interface 
org.apache.hadoop.hdfs.protocol.ClientProtocol
2013-10-02 00:13:49,839 INFO  client.QuorumJournalManager 
(QuorumJournalManager.java:recoverUnfinalizedSegments(421)) - Successfully 
started new epoch 85
2013-10-02 00:13:49,839 INFO  client.QuorumJournalManager 
(QuorumJournalManager.java:recoverUnclosedSegment(249)) - Beginning recovery of 
unclosed segment starting at txid 887112
2013-10-02 00:13:49,874 INFO  client.QuorumJournalManager 
(QuorumJournalManager.java:recoverUnclosedSegment(258)) - Recovery prepare 
phase complete. Responses:
IP:8485: segmentState { startTxId: 887112 endTxId: 887531 isInProgress: true } 
lastWriterEpoch: 84 lastCommittedTxId: 887530
172.18.145.97:8485: segmentState { startTxId: 887112 endTxId: 887531 
isInProgress: true } lastWriterEpoch: 84 lastCommittedTxId: 887530
2013-10-02 00:13:49,875 INFO  client.QuorumJournalManager 
(QuorumJournalManager.java:recover
{code}


And then the NN goes into safe mode:

{code}
Construction[IP:1019|RBW]]} size 0
2013-10-02 00:13:50,277 INFO  BlockStateChange 
(BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap 
updated: IP:1019 is added to blk_IP157{blockUCState=UNDER_CONSTRUCTION, 
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[IP:1019|RBW], 
ReplicaUnderConstruction[172.18.145.96:1019|RBW], ReplicaUnde
rConstruction[IP:1019|RBW]]} size 0
2013-10-02 00:13:50,279 INFO  hdfs.StateChange 
(FSNamesystem.java:reportStatus(4703)) - STATE* Safe mode ON.
The reported blocks 1071 needs additional 5 blocks to reach the threshold 
1.0000 of total blocks 1075.
Safe mode will be turned off automatically
2013-10-02 00:13:50,279 INFO  BlockStateChange 
(BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap 
updated: IP:1019 is added to blk_IP158{blockUCState=UNDER_CONSTRUCTION, 
primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW], 
ReplicaUnderConstruction[172.18.145.97:1019|RBW], 
ReplicaUnderConstruction[IP:1019|RBW]]} size 0
2013-10-02 00:13:50,280 INFO  BlockStateChange 
(BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap 
updated: 172.18.145.99:1019 is added to 
blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW], 
ReplicaUnderConstruction[172.18.145.97:1019|RBW], 
ReplicaUnderConstruction[IP:1019|RBW]]} size 0
2013-10-02 00:13:50,281 INFO  BlockStateChange 
(BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap 
updated: 172.18.145.97:1019 is added to 
blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW], 
ReplicaUnderConstruction[172.18.145.97:1019|RBW], 
ReplicaUnderConstruction[IP:1019|RBW]]} size 0
{code}
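
For context, the safe-mode report above says 1071 of 1075 blocks were reported against a threshold of 1.0000. The gating condition amounts to a fraction comparison; a hypothetical sketch of the check (not the actual FSNamesystem logic):

```java
public class SafeModeThreshold {
    /**
     * Safe mode may lift once the fraction of blocks reported by
     * DataNodes reaches the configured threshold.
     */
    static boolean thresholdReached(long reportedBlocks, long totalBlocks,
                                    double threshold) {
        // An empty namespace trivially satisfies any threshold.
        return totalBlocks == 0
            || (double) reportedBlocks / totalBlocks >= threshold;
    }
}
```

With the numbers from the log, `thresholdReached(1071, 1075, 1.0)` is false, so the NN stays in safe mode (rejecting writes) until the remaining block reports arrive.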

> Allow clients to retry when Active NN is in SafeMode
> ----------------------------------------------------
>
>                 Key: HDFS-5291
>                 URL: https://issues.apache.org/jira/browse/HDFS-5291
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.1.1-beta
>            Reporter: Arpit Gupta
>            Assignee: Jing Zhao
>            Priority: Critical
>             Fix For: 2.2.1
>
>         Attachments: HDFS-5291.000.patch, HDFS-5291.001.patch, 
> HDFS-5291.002.patch, HDFS-5291.003.patch, HDFS-5291.003.patch, nn.log
>
>



--
This message was sent by Atlassian JIRA
(v6.1#6144)
