[
https://issues.apache.org/jira/browse/HDFS-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jing Zhao updated HDFS-5291:
----------------------------
Summary: Clients need to retry when Active NN is in SafeMode (was: Allow
clients to retry when Active NN is in SafeMode)
> Clients need to retry when Active NN is in SafeMode
> ---------------------------------------------------
>
> Key: HDFS-5291
> URL: https://issues.apache.org/jira/browse/HDFS-5291
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.1.1-beta
> Reporter: Arpit Gupta
> Assignee: Jing Zhao
> Priority: Critical
> Fix For: 2.2.1
>
> Attachments: HDFS-5291.000.patch, HDFS-5291.001.patch,
> HDFS-5291.002.patch, HDFS-5291.003.patch, HDFS-5291.003.patch, nn.log
>
>
> In our test, we saw the NN immediately go into safe mode after transitioning
> to the active state. This can cause HBase region servers to time out and kill
> themselves. We should allow clients to retry when HA is enabled and the Active
> NN is in SafeMode.
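> Below is a minimal client-side sketch (not the actual HDFS-5291 patch) of the
> retry behavior we want: keep retrying a write while the NN rejects it with a
> SafeModeException, instead of failing fast. The retry count, sleep interval,
> and test path are illustrative assumptions.
> {code}
> // Hedged sketch: retry an HDFS write while the (newly) active NN is still in
> // safe mode after a failover. MAX_RETRIES, SLEEP_MS and the path are assumptions.
> import java.io.IOException;
> import java.util.concurrent.TimeUnit;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.ipc.RemoteException;
>
> public class SafeModeRetryClient {
>   private static final int MAX_RETRIES = 10;
>   private static final long SLEEP_MS = 5000;
>
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     Path p = new Path("/tmp/safemode-retry-check");  // illustrative path
>     for (int attempt = 0; ; attempt++) {
>       try {
>         fs.create(p).close();  // write ops are rejected while the NN is in safe mode
>         break;                 // success: the NN has left safe mode
>       } catch (IOException e) {
>         // SafeModeException may arrive unwrapped or wrapped in a RemoteException.
>         boolean safeMode = e.getClass().getName().contains("SafeModeException")
>             || (e instanceof RemoteException
>                 && ((RemoteException) e).getClassName().contains("SafeModeException"));
>         if (!safeMode || attempt >= MAX_RETRIES) {
>           throw e;             // not a safe-mode rejection, or out of retries
>         }
>         TimeUnit.MILLISECONDS.sleep(SLEEP_MS);  // wait and try again
>       }
>     }
>     fs.delete(p, true);
>     fs.close();
>   }
> }
> {code}
> The real fix presumably belongs in the NN/DFSClient retry handling rather than
> in every application; the loop above only illustrates the desired behavior.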
> ============================================
> Some log snippets:
> Standby-to-active state transition:
> {code}
> 2013-10-02 00:13:49,482 INFO ipc.Server (Server.java:run(2068)) - IPC Server
> handler 69 on 8020, call
> org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from IP:33911
> Call#1483 Retry#1: error: org.apache.hadoop.ipc.StandbyException: Operation
> category WRITE is not supported in state standby
> 2013-10-02 00:13:49,689 INFO ipc.Server (Server.java:saslProcess(1342)) -
> Auth successful for nn/[email protected] (auth:SIMPLE)
> 2013-10-02 00:13:49,696 INFO authorize.ServiceAuthorizationManager
> (ServiceAuthorizationManager.java:authorize(111)) - Authorization successful
> for nn/[email protected] (auth:KERBEROS) for protocol=interface
> org.apache.hadoop.ha.HAServiceProtocol
> 2013-10-02 00:13:49,700 INFO namenode.FSNamesystem
> (FSNamesystem.java:stopStandbyServices(1013)) - Stopping services started for
> standby state
> 2013-10-02 00:13:49,701 WARN ha.EditLogTailer
> (EditLogTailer.java:doWork(336)) - Edit log tailer interrupted
> java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:334)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:356)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1463)
> at
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:454)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTail
> 2013-10-02 00:13:49,704 INFO namenode.FSNamesystem
> (FSNamesystem.java:startActiveServices(885)) - Starting services required for
> active state
> 2013-10-02 00:13:49,719 INFO client.QuorumJournalManager
> (QuorumJournalManager.java:recoverUnfinalizedSegments(419)) - Starting
> recovery process for unclosed journal segments...
> 2013-10-02 00:13:49,755 INFO ipc.Server (Server.java:saslProcess(1342)) -
> Auth successful for hbase/[email protected] (auth:SIMPLE)
> 2013-10-02 00:13:49,761 INFO authorize.ServiceAuthorizationManager
> (ServiceAuthorizationManager.java:authorize(111)) - Authorization successful
> for hbase/[email protected] (auth:KERBEROS) for protocol=interface
> org.apache.hadoop.hdfs.protocol.ClientProtocol
> 2013-10-02 00:13:49,839 INFO client.QuorumJournalManager
> (QuorumJournalManager.java:recoverUnfinalizedSegments(421)) - Successfully
> started new epoch 85
> 2013-10-02 00:13:49,839 INFO client.QuorumJournalManager
> (QuorumJournalManager.java:recoverUnclosedSegment(249)) - Beginning recovery
> of unclosed segment starting at txid 887112
> 2013-10-02 00:13:49,874 INFO client.QuorumJournalManager
> (QuorumJournalManager.java:recoverUnclosedSegment(258)) - Recovery prepare
> phase complete. Responses:
> IP:8485: segmentState { startTxId: 887112 endTxId: 887531 isInProgress: true
> } lastWriterEpoch: 84 lastCommittedTxId: 887530
> 172.18.145.97:8485: segmentState { startTxId: 887112 endTxId: 887531
> isInProgress: true } lastWriterEpoch: 84 lastCommittedTxId: 887530
> 2013-10-02 00:13:49,875 INFO client.QuorumJournalManager
> (QuorumJournalManager.java:recover
> {code}
> And then the NN goes into safe mode:
> {code}
> Construction[IP:1019|RBW]]} size 0
> 2013-10-02 00:13:50,277 INFO BlockStateChange
> (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap
> updated: IP:1019 is added to blk_IP157{blockUCState=UNDER_CONSTRUCTION,
> primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[IP:1019|RBW],
> ReplicaUnderConstruction[172.18.145.96:1019|RBW], ReplicaUnde
> rConstruction[IP:1019|RBW]]} size 0
> 2013-10-02 00:13:50,279 INFO hdfs.StateChange
> (FSNamesystem.java:reportStatus(4703)) - STATE* Safe mode ON.
> The reported blocks 1071 needs additional 5 blocks to reach the threshold
> 1.0000 of total blocks 1075.
> Safe mode will be turned off automatically
> 2013-10-02 00:13:50,279 INFO BlockStateChange
> (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap
> updated: IP:1019 is added to blk_IP158{blockUCState=UNDER_CONSTRUCTION,
> primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW],
> ReplicaUnderConstruction[172.18.145.97:1019|RBW],
> ReplicaUnderConstruction[IP:1019|RBW]]} size 0
> 2013-10-02 00:13:50,280 INFO BlockStateChange
> (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap
> updated: 172.18.145.99:1019 is added to
> blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW],
> ReplicaUnderConstruction[172.18.145.97:1019|RBW],
> ReplicaUnderConstruction[IP:1019|RBW]]} size 0
> 2013-10-02 00:13:50,281 INFO BlockStateChange
> (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap
> updated: 172.18.145.97:1019 is added to
> blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW],
> ReplicaUnderConstruction[172.18.145.97:1019|RBW],
> ReplicaUnderConstruction[IP:1019|RBW]]} size 0
> {code}
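> For completeness, here is a hedged sketch of how a client or test could wait
> for the state shown above ("Safe mode ON ... will be turned off automatically")
> to clear before resuming writes; it is the programmatic analogue of
> {{hdfs dfsadmin -safemode wait}}. The 5-second poll interval is an assumption,
> and fs.defaultFS is assumed to point at the HA HDFS cluster.
> {code}
> // Hedged sketch: poll the NameNode's safe mode state via SAFEMODE_GET,
> // which only queries the state and never changes it.
> import java.util.concurrent.TimeUnit;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
> import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;
>
> public class WaitForSafeModeExit {
>   public static void main(String[] args) throws Exception {
>     DistributedFileSystem dfs =
>         (DistributedFileSystem) FileSystem.get(new Configuration());
>     while (dfs.setSafeMode(SafeModeAction.SAFEMODE_GET)) {  // true = in safe mode
>       System.out.println("NameNode still in safe mode, waiting...");
>       TimeUnit.SECONDS.sleep(5);
>     }
>     System.out.println("Safe mode is OFF; clients can proceed.");
>     dfs.close();
>   }
> }
> {code}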
--
This message was sent by Atlassian JIRA
(v6.1#6144)