[
https://issues.apache.org/jira/browse/HDFS-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787958#comment-13787958
]
Jing Zhao commented on HDFS-5291:
---------------------------------
Vinay, sorry I forgot to add more description. In our test, we saw NN
immediately went into safemode after transitioning to active state. So the
issue is not that the SBN throws SafeModeException. So the current patch only
handles "2. Active when throws safemode exception, the client must retry".
And the above safemode issue is not rare in our test. I'm guessing it may be
because that while the SBN tails the edit log, it keeps checking the safemode
for the updateBlocks call, and always grabs the FSNamesystem write lock. Thus
it is very possible that the SBN stays in safemode for a long time. And even if
it enters the safemode extension time, since the safemode monitor thread also
needs to acquire the FSNamesystem write lock, the extension time may be longer.
Thus when the SBN transitions to active, the safemode object is not null, and
triggers the safemode when checkMode is called.
> Standby namenode after transition to active goes into safemode
> --------------------------------------------------------------
>
> Key: HDFS-5291
> URL: https://issues.apache.org/jira/browse/HDFS-5291
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.1.1-beta
> Reporter: Arpit Gupta
> Assignee: Jing Zhao
> Priority: Critical
> Attachments: HDFS-5291.000.patch, HDFS-5291.001.patch, nn.log
>
>
> Some log snippets
> standby state to active transition
> {code}
> 2013-10-02 00:13:49,482 INFO ipc.Server (Server.java:run(2068)) - IPC Server
> handler 69 on 8020, call
> org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from IP:33911
> Call#1483 Retry#1: error: org.apache.hadoop.ipc.StandbyException: Operation
> category WRITE is not supported in state standby
> 2013-10-02 00:13:49,689 INFO ipc.Server (Server.java:saslProcess(1342)) -
> Auth successful for nn/[email protected] (auth:SIMPLE)
> 2013-10-02 00:13:49,696 INFO authorize.ServiceAuthorizationManager
> (ServiceAuthorizationManager.java:authorize(111)) - Authorization successful
> for nn/[email protected] (auth:KERBEROS) for protocol=interface
> org.apache.hadoop.ha.HAServiceProtocol
> 2013-10-02 00:13:49,700 INFO namenode.FSNamesystem
> (FSNamesystem.java:stopStandbyServices(1013)) - Stopping services started for
> standby state
> 2013-10-02 00:13:49,701 WARN ha.EditLogTailer
> (EditLogTailer.java:doWork(336)) - Edit log tailer interrupted
> java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:334)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:356)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1463)
> at
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:454)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTail
> 2013-10-02 00:13:49,704 INFO namenode.FSNamesystem
> (FSNamesystem.java:startActiveServices(885)) - Starting services required for
> active state
> 2013-10-02 00:13:49,719 INFO client.QuorumJournalManager
> (QuorumJournalManager.java:recoverUnfinalizedSegments(419)) - Starting
> recovery process for unclosed journal segments...
> 2013-10-02 00:13:49,755 INFO ipc.Server (Server.java:saslProcess(1342)) -
> Auth successful for hbase/[email protected] (auth:SIMPLE)
> 2013-10-02 00:13:49,761 INFO authorize.ServiceAuthorizationManager
> (ServiceAuthorizationManager.java:authorize(111)) - Authorization successful
> for hbase/[email protected] (auth:KERBEROS) for protocol=interface
> org.apache.hadoop.hdfs.protocol.ClientProtocol
> 2013-10-02 00:13:49,839 INFO client.QuorumJournalManager
> (QuorumJournalManager.java:recoverUnfinalizedSegments(421)) - Successfully
> started new epoch 85
> 2013-10-02 00:13:49,839 INFO client.QuorumJournalManager
> (QuorumJournalManager.java:recoverUnclosedSegment(249)) - Beginning recovery
> of unclosed segment starting at txid 887112
> 2013-10-02 00:13:49,874 INFO client.QuorumJournalManager
> (QuorumJournalManager.java:recoverUnclosedSegment(258)) - Recovery prepare
> phase complete. Responses:
> IP:8485: segmentState { startTxId: 887112 endTxId: 887531 isInProgress: true
> } lastWriterEpoch: 84 lastCommittedTxId: 887530
> 172.18.145.97:8485: segmentState { startTxId: 887112 endTxId: 887531
> isInProgress: true } lastWriterEpoch: 84 lastCommittedTxId: 887530
> 2013-10-02 00:13:49,875 INFO client.QuorumJournalManager
> (QuorumJournalManager.java:recover
> {code}
> And then we get into safemode
> {code}
> Construction[IP:1019|RBW]]} size 0
> 2013-10-02 00:13:50,277 INFO BlockStateChange
> (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap
> updated: IP:1019 is added to blk_IP157{blockUCState=UNDER_CONSTRUCTION,
> primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[IP:1019|RBW],
> ReplicaUnderConstruction[172.18.145.96:1019|RBW], ReplicaUnde
> rConstruction[IP:1019|RBW]]} size 0
> 2013-10-02 00:13:50,279 INFO hdfs.StateChange
> (FSNamesystem.java:reportStatus(4703)) - STATE* Safe mode ON.
> The reported blocks 1071 needs additional 5 blocks to reach the threshold
> 1.0000 of total blocks 1075.
> Safe mode will be turned off automatically
> 2013-10-02 00:13:50,279 INFO BlockStateChange
> (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap
> updated: IP:1019 is added to blk_IP158{blockUCState=UNDER_CONSTRUCTION,
> primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW],
> ReplicaUnderConstruction[172.18.145.97:1019|RBW],
> ReplicaUnderConstruction[IP:1019|RBW]]} size 0
> 2013-10-02 00:13:50,280 INFO BlockStateChange
> (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap
> updated: 172.18.145.99:1019 is added to
> blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW],
> ReplicaUnderConstruction[172.18.145.97:1019|RBW],
> ReplicaUnderConstruction[IP:1019|RBW]]} size 0
> 2013-10-02 00:13:50,281 INFO BlockStateChange
> (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap
> updated: 172.18.145.97:1019 is added to
> blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW],
> ReplicaUnderConstruction[172.18.145.97:1019|RBW],
> ReplicaUnderConstruction[IP:1019|RBW]]} size 0
> {code}
--
This message was sent by Atlassian JIRA
(v6.1#6144)