[ https://issues.apache.org/jira/browse/AMBARI-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575449#comment-14575449 ]

Alejandro Fernandez commented on AMBARI-11743:
----------------------------------------------

In Ambari 1.7.0,
https://github.com/apache/ambari/blob/branch-1.7.0/ambari-server/src/main/resources/stacks/HDP/2.0.6/services/HDFS/package/scripts/hdfs_namenode.py
Ambari would *never* force NameNode to leave safemode.

In Ambari 2.0.0,
https://github.com/apache/ambari/blob/branch-2.0.0/ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py
Ambari would *force* NameNode to leave safemode under certain conditions. This 
logic was added to satisfy Rolling Upgrade (RU) requirements, but it ran 
regardless of whether an RU was in progress.
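Roughly, the force-exit behavior amounts to the following (a hypothetical Python sketch, not Ambari's actual code; `safemode_is_on` and `force_leave_safemode` are illustrative names):

```python
import subprocess

def safemode_is_on(dfsadmin_output):
    # Parse the output of `hdfs dfsadmin -safemode get`, which reports
    # either "Safe mode is ON" or "Safe mode is OFF".
    return "Safe mode is ON" in dfsadmin_output

def force_leave_safemode():
    # Sketch of the Ambari 2.0.0-era behavior: if NameNode is still in
    # safemode, force it out instead of waiting for DataNodes to report
    # enough block replicas.
    state = subprocess.check_output(
        ["hdfs", "dfsadmin", "-safemode", "get"]).decode()
    if safemode_is_on(state):
        subprocess.check_call(["hdfs", "dfsadmin", "-safemode", "leave"])
```

Forcing the exit this way is what lets dependent services (such as HBase Master) start reading from HDFS before enough block replicas are actually available.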

In Ambari 2.1.0,
https://github.com/apache/ambari/blob/branch-2.1/ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py
the performance of HDFS commands improved, so Ambari now spends less time 
between starting NameNode, checking its safemode state, and forcing it to 
leave. I believe this change is what is surfacing these latent issues.

Starting NameNode and waiting for safemode to turn OFF should be independent of 
whether an RU is happening. However, RU runs HistoryServer start and the MR 
Service Check immediately after starting NameNode, and those 2 steps require 
that NameNode's safemode be OFF.

In summary, I believe the fix is to wait longer for NameNode to reach safemode 
OFF, waiting up to 10 mins (since more than 10 would cause the step to time 
out). If NameNode is still in safemode after 10 mins, it is up to the user to 
retry any subsequent steps. During RU, the user is allowed to retry 
HistoryServer start and the MR Service Check.
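The proposed wait could be sketched as the loop below (a minimal sketch; the function and parameter names are mine, not Ambari's, and the state getter and sleep are injected so the loop can run without a live cluster):

```python
import time

def wait_for_safemode_off(get_state, timeout=600, interval=15, sleep=time.sleep):
    # Poll until NameNode reports "Safe mode is OFF" or `timeout` seconds
    # elapse. `get_state` is any callable returning the output of
    # `hdfs dfsadmin -safemode get`. The 600 s default matches the
    # proposed 10-minute cap.
    waited = 0
    while waited < timeout:
        if "Safe mode is OFF" in get_state():
            return True
        sleep(interval)
        waited += interval
    # Still in safemode: give up and leave it to the user to retry the
    # subsequent steps (e.g. HistoryServer start, MR Service Check).
    return False
```

On a real cluster `get_state` would shell out to `hdfs dfsadmin -safemode get`; alternatively, `hdfs dfsadmin -safemode wait` blocks until safemode is off, but an explicit polling loop makes the 10-minute cap straightforward to enforce.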

> NameNode is forced to leave safemode, which causes HBase Master to crash 
> if done too quickly
> -----------------------------------------------------------------------------------------------
>
>                 Key: AMBARI-11743
>                 URL: https://issues.apache.org/jira/browse/AMBARI-11743
>             Project: Ambari
>          Issue Type: Bug
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>
> 1. Install cluster with Ambari 2.1 and HDP 2.3
> 2. Add services HDFS, YARN, MR, ZK, and HBase
> 3. Perform several Stop All and Start All on HDFS service
> 4. Periodically, HBase Master will crash
> This was a non-HA cluster.
> {code}
> 2015-06-02 09:34:24,865 WARN  [ip-172-31-33-225:16000.activeMasterManager] 
> hdfs.DFSClient: Could not obtain block: 
> BP-925466282-172.31.33.226-1433234647051:blk_1073741829_1005 
> file=/apps/hbase/data/hbase.id No live nodes contain current block Block 
> locations: Dead nodes: . Throwing a BlockMissingException
> 2015-06-02 09:34:24,866 WARN  [ip-172-31-33-225:16000.activeMasterManager] 
> hdfs.DFSClient: DFS Read
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: 
> BP-925466282-172.31.33.226-1433234647051:blk_1073741829_1005 
> file=/apps/hbase/data/hbase.id
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:945)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:604)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:844)
>       at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
>       at java.io.DataInputStream.readFully(DataInputStream.java:195)
>       at java.io.DataInputStream.readFully(DataInputStream.java:169)
>       at org.apache.hadoop.hbase.util.FSUtils.getClusterId(FSUtils.java:816)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.checkRootDir(MasterFileSystem.java:474)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.createInitialFileSystemLayout(MasterFileSystem.java:146)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.<init>(MasterFileSystem.java:126)
>       at 
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:649)
>       at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:182)
>       at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1646)
>       at java.lang.Thread.run(Thread.java:745)
> 2015-06-02 09:34:24,870 FATAL [ip-172-31-33-225:16000.activeMasterManager] 
> master.HMaster: Failed to become active master
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: 
> BP-925466282-172.31.33.226-1433234647051:blk_1073741829_1005 
> file=/apps/hbase/data/hbase.id
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:945)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:604)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:844)
>       at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
>       at java.io.DataInputStream.readFully(DataInputStream.java:195)
>       at java.io.DataInputStream.readFully(DataInputStream.java:169)
>       at org.apache.hadoop.hbase.util.FSUtils.getClusterId(FSUtils.java:816)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.checkRootDir(MasterFileSystem.java:474)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.createInitialFileSystemLayout(MasterFileSystem.java:146)
>       at 
> org.apache.hadoop.hbase.master.MasterFileSystem.<init>(MasterFileSystem.java:126)
>       at 
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:649)
>       at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:182)
>       at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1646)
>       at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
