Pankaj Kumar created HBASE-18167:
------------------------------------

             Summary: OfflineMetaRepair tool may cause HMaster abort always
                 Key: HBASE-18167
                 URL: https://issues.apache.org/jira/browse/HBASE-18167
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 1.3.1, 1.4.0, 1.3.2
            Reporter: Pankaj Kumar
            Assignee: Pankaj Kumar
            Priority: Critical


In a production environment, we hit a scenario where some HFile blocks of the 
meta table were missing for an unknown reason.
To recover, we rebuilt meta using the OfflineMetaRepair tool and restarted the 
cluster, but HMaster could not finish its initialization. 
It always timed out because the namespace table region was never assigned.

Steps to reproduce
==================
1. Assign the meta table region to HMaster (meta can be on any RS; this setting 
is only used to reproduce the scenario)
{noformat}
        <property>
            <name>hbase.balancer.tablesOnMaster</name>
            <value>hbase:meta</value>
        </property>
{noformat}
2. Start HMaster and RegionServer
3. Create two namespaces, say "ns1" and "ns2"
4. Create two tables, "ns1:t1" and "ns2:t1"
5. flush 'hbase:meta'
6. Stop HMaster (graceful shutdown)
7. Kill -9 RegionServer (abnormal shutdown)
8. Run OfflineMetaRepair as follows,
{noformat}
        hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair -fix
{noformat}
9. Restart HMaster and RegionServer
10. HMaster never finishes its initialization and always aborts with the 
message below,
{code}
2017-06-06 15:11:07,582 FATAL [Hostname:16000.activeMasterManager] master.HMaster: Unhandled exception. Starting shutdown.
java.io.IOException: Timedout 120000ms waiting for namespace table to be assigned
        at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:98)
        at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1054)
        at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:848)
        at org.apache.hadoop.hbase.master.HMaster.access$600(HMaster.java:199)
        at org.apache.hadoop.hbase.master.HMaster$2.run(HMaster.java:1871)
        at java.lang.Thread.run(Thread.java:745)
{code}

Root cause
==========
1. During HMaster startup, the AssignmentManager (AM) assumes a failover 
scenario because old WAL files exist, so ServerShutdownHandler (SSH) / 
ServerCrashProcedure (SCP) splits the WAL files and assigns the regions the 
dead server was holding.
2. SSH/SCP retrieves the regions held by that server from meta / the AM's 
in-memory state, but meta only has the "regioninfo" entry for each region (it 
was just rebuilt by OfflineMetaRepair). So an empty region list is returned and 
no assignment is triggered.
3. HMaster, which is waiting for the namespace table to be assigned, times out 
and always aborts.
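
To make step 2 of the root cause concrete, here is a toy model of the lookup. 
It is only a sketch and not HBase code: the class MetaLookupSketch, the method 
regionsOnServer and the server name are hypothetical, and a meta row is reduced 
to "region name -> hosting server", with null standing for a row that has only 
the "regioninfo" entry.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the per-server region lookup; none of these names exist in HBase. */
final class MetaLookupSketch {

    /**
     * Returns the regions that the simplified meta view records as hosted on deadServer.
     * A null value models a row that carries "regioninfo" but no server location.
     */
    public static List<String> regionsOnServer(Map<String, String> meta, String deadServer) {
        List<String> regions = new ArrayList<>();
        for (Map.Entry<String, String> row : meta.entrySet()) {
            // equals(null) is false, so rows without a location never match.
            if (deadServer.equals(row.getValue())) {
                regions.add(row.getKey());
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        String deadServer = "rs1,16020,1496732000000"; // hypothetical server name

        // Meta as it looks after a normal crash: locations still point at the dead server.
        Map<String, String> healthyMeta = new LinkedHashMap<>();
        healthyMeta.put("hbase:namespace", deadServer);
        healthyMeta.put("ns1:t1", deadServer);

        // Meta after the offline rebuild in this model: region rows exist, locations do not.
        Map<String, String> rebuiltMeta = new LinkedHashMap<>();
        rebuiltMeta.put("hbase:namespace", null);
        rebuiltMeta.put("ns1:t1", null);

        // prints: healthy meta -> reassign [hbase:namespace, ns1:t1]
        System.out.println("healthy meta -> reassign " + regionsOnServer(healthyMeta, deadServer));
        // prints: rebuilt meta -> reassign []
        System.out.println("rebuilt meta -> reassign " + regionsOnServer(rebuiltMeta, deadServer));
    }
}
```

With the rebuilt meta the lookup returns an empty list, so in this model nothing 
is handed over for assignment, which matches the namespace region never being 
assigned in step 3.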




