Ke Han created HBASE-28109:
------------------------------

             Summary: NPE for the region state: Failed to become active master (HMaster)
                 Key: HBASE-28109
                 URL: https://issues.apache.org/jira/browse/HBASE-28109
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.4.17
            Reporter: Ke Han
         Attachments: hbase--master-ee4a85363fe2.log

When starting up an HBase cluster (2.4.17), I hit an NPE that prevents the
HMaster from starting up. I have to restart the HMaster.

My cluster contains 1 HMaster, 2 RS (HBase-2.4.17) and 1 Hadoop node (2.10.2).

 
{code:java}
2023-09-18 14:17:35,931 INFO  [PEWorker-1] procedure2.ProcedureExecutor: Rolled back pid=1, state=ROLLEDBACK, exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: Operation timed out after 1.0010 sec; InitMetaProcedure table=hbase:meta exec-time=1.4660 sec
2023-09-18 14:17:35,931 INFO  [master/hmaster:16000:becomeActiveMaster] master.HMaster: Wait for region servers to report in: status=null, state=RUNNING, startTime=1695046655931, completionTime=-1
2023-09-18 14:17:35,932 INFO  [master/hmaster:16000:becomeActiveMaster] master.ServerManager: Waiting on regionserver count=2; waited=0ms, expecting min=1 server(s), max=NO_LIMIT server(s), timeout=4500ms, lastChange=0ms
2023-09-18 14:17:37,438 INFO  [master/hmaster:16000:becomeActiveMaster] master.ServerManager: Waiting on regionserver count=2; waited=1505ms, expecting min=1 server(s), max=NO_LIMIT server(s), timeout=4500ms, lastChange=1505ms
2023-09-18 14:17:38,941 INFO  [master/hmaster:16000:becomeActiveMaster] master.ServerManager: Waiting on regionserver count=2; waited=3009ms, expecting min=1 server(s), max=NO_LIMIT server(s), timeout=4500ms, lastChange=3009ms
2023-09-18 14:17:40,445 INFO  [master/hmaster:16000:becomeActiveMaster] master.ServerManager: Finished waiting on RegionServer count=2; waited=4513ms, expected min=1 server(s), max=NO_LIMIT server(s), master is running
2023-09-18 14:17:40,452 ERROR [master/hmaster:16000:becomeActiveMaster] master.HMaster: Failed to become active master
java.lang.NullPointerException
        at org.apache.hadoop.hbase.master.HMaster.isRegionOnline(HMaster.java:1229)
        at org.apache.hadoop.hbase.master.HMaster.waitForMetaOnline(HMaster.java:1218)
        at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:968)
        at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2193)
        at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:528)
        at java.lang.Thread.run(Thread.java:750)
2023-09-18 14:17:40,453 ERROR [master/hmaster:16000:becomeActiveMaster] master.HMaster: Master server abort: loaded coprocessors are: [org.apache.hadoop.hbase.quotas.MasterQuotasObserver]
{code}
 
h1. Root Cause

From the stack trace, the rs variable is null and is dereferenced without a
null check.

 
{code:java}
// hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java

  /**
   * @return True if region is online and scannable else false if an error or shutdown (Otherwise we
   *         just block in here holding up all forward-progess).
   */
  private boolean isRegionOnline(RegionInfo ri) {
    RetryCounter rc = null;
    while (!isStopped()) {
      // NPE line
      RegionState rs = this.assignmentManager.getRegionStates().getRegionState(ri);
      if (rs.isOpened()) {
        if (this.getServerManager().isServerOnline(rs.getServerName())) {
          return true;
        }
      }
      // Region{code}
 

I am not sure what causes rs to be null, but we could add a check so that this
NPE is caught and handled properly.

Restarting the HMaster makes the exception disappear. I have attached the full
HMaster log for this case. I ran into this exception on HBase 2.4.17, but I
think it might also happen on the latest branch, since the code of
isRegionOnline is the same there.
h1. Fix

This bug happens rarely. I think we can add a simple check for whether rs is
null and then decide whether to keep waiting or to shut down the HMaster
directly.

I assume that if the HMaster waits longer, it will get correct responses from
the regionservers.

I have a simple PR to fix it.
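
To illustrate the "keep waiting" option, here is a minimal, self-contained sketch of a null-tolerant wait loop. It is not the actual patch and does not use HBase classes: RegionState, RegionStates, the onlineServers set, and the maxAttempts/sleepMillis parameters are hypothetical stand-ins, assumed only for this example. A missing region state is treated as "not online yet" and retried instead of being dereferenced.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified stand-in for the wait loop; none of these types are the real HBase classes. */
class NullSafeRegionWait {

  /** Hypothetical, minimal substitute for HBase's RegionState. */
  static final class RegionState {
    private final boolean opened;
    private final String serverName;

    RegionState(boolean opened, String serverName) {
      this.opened = opened;
      this.serverName = serverName;
    }

    boolean isOpened() { return opened; }
    String getServerName() { return serverName; }
  }

  /** Hypothetical substitute for RegionStates: returns null for a region it does not know yet. */
  static final class RegionStates {
    private final Map<String, RegionState> states = new ConcurrentHashMap<>();

    RegionState getRegionState(String regionName) { return states.get(regionName); }
    void put(String regionName, RegionState state) { states.put(regionName, state); }
  }

  /**
   * Returns true once the region is opened on an online server, or false after
   * maxAttempts polls (or if the thread is interrupted). A null state is retried.
   */
  static boolean isRegionOnline(RegionStates regionStates, Set<String> onlineServers,
      String regionName, int maxAttempts, long sleepMillis) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      RegionState rs = regionStates.getRegionState(regionName);
      // The null check comes first, so a missing state never reaches rs.isOpened().
      if (rs != null && rs.isOpened() && onlineServers.contains(rs.getServerName())) {
        return true;
      }
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    RegionStates states = new RegionStates();
    Set<String> online = new HashSet<>();
    online.add("rs1");

    // No state registered yet: the loop retries and gives up without throwing.
    System.out.println(isRegionOnline(states, online, "hbase:meta", 3, 10L));

    // Once a state exists and its server is online, the check succeeds.
    states.put("hbase:meta", new RegionState(true, "rs1"));
    System.out.println(isRegionOnline(states, online, "hbase:meta", 3, 10L));
  }
}
```

Whether the real fix should keep retrying or abort after some bound is the open design question from above; the bounded loop here is just one possible choice.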



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
