[jira] [Commented] (HBASE-28109) NPE for the region state: Failed to become active master (HMaster)

Hudson (Jira) Sat, 07 Oct 2023 21:40:06 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-28109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772923#comment-17772923
 ]


Hudson commented on HBASE-28109:
--------------------------------

Results for branch branch-2.4
        [build #633 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/633/]:
 (/) *{color:green}+1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/633/General_20Nightly_20Build_20Report/]


(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/633/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/633/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/633/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> NPE for the region state: Failed to become active master (HMaster)
> ------------------------------------------------------------------
>
>                 Key: HBASE-28109
>                 URL: https://issues.apache.org/jira/browse/HBASE-28109
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.4.17
>            Reporter: Ke Han
>            Assignee: Ke Han
>            Priority: Major
>             Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>         Attachments: hbase--master-ee4a85363fe2.log
>
>
> When starting up an HBase cluster (2.4.17), I hit an NPE that prevents the
> HMaster from starting up. I have to restart the HMaster.
> My cluster contains 1 HMaster, 2 RegionServers (HBase 2.4.17), and 1 Hadoop
> node (2.10.2).
> {code:java}
> 2023-09-18 14:17:35,931 INFO  [PEWorker-1] procedure2.ProcedureExecutor: 
> Rolled back pid=1, state=ROLLEDBACK, 
> exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via 
> ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: 
> Operation timed out after 1.0010 sec; InitMetaProcedure table=hbase:meta 
> exec-time=1.4660 sec
> 2023-09-18 14:17:35,931 INFO  [master/hmaster:16000:becomeActiveMaster] 
> master.HMaster: Wait for region servers to report in: status=null, 
> state=RUNNING, startTime=1695046655931, completionTime=-1
> 2023-09-18 14:17:35,932 INFO  [master/hmaster:16000:becomeActiveMaster] 
> master.ServerManager: Waiting on regionserver count=2; waited=0ms, expecting 
> min=1 server(s), max=NO_LIMIT server(s), timeout=4500ms, lastChange=0ms
> 2023-09-18 14:17:37,438 INFO  [master/hmaster:16000:becomeActiveMaster] 
> master.ServerManager: Waiting on regionserver count=2; waited=1505ms, 
> expecting min=1 server(s), max=NO_LIMIT server(s), timeout=4500ms, 
> lastChange=1505ms
> 2023-09-18 14:17:38,941 INFO  [master/hmaster:16000:becomeActiveMaster] 
> master.ServerManager: Waiting on regionserver count=2; waited=3009ms, 
> expecting min=1 server(s), max=NO_LIMIT server(s), timeout=4500ms, 
> lastChange=3009ms
> 2023-09-18 14:17:40,445 INFO  [master/hmaster:16000:becomeActiveMaster] 
> master.ServerManager: Finished waiting on RegionServer count=2; 
> waited=4513ms, expected min=1 server(s), max=NO_LIMIT server(s), master is 
> running
> 2023-09-18 14:17:40,452 ERROR [master/hmaster:16000:becomeActiveMaster] 
> master.HMaster: Failed to become active master
> java.lang.NullPointerException
>         at org.apache.hadoop.hbase.master.HMaster.isRegionOnline(HMaster.java:1229)
>         at org.apache.hadoop.hbase.master.HMaster.waitForMetaOnline(HMaster.java:1218)
>         at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:968)
>         at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2193)
>         at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:528)
>         at java.lang.Thread.run(Thread.java:750)
> 2023-09-18 14:17:40,453 ERROR [master/hmaster:16000:becomeActiveMaster] 
> master.HMaster: Master server abort: loaded coprocessors are: 
> [org.apache.hadoop.hbase.quotas.MasterQuotasObserver] {code}
> h1. Root Cause
> The stack trace shows that the rs variable is null and is dereferenced
> without a null check.
> {code:java}
> // hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
>   /**
>    * @return True if region is online and scannable, else false if an error or shutdown
>    *         (otherwise we just block in here holding up all forward progress).
>    */
>   private boolean isRegionOnline(RegionInfo ri) {
>     RetryCounter rc = null;
>     while (!isStopped()) {
>       // NPE line
>       RegionState rs = this.assignmentManager.getRegionStates().getRegionState(ri);
>       if (rs.isOpened()) {
>         if (this.getServerManager().isServerOnline(rs.getServerName())) {
>           return true;
>         }
>       }
>       // Region{code}
> I am not sure what causes *rs* to be null, but we could add a null check so
> that this case is handled properly instead of throwing an NPE.
> Restarting the HMaster makes the exception disappear. I have attached the
> full HMaster log for this case. I ran into this exception on HBase 2.4.17,
> but I think it may also happen on the latest branch since the isRegionOnline
> code is the same.
> h1. Fix
> This bug happens rarely. I think we can add a simple check for whether rs
> is null and then decide whether to keep waiting or shut down the HMaster
> directly.
> I assume that if the HMaster waits longer, it will get correct responses
> from the RegionServers.
> I have a simple PR to fix it.
> https://github.com/apache/hbase/pull/5432
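
The null check proposed in the Fix section above could look roughly like the following. This is a minimal, self-contained sketch and not the patch from the linked PR: `RegionOnlineSketch`, `State`, the map-based lookup, and the `maxAttempts` and `sleepMillis` parameters are hypothetical stand-ins for HBase's `RegionState`, `RegionStates`, and `RetryCounter`, chosen only so the example runs on its own.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

// Hypothetical stand-ins for HBase types; this is not the real HBase API.
final class RegionOnlineSketch {

  // Simplified stand-in for RegionState: an "opened" flag plus the hosting server.
  static final class State {
    final boolean opened;
    final String serverName;

    State(boolean opened, String serverName) {
      this.opened = opened;
      this.serverName = serverName;
    }
  }

  // Polls until the region is opened on a live server. A missing (null) state is
  // treated as "not ready yet" rather than dereferenced, so no NPE is thrown.
  // Returns false if stopped, interrupted, or out of attempts.
  static boolean isRegionOnline(Map<String, State> regionStates, Set<String> onlineServers,
      String region, int maxAttempts, long sleepMillis, BooleanSupplier stopped) {
    for (int attempt = 0; attempt < maxAttempts && !stopped.getAsBoolean(); attempt++) {
      State rs = regionStates.get(region); // null when no state is recorded yet
      if (rs != null && rs.opened && onlineServers.contains(rs.serverName)) {
        return true;
      }
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for callers
        return false;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Map<String, State> states = new ConcurrentHashMap<>();
    Set<String> servers = Collections.singleton("rs1");

    // No state recorded yet: the guarded loop gives up cleanly instead of throwing.
    System.out.println(isRegionOnline(states, servers, "hbase:meta", 3, 10L, () -> false));

    // Once the state appears and its server is online, the check succeeds.
    states.put("hbase:meta", new State(true, "rs1"));
    System.out.println(isRegionOnline(states, servers, "hbase:meta", 3, 10L, () -> false));
  }
}
```

Whether to keep waiting or abort the master once the attempts run out is the decision the description leaves open; the sketch simply returns false and lets the caller choose.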



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
