[
https://issues.apache.org/jira/browse/HBASE-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299221#comment-16299221
]
Yi Liang commented on HBASE-19218:
----------------------------------
I have seen this error in my cluster as well
{quote}
2017-11-08 00:25:25,260 INFO [ProcExecWrkr-18]
procedure.MasterProcedureScheduler: pid=40, ppid=37,
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace,
region=f147f204a579b885c351bdc0a7ebbf94 hbase:namespace
hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94.
2017-11-08 00:25:25,263 INFO [ProcExecWrkr-18] assignment.AssignProcedure:
Start pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OFFLINE,
location=node_5,16020,1510084579728; forceNewPlan=false, retain=true
.....
2017-11-08 00:25:26,040 INFO [ProcExecWrkr-23] procedure.ServerCrashProcedure:
pid=42 found RIT pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94;
rit=OPENING, location=node_5,16020,1510100695534
2017-11-08 00:25:26,040 WARN [ProcExecWrkr-23]
assignment.RegionTransitionProcedure: Remote call failed pid=40, ppid=37,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure
table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OPENING,
location=node_5,16020,1510100695534; exception=ServerCrashProcedure pid=42,
server=node_5,16020,1510100695534
2017-11-08 00:25:26,041 INFO [ProcExecWrkr-23] assignment.AssignProcedure:
Retry=1 of max=10; pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94;
rit=OPENING, location=node_5,16020,1510100695534
2017-11-08 00:25:26,193 INFO [ProcExecWrkr-25] zookeeper.MetaTableLocator:
Setting hbase:meta (replicaId=0) location in ZooKeeper as
node_2,16020,1510100696388
2017-11-08 00:25:26,195 INFO [ProcExecWrkr-25]
assignment.RegionTransitionProcedure: Dispatch pid=44, ppid=43,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta,
region=1588230740; rit=OPENING, location=node_2,16020,1510100696388
2017-11-08 00:25:26,346 INFO [ProcedureDispatcherTimeoutThread]
procedure.RSProcedureDispatcher: Using procedure batch rpc execution for
serverName=node_2,16020,1510100696388 version=2097152
2017-11-08 00:25:27,187 INFO [ProcExecWrkr-4] hbase.MetaTableAccessor: Updated
table hbase:meta state to ENABLED in META
2017-11-08 00:25:27,187 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator:
Setting hbase:meta (replicaId=0) location in ZooKeeper as
node_2,16020,1510100696388
2017-11-08 00:25:27,209 INFO [ProcExecWrkr-22]
assignment.RegionTransitionProcedure: Dispatch pid=40, ppid=37,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure
table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OFFLINE,
location=null
2017-11-08 00:25:27,210 INFO [ProcExecWrkr-21]
assignment.RegionTransitionProcedure: Dispatch pid=39, ppid=36,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:acl,
region=24aadcb52fdc43e2ebcffe95d39b43ab; rit=OPENING,
location=node_2,16020,1510100696388
2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor:
CODE-BUG: Uncaught runtime exception: pid=40, ppid=37,
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace,
region=f147f204a579b885c351bdc0a7ebbf94
java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
{quote}
I track the AssignProcedure of hbase:namespace to above log
1. AssignProcedure created for table=hbase:namespace; and state is
RUNNABLE:REGION_TRANSITION_QUEUE
2. ServerCrashProcedure happens and will handle AssignProcedure of
table=hbase:namespace,
and from the log, this AssignProcedure state has already been set to
REGION_TRANSITION_DISPATCH when SCP handle it.
The SCP will set the AssignProcedure back to REGION_TRANSITION_QUEUE and
offline related region.
3. However, The AssignProcedure of hbase:namespace resume from state
REGION_TRANSITION_DISPATCH,
and SCP in step2 has already offline the region and set location as null, so
null pointer exception may happen
The problem happens at step 3 above, the AssignProcedure should resume from
state REGION_TRANSITION_QUEUE, but it actually from REGION_TRANSITION_DISPATCH.
This could happen, since when SCP call remoteCallFailed for AssignProcedure of
hbase:namespace,
ProcedureExecutor is running AssignProcedure of hbase:namespace as state
REGION_TRANSITION_DISPATCH at same time.
and if SCP's handleFailure(which set region location as null) for
hbase:namespace happens before AssignProcedure#addToRemoteDispatcher, and then
null pointer happens
And if we can catch the nullpointerexception and set state back to
REGION_TRANSITION_QUEUE, that may work for this problem
> Master stuck thinking hbase:namespace is assigned after restart preventing
> intialization
> ----------------------------------------------------------------------------------------
>
> Key: HBASE-19218
> URL: https://issues.apache.org/jira/browse/HBASE-19218
> Project: HBase
> Issue Type: Bug
> Reporter: Josh Elser
> Assignee: Umesh Agashe
> Priority: Critical
> Fix For: 2.0.0-beta-1
>
> Attachments:
> hbase-hbase-master-ctr-e134-1499953498516-282290-01-000003.hwx.site.log.zip,
> hbase-site.xml
>
>
> Our [~romil.choksi] brought this one to my attention after trying to get some
> cluster tests running.
> The Master seems to have gotten stuck never initializing after it thinks that
> hbase:namespace was already deployed on the cluster when it actually was not.
> On a Master restart, it reads the location out of meta and assumes that it's
> there (I assume this invalid entry is the issue):
> {noformat}
> 2017-11-08 00:29:17,556 INFO
> [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
> assignment.RegionStateStore: Load hbase:meta entry region={ENCODED =>
> f147f204a579b885c351bdc0a7ebbf94, NAME =>
> 'hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94.', STARTKEY
> => '', ENDKEY => ''} regionState=OPENING
> lastHost=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728
> regionLocation=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510100695534
> {noformat}
> Prior to this, the RS5 went through the ServerCrashProcedure, but it looks
> like this bailed out unexpectedly:
> {noformat}
> 2017-11-08 00:25:25,187 WARN
> [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
> master.ServerManager: Expiration of
> ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728 but
> server not online
> 2017-11-08 00:25:25,187 INFO [ProcExecWrkr-5]
> procedure.ServerCrashProcedure: Start pid=36,
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
> server=ctr-e134-1499953498516-282290-01-000003.hwx.site,16020,1510084580111,
> splitWal=t
> rue, meta=false
> 2017-11-08 00:25:25,188 INFO
> [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
> master.ServerManager: Processing expiration of
> ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728 on
> ctr-e134-1499953498516-28
> 2290-01-000003.hwx.site,20000,1510100690324
> ...
> 2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor:
> CODE-BUG: Uncaught runtime exception: pid=40, ppid=37,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94
> java.lang.NullPointerException
> at
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
> at
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:171)
> at
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:223)
> at
> org.apache.hadoop.hbase.master.assignment.AssignProcedure.updateTransition(AssignProcedure.java:252)
> at
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:309)
> at
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:82)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1452)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,239 FATAL [ProcExecWrkr-22] procedure2.ProcedureExecutor:
> CODE-BUG: Uncaught runtime exception for pid=37,
> state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException
> via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=hbase:namespace,
> region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
> ServerCrashProcedure
> server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
> splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,344 FATAL [ProcExecWrkr-8] procedure2.ProcedureExecutor:
> CODE-BUG: Uncaught runtime exception for pid=37,
> state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException
> via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=hbase:namespace,
> region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
> ServerCrashProcedure
> server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
> splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,356 INFO [ProcExecWrkr-5] procedure2.ProcedureExecutor:
> Rolled back pid=37, state=ROLLEDBACK,
> exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime
> exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE;
> AssignProcedure table=hbase:namespace,
> region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
> ServerCrashProcedure
> server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
> splitWal=true, meta=false exec-time=2.1650sec
> {noformat}
> Shortly after this, the master was restarted.
> My hunch is that because the SCP crashed, we never invalidated the meta
> entries and got us stuck thinking the region was assigned when it wasn't?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)