[
https://issues.apache.org/jira/browse/HBASE-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16254367#comment-16254367
]
Josh Elser commented on HBASE-19218:
------------------------------------
Got confirmation from our Romil that this does only happen with RSGroups (not
under normal circumstances) and happened (essentially) on top of
df98d6848f2848579c1893a42132dee3cc5d907d.
Wouldn't be surprised if this is something to do with RSGroups instead of a
generic problem.
> Master stuck thinking hbase:namespace is assigned after restart preventing
> intialization
> ----------------------------------------------------------------------------------------
>
> Key: HBASE-19218
> URL: https://issues.apache.org/jira/browse/HBASE-19218
> Project: HBase
> Issue Type: Bug
> Reporter: Josh Elser
> Priority: Critical
> Fix For: 2.0.0-beta-1
>
> Attachments:
> hbase-hbase-master-ctr-e134-1499953498516-282290-01-000003.hwx.site.log.zip,
> hbase-site.xml
>
>
> Our [~romil.choksi] brought this one to my attention after trying to get some
> cluster tests running.
> The Master seems to have gotten stuck never initializing after it thinks that
> hbase:namespace was already deployed on the cluster when it actually was not.
> On a Master restart, it reads the location out of meta and assumes that it's
> there (I assume this invalid entry is the issue):
> {noformat}
> 2017-11-08 00:29:17,556 INFO
> [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
> assignment.RegionStateStore: Load hbase:meta entry region={ENCODED =>
> f147f204a579b885c351bdc0a7ebbf94, NAME =>
> 'hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94.', STARTKEY
> => '', ENDKEY => ''} regionState=OPENING
> lastHost=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728
> regionLocation=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510100695534
> {noformat}
> Prior to this, the RS5 went through the ServerCrashProcedure, but it looks
> like this bailed out unexpectedly:
> {noformat}
> 2017-11-08 00:25:25,187 WARN
> [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
> master.ServerManager: Expiration of
> ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728 but
> server not online
> 2017-11-08 00:25:25,187 INFO [ProcExecWrkr-5]
> procedure.ServerCrashProcedure: Start pid=36,
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
> server=ctr-e134-1499953498516-282290-01-000003.hwx.site,16020,1510084580111,
> splitWal=t
> rue, meta=false
> 2017-11-08 00:25:25,188 INFO
> [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
> master.ServerManager: Processing expiration of
> ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728 on
> ctr-e134-1499953498516-28
> 2290-01-000003.hwx.site,20000,1510100690324
> ...
> 2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor:
> CODE-BUG: Uncaught runtime exception: pid=40, ppid=37,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94
> java.lang.NullPointerException
> at
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
> at
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:171)
> at
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:223)
> at
> org.apache.hadoop.hbase.master.assignment.AssignProcedure.updateTransition(AssignProcedure.java:252)
> at
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:309)
> at
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:82)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1452)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,239 FATAL [ProcExecWrkr-22] procedure2.ProcedureExecutor:
> CODE-BUG: Uncaught runtime exception for pid=37,
> state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException
> via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=hbase:namespace,
> region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
> ServerCrashProcedure
> server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
> splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,344 FATAL [ProcExecWrkr-8] procedure2.ProcedureExecutor:
> CODE-BUG: Uncaught runtime exception for pid=37,
> state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException
> via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=hbase:namespace,
> region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
> ServerCrashProcedure
> server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
> splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,356 INFO [ProcExecWrkr-5] procedure2.ProcedureExecutor:
> Rolled back pid=37, state=ROLLEDBACK,
> exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime
> exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE;
> AssignProcedure table=hbase:namespace,
> region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
> ServerCrashProcedure
> server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
> splitWal=true, meta=false exec-time=2.1650sec
> {noformat}
> Shortly after this, the master was restarted.
> My hunch is that because the SCP crashed, we never invalidated the meta
> entries and got us stuck thinking the region was assigned when it wasn't?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)