[
https://issues.apache.org/jira/browse/HBASE-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299336#comment-16299336
]
stack commented on HBASE-19218:
-------------------------------
[~easyliangjob] Good. We seem to be of same opinion. The fix here will cause us
fail the current request because targetServer is null so we'll go back around
again. It is good by you sir?
> Master stuck thinking hbase:namespace is assigned after restart preventing
> intialization
> ----------------------------------------------------------------------------------------
>
> Key: HBASE-19218
> URL: https://issues.apache.org/jira/browse/HBASE-19218
> Project: HBase
> Issue Type: Bug
> Reporter: Josh Elser
> Assignee: stack
> Priority: Critical
> Fix For: 2.0.0-beta-1
>
> Attachments: HBASE-19218.master.001.patch,
> hbase-hbase-master-ctr-e134-1499953498516-282290-01-000003.hwx.site.log.zip,
> hbase-site.xml
>
>
> Our [~romil.choksi] brought this one to my attention after trying to get some
> cluster tests running.
> The Master seems to have gotten stuck never initializing after it thinks that
> hbase:namespace was already deployed on the cluster when it actually was not.
> On a Master restart, it reads the location out of meta and assumes that it's
> there (I assume this invalid entry is the issue):
> {noformat}
> 2017-11-08 00:29:17,556 INFO
> [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
> assignment.RegionStateStore: Load hbase:meta entry region={ENCODED =>
> f147f204a579b885c351bdc0a7ebbf94, NAME =>
> 'hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94.', STARTKEY
> => '', ENDKEY => ''} regionState=OPENING
> lastHost=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728
> regionLocation=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510100695534
> {noformat}
> Prior to this, the RS5 went through the ServerCrashProcedure, but it looks
> like this bailed out unexpectedly:
> {noformat}
> 2017-11-08 00:25:25,187 WARN
> [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
> master.ServerManager: Expiration of
> ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728 but
> server not online
> 2017-11-08 00:25:25,187 INFO [ProcExecWrkr-5]
> procedure.ServerCrashProcedure: Start pid=36,
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
> server=ctr-e134-1499953498516-282290-01-000003.hwx.site,16020,1510084580111,
> splitWal=t
> rue, meta=false
> 2017-11-08 00:25:25,188 INFO
> [ctr-e134-1499953498516-282290-01-000003:20000.masterManager]
> master.ServerManager: Processing expiration of
> ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728 on
> ctr-e134-1499953498516-28
> 2290-01-000003.hwx.site,20000,1510100690324
> ...
> 2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor:
> CODE-BUG: Uncaught runtime exception: pid=40, ppid=37,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94
> java.lang.NullPointerException
> at
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
> at
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:171)
> at
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:223)
> at
> org.apache.hadoop.hbase.master.assignment.AssignProcedure.updateTransition(AssignProcedure.java:252)
> at
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:309)
> at
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:82)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1452)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,239 FATAL [ProcExecWrkr-22] procedure2.ProcedureExecutor:
> CODE-BUG: Uncaught runtime exception for pid=37,
> state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException
> via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=hbase:namespace,
> region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
> ServerCrashProcedure
> server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
> splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,344 FATAL [ProcExecWrkr-8] procedure2.ProcedureExecutor:
> CODE-BUG: Uncaught runtime exception for pid=37,
> state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException
> via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=hbase:namespace,
> region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
> ServerCrashProcedure
> server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
> splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,356 INFO [ProcExecWrkr-5] procedure2.ProcedureExecutor:
> Rolled back pid=37, state=ROLLEDBACK,
> exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime
> exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE;
> AssignProcedure table=hbase:namespace,
> region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException;
> ServerCrashProcedure
> server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728,
> splitWal=true, meta=false exec-time=2.1650sec
> {noformat}
> Shortly after this, the master was restarted.
> My hunch is that because the SCP crashed, we never invalidated the meta
> entries and got us stuck thinking the region was assigned when it wasn't?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)