[ 
https://issues.apache.org/jira/browse/HBASE-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299186#comment-16299186
 ] 

stack commented on HBASE-19218:
-------------------------------

I took another look at the log.

We are doing dispatch to RPC but targetServer is null. The targetServer is 
usually allocated by the time we get down to RPC dispatch but what is 
interesting here is that this is our second attempt. The first was cancelled by 
a concurrently running ServerCrashProcedure:

2017-11-08 00:25:26,040 INFO  [ProcExecWrkr-23] procedure.ServerCrashProcedure: 
pid=42 found RIT pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; 
rit=OPENING, location=node_5,16020,1510100695534

Here just after... is where we go to dispatch. See how target is null 
('location='):

2017-11-08 00:25:27,209 INFO  [ProcExecWrkr-22] 
assignment.RegionTransitionProcedure: Dispatch pid=40, ppid=37, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OFFLINE, 
location=null

which is followed a few millis later with:

2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor: 
CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, 
region=f147f204a579b885c351bdc0a7ebbf94
java.lang.NullPointerException

The nulling of the server happens... when the AssignProcedure does 
remoteCallFailed which is what SCP calls when it cancels the AP. This is done 
in one thread. Concurrently the assign was in progress on a Procedure executor 
thread.

There is a hole then between the check that we have a target.....


  protected boolean updateTransition(final MasterProcedureEnv env, final 
RegionStateNode regionNode)
  throws IOException, ProcedureSuspendedException {
   ....
    if (regionNode.getRegionLocation() == null) {
      setTransitionState(RegionTransitionState.REGION_TRANSITION_QUEUE);
      return true;
    }


... and later in same method where the dispatch is actually done....

    if (!addToRemoteDispatcher(env, regionNode.getRegionLocation())) {
...


Its a narrow hole but one nonetheless.

A check for a zero'd targetserver when we go to use it seems right thing to do. 
If null, fail the dispatch. There are other reasons to dispatch fails. This is 
new one.

Let me put up a little patch.


> Master stuck thinking hbase:namespace is assigned after restart preventing 
> intialization
> ----------------------------------------------------------------------------------------
>
>                 Key: HBASE-19218
>                 URL: https://issues.apache.org/jira/browse/HBASE-19218
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Assignee: Umesh Agashe
>            Priority: Critical
>             Fix For: 2.0.0-beta-1
>
>         Attachments: 
> hbase-hbase-master-ctr-e134-1499953498516-282290-01-000003.hwx.site.log.zip, 
> hbase-site.xml
>
>
> Our [~romil.choksi] brought this one to my attention after trying to get some 
> cluster tests running.
> The Master seems to have gotten stuck never initializing after it thinks that 
> hbase:namespace was already deployed on the cluster when it actually was not. 
> On a Master restart, it reads the location out of meta and assumes that it's 
> there (I assume this invalid entry is the issue):
> {noformat}
> 2017-11-08 00:29:17,556 INFO  
> [ctr-e134-1499953498516-282290-01-000003:20000.masterManager] 
> assignment.RegionStateStore: Load hbase:meta entry region={ENCODED => 
> f147f204a579b885c351bdc0a7ebbf94, NAME => 
> 'hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94.', STARTKEY 
> => '', ENDKEY => ''} regionState=OPENING 
> lastHost=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728 
> regionLocation=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510100695534
> {noformat}
> Prior to this, the RS5 went through the ServerCrashProcedure, but it looks 
> like this bailed out unexpectedly:
> {noformat}
> 2017-11-08 00:25:25,187 WARN  
> [ctr-e134-1499953498516-282290-01-000003:20000.masterManager] 
> master.ServerManager: Expiration of 
> ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728 but 
> server not online
> 2017-11-08 00:25:25,187 INFO  [ProcExecWrkr-5] 
> procedure.ServerCrashProcedure: Start pid=36, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=ctr-e134-1499953498516-282290-01-000003.hwx.site,16020,1510084580111, 
> splitWal=t
> rue, meta=false
> 2017-11-08 00:25:25,188 INFO  
> [ctr-e134-1499953498516-282290-01-000003:20000.masterManager] 
> master.ServerManager: Processing expiration of 
> ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728 on 
> ctr-e134-1499953498516-28
> 2290-01-000003.hwx.site,20000,1510100690324
> ...
> 2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor: 
> CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94
> java.lang.NullPointerException
>         at 
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
>         at 
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:171)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:223)
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignProcedure.updateTransition(AssignProcedure.java:252)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:309)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:82)
>         at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1452)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,239 FATAL [ProcExecWrkr-22] procedure2.ProcedureExecutor: 
> CODE-BUG: Uncaught runtime exception for pid=37, 
> state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException 
> via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=hbase:namespace, 
> region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException; 
> ServerCrashProcedure 
> server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728, 
> splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
>         at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
>         at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
>         at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
>         at 
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,344 FATAL [ProcExecWrkr-8] procedure2.ProcedureExecutor: 
> CODE-BUG: Uncaught runtime exception for pid=37, 
> state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException 
> via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=hbase:namespace, 
> region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException; 
> ServerCrashProcedure 
> server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728, 
> splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH
>         at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236)
>         at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56)
>         at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198)
>         at 
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731)
> 2017-11-08 00:25:27,356 INFO  [ProcExecWrkr-5] procedure2.ProcedureExecutor: 
> Rolled back pid=37, state=ROLLEDBACK, 
> exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime 
> exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:namespace, 
> region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException; 
> ServerCrashProcedure 
> server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728, 
> splitWal=true, meta=false exec-time=2.1650sec
> {noformat}
> Shortly after this, the master was restarted.
> My hunch is that because the SCP crashed, we never invalidated the meta 
> entries and got us stuck thinking the region was assigned when it wasn't?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to