[ 
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275299#comment-16275299
 ] 

stack commented on HBASE-19287:
-------------------------------

Thank you for digging in here.

bq. The problem happens on step3, if the master does not receive response from 
target server for any reason; That assign procedure will become a dead 
procedure, no other mechanism will wake the procedure(i.e put it back into 
procedure scheduler) any more.

Yes. This is how it works.

If in the scheduler, it will be rescheduled and then put the back of the queue 
over and over again burning CPU, no, if it is not taken off the scheduler?

bq. So we need to come up with a idea to wake those suspend procedures.

Ok.

bq. My suggestion is that we can have a separate thread to check all those 
suspend procedures periodically, if they are timeout or their target server is 
crashed, we can do reassign.

Ok. Have to be careful though.

 * Has to be more when it times out an assign. Perhaps the RS is just taking a 
long time. We'd have to cancel any ongoing assign, confirm it cancelled, before 
we'd reassign. We want to avoid AMv1 double-assignment, the case where two 
different RS were trying to open a Region.
 * If crash, SCP will run and do clean-up?

bq. (1) The target server crashed will only suspend meta's assign since master 
is not up yet, other regions can be wake by ServerCrashProcedure. 

I don't grok the above. Please say more please.

bq. (2) Timeout mechanism for all suspend procedure. If one procedure has been 
suspended for too long, we mark it as timeout and redo the remain steps.

See above. We have to be careful here. We must avoid the mess we had in AMv1.

BTW, here where things were at at AMv2 commit time: 
https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.y71bhu8smpqp
 This timeout was something we were supposed to come back too.

Again, thank you [~easyliangjob] for digging in here in this important but dark 
corner





> master hangs forever if RecoverMeta send assign meta region request to target 
> server fail
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-19287
>                 URL: https://issues.apache.org/jira/browse/HBASE-19287
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Yi Liang
>            Assignee: Yi Liang
>
> 2017-11-10 19:26:56,019 INFO  [ProcExecWrkr-1] 
> procedure.RecoverMetaProcedure: pid=138, 
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure 
> failedMetaServer=null, splitWal=true; Retaining meta assignment to 
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO  [ProcExecWrkr-1] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO  [ProcExecWrkr-2] 
> procedure.MasterProcedureScheduler: pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta 
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO  [ProcExecWrkr-2] assignment.AssignProcedure: 
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta, region=1588230740, 
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE, 
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false, 
> retain=false
> 2017-11-10 19:26:56,224 INFO  [ProcExecWrkr-4] zookeeper.MetaTableLocator: 
> Setting hbase:meta (replicaId=0) location in ZooKeeper as 
> hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO  [ProcExecWrkr-4] 
> assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, 
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454; 
> rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO  [ProcedureDispatcherTimeoutThread] 
> procedure.RSProcedureDispatcher: Using procedure batch rpc execution for 
> serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
> 2017-11-10 19:26:57,542 INFO  [main-EventThread] 
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, 
> processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO  [main-EventThread] master.ServerManager: Master 
> doesn't enable ServerShutdownHandler during initialization, delay expiring 
> server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Registering 
> server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Triggering server recovery; existingServer 
> hadoop-slave2.hadoop,16020,1510341988652 looks stale, new 
> server:hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> master.ServerManager: Master doesn't enable ServerShutdownHandler during 
> initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:27:49,815 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000] 
> client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not 
> online on hadoop-slave2.hadoop,16020,1510342023184
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401)
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258)
>  row 'hbase:namespace' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, 
> hostname=hadoop-slave2.hadoop,16020,1510341988652, seqNum=0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to