[
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275129#comment-16275129
]
Yi Liang commented on HBASE-19287:
----------------------------------
[~stack] I spent some time digging into the code. The AssignProcedure
workflow is as follows:
{quote}
1. The Master sends an assign request to the target RegionServer. The active
AssignProcedure is then removed from the Procedure Scheduler (a queue that
stores all active procedures) and suspended.
2. Once the target server receives the request and opens the region, it sends
a response to the Master.
3. Once the Master receives the response, it wakes the procedure and puts the
AssignProcedure back into the Procedure Scheduler. Worker threads in the
ProcedureExecutor then poll this AssignProcedure and run the remaining steps.
{quote}
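To make the three steps concrete, here is a toy model of the suspend/wake flow. It is only a sketch and not HBase code: the class SuspendWakeSketch and all of its methods are hypothetical names made up for illustration, and the run queue merely stands in for the Procedure Scheduler.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of the assign workflow; every name here is hypothetical.
final class SuspendWakeSketch {
    // Stands in for the Procedure Scheduler: procedures that workers may poll.
    private final Queue<Long> runQueue = new ArrayDeque<>();
    // Suspended procedures, keyed by pid, with the server they are waiting on.
    private final Map<Long, String> suspended = new HashMap<>();

    void submit(long pid) {
        runQueue.add(pid);
    }

    // Step 1: a worker polls the procedure, the request goes to the target
    // server, and the procedure leaves the run queue in a suspended state.
    boolean dispatch(String targetServer) {
        Long pid = runQueue.poll();
        if (pid == null) {
            return false;
        }
        suspended.put(pid, targetServer);
        return true;
    }

    // Step 3: the response from the target server is the only call that
    // puts the procedure back into the run queue.
    boolean onRegionOpened(long pid) {
        if (suspended.remove(pid) == null) {
            return false;
        }
        runQueue.add(pid);
        return true;
    }

    boolean isRunnable(long pid) {
        return runQueue.contains(pid);
    }

    boolean isSuspended(long pid) {
        return suspended.containsKey(pid);
    }
}
```

In this model, if onRegionOpened is never called for a pid, nothing else moves it out of the suspended map, so no worker will ever poll it again.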
The problem happens at step 3: if the master does not receive the response
from the target server for any reason, the assign procedure becomes a dead
procedure, and no other mechanism will wake it (i.e. put it back into the
procedure scheduler). (I do not know why we need to remove the procedure from
the procedure scheduler in step 1; maybe we can just mark it as suspended and
yield it?)
The point is that the suspended procedure can only be woken by the response
from the target server; no other mechanism can wake it. ServerCrashProcedure
may wake it, but if the target server has not crashed and the master simply
cannot receive the response for other reasons, such as a network issue, the
problem still happens. SCP also does not work if the master is not up.
So this is a general problem, not only for meta but also for other normal
regions.
We therefore need to come up with a way to wake those suspended procedures.
My suggestion is to have a separate thread that periodically checks all
suspended procedures; if one has timed out or its target server has crashed,
we can reassign.
(1) A crashed target server will only leave meta's assign suspended, since the
master is not up yet; other regions can be woken by ServerCrashProcedure.
(2) A timeout mechanism for all suspended procedures. If a procedure has been
suspended for too long, we mark it as timed out and redo the remaining steps.
We can do (1) first. As for (2), we don't have a timeout for procedures yet,
so I am not sure how to fix it properly.
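A rough sketch of what one pass of such a periodic check could look like, covering both (1) and (2). This is not based on the existing procedure framework: SuspendedProcedureWatchdog, Suspension, onSuspend, onWake, and sweep are all hypothetical names, and the caller is assumed to supply the current time and the set of servers known to be dead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical bookkeeping for suspended procedures; not an HBase class.
final class SuspendedProcedureWatchdog {
    // What we remember about one suspended procedure.
    static final class Suspension {
        final String targetServer;
        final long suspendedAtMs;

        Suspension(String targetServer, long suspendedAtMs) {
            this.targetServer = targetServer;
            this.suspendedAtMs = suspendedAtMs;
        }
    }

    private final Map<Long, Suspension> suspended = new ConcurrentHashMap<>();
    private final long timeoutMs;

    SuspendedProcedureWatchdog(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called when a procedure is suspended after its request is dispatched.
    void onSuspend(long pid, String targetServer, long nowMs) {
        suspended.put(pid, new Suspension(targetServer, nowMs));
    }

    // Called when the normal response arrives and wakes the procedure.
    void onWake(long pid) {
        suspended.remove(pid);
    }

    // One pass of the periodic check. Returns, in ascending order, the pids
    // whose target server is dead (1) or that have waited too long (2), and
    // stops tracking them so that each is reported only once.
    List<Long> sweep(long nowMs, Set<String> deadServers) {
        List<Long> toReassign = new ArrayList<>();
        Iterator<Map.Entry<Long, Suspension>> it = suspended.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Suspension> e = it.next();
            Suspension s = e.getValue();
            boolean serverDead = deadServers.contains(s.targetServer);
            boolean timedOut = nowMs - s.suspendedAtMs >= timeoutMs;
            if (serverDead || timedOut) {
                toReassign.add(e.getKey());
                it.remove();
            }
        }
        Collections.sort(toReassign);
        return toReassign;
    }
}
```

The separate thread could then call sweep at a fixed interval, for example from a ScheduledExecutorService, and put each returned pid back into the scheduler for reassignment. Passing the clock in as a parameter keeps the check easy to test.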
> master hangs forever if RecoverMeta send assign meta region request to target
> server fail
> -----------------------------------------------------------------------------------------
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Reporter: Yi Liang
> Assignee: Yi Liang
>
> 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1]
> procedure.RecoverMetaProcedure: pid=138,
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure
> failedMetaServer=null, splitWal=true; Retaining meta assignment to
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor:
> Initialized subprocedures=[{pid=139, ppid=138,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2]
> procedure.MasterProcedureScheduler: pid=139, ppid=138,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure:
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE;
> AssignProcedure table=hbase:meta, region=1588230740,
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE,
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false,
> retain=false
> 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator:
> Setting hbase:meta (replicaId=0) location in ZooKeeper as
> hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO [ProcExecWrkr-4]
> assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta,
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454;
> rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread]
> procedure.RSProcedureDispatcher: Using procedure batch rpc execution for
> serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
> 2017-11-10 19:26:57,542 INFO [main-EventThread]
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted,
> processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO [main-EventThread] master.ServerManager: Master
> doesn't enable ServerShutdownHandler during initialization, delay expiring
> server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
> master.ServerManager: Registering
> server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
> master.ServerManager: Registering
> server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
> master.ServerManager: Triggering server recovery; existingServer
> hadoop-slave2.hadoop,16020,1510341988652 looks stale, new
> server:hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
> master.ServerManager: Master doesn't enable ServerShutdownHandler during
> initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:27:49,815 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
> client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false,
> msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not
> online on hadoop-slave2.hadoop,16020,1510342023184
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401)
> at
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258)
> row 'hbase:namespace' on table 'hbase:meta' at
> region=hbase:meta,,1.1588230740,
> hostname=hadoop-slave2.hadoop,16020,1510341988652, seqNum=0