[
https://issues.apache.org/jira/browse/HBASE-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281118#comment-16281118
]
Yi Liang commented on HBASE-19287:
----------------------------------
After some investigation, I found that adding a full timeout mechanism to the
current Procedure framework will take time. I am not sure I can finish that
before the hbase2.0 release, so for now I am providing a fix that uses the
idea we discussed above:
{quote}
(2) Or at least, if we get a crash for the server we are currently trying to
assign hbase:meta to during startup, we should notice and recalibrate the
assign?
{quote}
This is a draft patch to run the unit tests against; I am still working on a
new test case for this problem.
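To make the quoted idea concrete, here is a rough sketch of "notice the crash
and recalibrate the assign". It is not the attached patch and does not use
HBase APIs: the class MetaAssignSketch and the methods chooseTarget and
recalibrate are hypothetical names, and servers are modeled as plain strings.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in HBase.
public class MetaAssignSketch {

    /** Returns the first candidate that is not known to be dead, or null if none is live. */
    static String chooseTarget(List<String> candidates, Set<String> deadServers) {
        for (String server : candidates) {
            if (!deadServers.contains(server)) {
                return server;
            }
        }
        return null; // no live server yet; the caller would have to retry later
    }

    /**
     * Keeps the current target while it is alive. If it has crashed, picks a new
     * target instead of waiting forever for a region-open report that never comes.
     */
    static String recalibrate(String currentTarget, List<String> candidates,
                              Set<String> deadServers) {
        if (currentTarget != null && !deadServers.contains(currentTarget)) {
            return currentTarget;
        }
        return chooseTarget(candidates, deadServers);
    }

    public static void main(String[] args) {
        // Server names taken from the log in the issue description.
        String stale = "hadoop-slave2.hadoop,16020,1510341988652";
        String restarted = "hadoop-slave2.hadoop,16020,1510342023184";
        List<String> known = Arrays.asList(stale, restarted);

        Set<String> dead = new HashSet<>();
        dead.add(stale); // the ephemeral node for this server was deleted

        System.out.println(recalibrate(stale, known, dead));
    }
}
```

In the real master the dead-server check would have to run even while
initialization is still in progress, which is the case the log below shows
being deferred.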
> master hangs forever if RecoverMeta send assign meta region request to target
> server fail
> -----------------------------------------------------------------------------------------
>
> Key: HBASE-19287
> URL: https://issues.apache.org/jira/browse/HBASE-19287
> Project: HBase
> Issue Type: Bug
> Reporter: Yi Liang
> Assignee: Yi Liang
>
> 2017-11-10 19:26:56,019 INFO [ProcExecWrkr-1]
> procedure.RecoverMetaProcedure: pid=138,
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure
> failedMetaServer=null, splitWal=true; Retaining meta assignment to
> server=hadoop-slave1.hadoop,16020,1510341981454
> 2017-11-10 19:26:56,029 INFO [ProcExecWrkr-1] procedure2.ProcedureExecutor:
> Initialized subprocedures=[{pid=139, ppid=138,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454}]
> 2017-11-10 19:26:56,067 INFO [ProcExecWrkr-2]
> procedure.MasterProcedureScheduler: pid=139, ppid=138,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454 hbase:meta
> hbase:meta,,1.1588230740
> 2017-11-10 19:26:56,071 INFO [ProcExecWrkr-2] assignment.AssignProcedure:
> Start pid=139, ppid=138, state=RUNNABLE:REGION_TRANSITION_QUEUE;
> AssignProcedure table=hbase:meta, region=1588230740,
> target=hadoop-slave1.hadoop,16020,1510341981454; rit=OFFLINE,
> location=hadoop-slave1.hadoop,16020,1510341981454; forceNewPlan=false,
> retain=false
> 2017-11-10 19:26:56,224 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator:
> Setting hbase:meta (replicaId=0) location in ZooKeeper as
> hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,230 INFO [ProcExecWrkr-4]
> assignment.RegionTransitionProcedure: Dispatch pid=139, ppid=138,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta,
> region=1588230740, target=hadoop-slave1.hadoop,16020,1510341981454;
> rit=OPENING, location=hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:56,382 INFO [ProcedureDispatcherTimeoutThread]
> procedure.RSProcedureDispatcher: Using procedure batch rpc execution for
> serverName=hadoop-slave2.hadoop,16020,1510341988652 version=2097152
> 2017-11-10 19:26:57,542 INFO [main-EventThread]
> zookeeper.RegionServerTracker: RegionServer ephemeral node deleted,
> processing expiration [hadoop-slave2.hadoop,16020,1510341988652]
> 2017-11-10 19:26:57,543 INFO [main-EventThread] master.ServerManager: Master
> doesn't enable ServerShutdownHandler during initialization, delay expiring
> server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:26:58,875 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
> master.ServerManager: Registering
> server=hadoop-slave1.hadoop,16020,1510342016106
> 2017-11-10 19:27:05,832 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
> master.ServerManager: Registering
> server=hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
> master.ServerManager: Triggering server recovery; existingServer
> hadoop-slave2.hadoop,16020,1510341988652 looks stale, new
> server:hadoop-slave2.hadoop,16020,1510342023184
> 2017-11-10 19:27:05,832 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
> master.ServerManager: Master doesn't enable ServerShutdownHandler during
> initialization, delay expiring server hadoop-slave2.hadoop,16020,1510341988652
> 2017-11-10 19:27:49,815 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=16000]
> client.RpcRetryingCallerImpl: tarted=38594 ms ago, cancelled=false,
> msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not
> online on hadoop-slave2.hadoop,16020,1510342023184
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3290)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1370)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2401)
> at
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:406)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:278)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:258)
> row 'hbase:namespace' on table 'hbase:meta' at
> region=hbase:meta,,1.1588230740,
> hostname=hadoop-slave2.hadoop,16020,1510341988652, seqNum=0