[ 
https://issues.apache.org/jira/browse/HBASE-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762987#comment-16762987
 ] 

Bahram Chehrazy commented on HBASE-21788:
-----------------------------------------

The master finally got out of that state when the server 
"*<server1>,16020,1549450371876*" died and the meta region was moved to another 
server by a new SCP. The old OpenRegionProcedure was terminated after just over 
24 hours, while the new one finished in about 3 seconds. I suspect that server1 
was not even serving that region and was therefore ignoring the dispatcher's 
request.

2019-02-07 03:12:20,016 INFO  [RegionServerTracker-0] 
master.RegionServerTracker: RegionServer *ephemeral node deleted*, processing 
expiration [*<server1>,16020,1549450371876*]

2019-02-07 03:12:24,836 INFO  [PEWorker-2] procedure.ServerCrashProcedure: 
pid=32708, state=RUNNABLE:SERVER_CRASH_ASSIGN_META, hasLock=true; 
ServerCrashProcedure server=<server1>,16020,1549450371876, splitWal=true, 
meta=true found RIT pid=32700, ppid=32695, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
rit=OPENING, location=*<server1>,16020,1549450371876*, table=hbase:meta, 
region=1588230740

2019-02-07 03:12:25,028 INFO  [PEWorker-14] procedure2.ProcedureExecutor: 
Finished subprocedure(s) of pid=32700, ppid=32695, 
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; resume 
parent processing.

2019-02-07 03:12:25,029 INFO  [PEWorker-14] procedure2.ProcedureExecutor: 
Finished *pid=32701*, ppid=32700, state=SUCCESS, hasLock=false; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in *24hrs*, 
47.379sec

2019-02-07 03:12:25,030 INFO  [PEWorker-9] 
assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=32700, 
ppid=32695, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
hasLock=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, 
ASSIGN; rit=ABNORMALLY_CLOSED, location=null

2019-02-07 03:12:25,080 INFO  [PEWorker-9] 
assignment.TransitRegionStateProcedure: Starting pid=32700, ppid=32695, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
rit=ABNORMALLY_CLOSED, location=null; forceNewPlan=true, retain=false

2019-02-07 03:12:25,393 INFO  [PEWorker-12] zookeeper.MetaTableLocator: Setting 
hbase:meta (replicaId=0) location in ZooKeeper as 
*<server2>,16020,1549450814730*

2019-02-07 03:12:25,421 INFO [PEWorker-12] procedure2.ProcedureExecutor: 
Initialized subprocedures=[{pid=32709, ppid=32700, state=RUNNABLE, 
hasLock=false; org.apache.hadoop.hbase.master.assignment.*OpenRegionProcedure*}]

2019-02-07 03:12:28,495 INFO 
[RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16000] 
zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
ZooKeeper as *<server2>,16020,1549450814730*

2019-02-07 03:12:28,665 INFO [PEWorker-11] procedure2.ProcedureExecutor: 
Finished subprocedure(s) of pid=32700, ppid=32695, 
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; resume 
parent processing.

2019-02-07 03:12:28,665 INFO [PEWorker-11] procedure2.ProcedureExecutor: 
Finished pid=32709, ppid=32700, state=SUCCESS, hasLock=false; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in *3.1180sec*
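
For anyone comparing procedure durations across a larger master log, a small 
Python sketch along these lines could pull the elapsed time out of the 
"Finished pid=..." entries above and flag the slow ones. It is not part of 
HBase: the function names are hypothetical, and the "mins" and "msec" units are 
assumptions about the formatter, since only "hrs" and "sec" appear in this 
excerpt.

```python
import re

# Seconds per unit. "hrs" and "sec" appear in the log excerpt above;
# "mins" and "msec" are assumptions about the same formatter.
UNIT_SECONDS = {"hrs": 3600.0, "mins": 60.0, "sec": 1.0, "msec": 0.001}

# One "<number><unit>" component of an elapsed-time string.
ELAPSED_PART = re.compile(r"(\d+(?:\.\d+)?)\s*(hrs|mins|msec|sec)")

# A "Finished pid=N, ...; <procedure class> in <elapsed>" record.
# "Finished subprocedure(s) of pid=N" records do not match this pattern.
FINISHED = re.compile(r"Finished pid=(\d+),.*?;\s*(\S+) in (.+)$")


def parse_elapsed(text):
    """Convert a string such as '24hrs, 47.379sec' to seconds."""
    return sum(float(num) * UNIT_SECONDS[unit]
               for num, unit in ELAPSED_PART.findall(text))


def finished_procedures(entries):
    """Yield (pid, procedure_name, elapsed_seconds) for finished procedures.

    Each entry is one logical log record. Line wraps and the '*' emphasis
    markers used in this comment are stripped before matching.
    """
    for entry in entries:
        flat = " ".join(entry.replace("*", "").split())
        match = FINISHED.search(flat)
        if match:
            pid, name, elapsed = match.groups()
            yield int(pid), name.rsplit(".", 1)[-1], parse_elapsed(elapsed)


def slow_procedures(entries, threshold_seconds=3600.0):
    """Return the finished procedures that ran for at least the threshold."""
    return [rec for rec in finished_procedures(entries)
            if rec[2] >= threshold_seconds]
```

Fed the two "Finished pid=" records from this comment with the default one-hour 
threshold, slow_procedures() would report only pid=32701.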

> OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-21788
>                 URL: https://issues.apache.org/jira/browse/HBASE-21788
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Sergey Shelukhin
>            Priority: Critical
>
> Not much for this one yet.
> I repeatedly see cases where a region is stuck in OPENING; after a 
> master restart the RIT is recovered and stays WAITING, while its 
> OpenRegionProcedure (also recovered) is stuck in RUNNABLE and does nothing 
> for hours. I cannot find logs on the target server indicating that it ever 
> tried to do anything after the master restart.
> This procedure needs at the very least logging of what it's trying to do, and 
> maybe a timeout so it unconditionally fails after a configurable period (1 
> hour?).
> I may also investigate why it doesn't do anything and file a separate bug. I 
> wonder if it's somehow related to the region status check, but this is just a 
> hunch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
