[
https://issues.apache.org/jira/browse/HBASE-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762987#comment-16762987
]
Bahram Chehrazy commented on HBASE-21788:
-----------------------------------------
The master finally got out of that state when the server
"*<server1>,16020,1549450371876*" died and the meta region was moved to another
server via a new SCP. The old OpenRegionProcedure was terminated after 24 hours,
and the new one finished in about 3 seconds. I suspect that server1 was not
even serving that region and was therefore ignoring the dispatcher's request.
2019-02-07 03:12:20,016 INFO [RegionServerTracker-0]
master.RegionServerTracker: RegionServer *ephemeral node deleted*, processing
expiration [*<server1>,16020,1549450371876*]
2019-02-07 03:12:24,836 INFO [PEWorker-2] procedure.ServerCrashProcedure:
pid=32708, state=RUNNABLE:SERVER_CRASH_ASSIGN_META, hasLock=true;
ServerCrashProcedure server=bn01ap4d8d5feaf,16020,1549450371876, splitWal=true,
meta=true found RIT pid=32700, ppid=32695,
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false;
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN;
rit=OPENING, location=*<server1>,16020,1549450371876*, table=hbase:meta,
region=1588230740
2019-02-07 03:12:25,028 INFO [PEWorker-14] procedure2.ProcedureExecutor:
Finished subprocedure(s) of pid=32700, ppid=32695,
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false;
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; resume
parent processing.
2019-02-07 03:12:25,029 INFO [PEWorker-14] procedure2.ProcedureExecutor:
Finished *pid=32701*, ppid=32700, state=SUCCESS, hasLock=false;
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in *24hrs*,
47.379sec
2019-02-07 03:12:25,030 INFO [PEWorker-9]
assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=32700,
ppid=32695, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED,
hasLock=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740,
ASSIGN; rit=ABNORMALLY_CLOSED, location=null
2019-02-07 03:12:25,080 INFO [PEWorker-9]
assignment.TransitRegionStateProcedure: Starting pid=32700, ppid=32695,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true;
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN;
rit=ABNORMALLY_CLOSED, location=null; forceNewPlan=true, retain=false
2019-02-07 03:12:25,393 INFO [PEWorker-12] zookeeper.MetaTableLocator: Setting
hbase:meta (replicaId=0) location in ZooKeeper as
*<server2>,16020,1549450814730*
2019-02-07 03:12:25,421 INFO [PEWorker-12] procedure2.ProcedureExecutor:
Initialized subprocedures=[{pid=32709, ppid=32700, state=RUNNABLE,
hasLock=false; org.apache.hadoop.hbase.master.assignment.*OpenRegionProcedure*}]
2019-02-07 03:12:28,495 INFO
[RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16000]
zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in
ZooKeeper as *<server2>,16020,1549450814730*
2019-02-07 03:12:28,665 INFO [PEWorker-11] procedure2.ProcedureExecutor:
Finished subprocedure(s) of pid=32700, ppid=32695,
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true;
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; resume
parent processing.
2019-02-07 03:12:28,665 INFO [PEWorker-11] procedure2.ProcedureExecutor:
Finished pid=32709, ppid=32700, state=SUCCESS, hasLock=false;
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in *3.1180sec*
> OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
> ----------------------------------------------------------------------------
>
> Key: HBASE-21788
> URL: https://issues.apache.org/jira/browse/HBASE-21788
> Project: HBase
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Sergey Shelukhin
> Priority: Critical
>
> Not much for this one yet.
> I repeatedly see cases where a region is stuck in OPENING; after a master
> restart the RIT is recovered and stays WAITING, while its OpenRegionProcedure
> (also recovered) is stuck in RUNNABLE and does nothing for hours. I cannot
> find logs on the target server indicating that it ever tried to do anything
> after the master restart.
> This procedure needs at the very least logging of what it's trying to do, and
> maybe a timeout so it unconditionally fails after a configurable period (1
> hour?).
> I may also investigate why it doesn't do anything and file a separate bug. I
> wonder if it's somehow related to the region status check, but this is just a
> hunch.
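The logging and timeout suggested in the description could look roughly like the sketch below. It is only an illustration using plain java.util.concurrent, not HBase code: the class OpenWithTimeout, the Outcome enum, and the await method are hypothetical names, and the region-open acknowledgement is modelled as a CompletableFuture rather than the real procedure and dispatcher machinery.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.logging.Logger;

/** Hypothetical helper (not part of HBase): bounds the wait for a region-open ack. */
final class OpenWithTimeout {
    private static final Logger LOG = Logger.getLogger(OpenWithTimeout.class.getName());

    enum Outcome { OPENED, FAILED, TIMED_OUT }

    /**
     * Waits for the target server to acknowledge the open request.
     * Logs what is being attempted and gives up once the timeout elapses.
     */
    static Outcome await(String region, String server,
                         CompletableFuture<Boolean> ack, Duration timeout) {
        LOG.info("Waiting up to " + timeout + " for " + server + " to open region " + region);
        try {
            Boolean opened = ack.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return Boolean.TRUE.equals(opened) ? Outcome.OPENED : Outcome.FAILED;
        } catch (TimeoutException e) {
            // The server never answered: fail instead of waiting forever.
            LOG.warning("No response from " + server + " for region " + region
                + " after " + timeout + "; failing the open attempt");
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            LOG.warning("Open of region " + region + " on " + server + " failed: " + e.getCause());
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return Outcome.FAILED;
        }
    }
}
```

With something like this, a caller that sees TIMED_OUT could pick a new assignment candidate instead of waiting for the server to die. The timeout would presumably come from configuration, for example the 1 hour floated in the description.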