Sergey Shelukhin created HBASE-21786:
----------------------------------------
Summary: RIT for a region without a lock can mess up the RIT that
has the lock
Key: HBASE-21786
URL: https://issues.apache.org/jira/browse/HBASE-21786
Project: HBase
Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Sergey Shelukhin
I cannot find in the log where the 2nd RIT comes from; the first line I see
for it is the "Waiting on xlock" message. It has no parent procedure.
One RIT (pid=1738), restored from the WAL, manages after a retry to assign the
region to a server.
{noformat}
2019-01-25 10:56:21,878 INFO [master/master:17000:becomeActiveMaster]
procedure.MasterProcedureScheduler: Took xlock for pid=1738, ppid=3,
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false;
TransitRegionStateProcedure table=table,
region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN
2019-01-25 10:56:22,055 INFO [master/master:17000:becomeActiveMaster]
assignment.AssignmentManager: Attach pid=1738, ppid=3,
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false;
TransitRegionStateProcedure table=table,
region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN to rit=OFFLINE, location=null,
table=table, region=27f7ab2a05d9d730b2ab2339d1531b8e to restore RIT
2019-01-25 10:56:51,362 INFO [master/master:17000:becomeActiveMaster]
assignment.RegionStateStore: Load hbase:meta entry
region=27f7ab2a05d9d730b2ab2339d1531b8e, regionState=OFFLINE,
lastHost=server2,17020,1548290445704,
regionLocation=server1,17020,1548442302645, openSeqNum=120108
2019-01-25 10:57:26,842 INFO [PEWorker-7] procedure2.ProcedureExecutor:
Finished subprocedure(s) of pid=1738, ppid=3,
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false;
TransitRegionStateProcedure table=table,
region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN; resume parent processing.
2019-01-25 10:57:26,842 INFO [PEWorker-12]
assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=1738,
ppid=3, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true;
TransitRegionStateProcedure table=table,
region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN; rit=OFFLINE,
location=server1,17020,1548442302645
2019-01-25 10:57:26,902 INFO [PEWorker-12]
assignment.TransitRegionStateProcedure: Starting pid=1738, ppid=3,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true;
TransitRegionStateProcedure table=table,
region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN; rit=OFFLINE, location=null;
forceNewPlan=true, retain=false
2019-01-25 10:57:33,817 INFO [PEWorker-7] assignment.RegionStateStore:
pid=1738 updating hbase:meta row=27f7ab2a05d9d730b2ab2339d1531b8e,
regionState=OPENING, regionLocation=server3,17020,1548442571056
{noformat}
The other RIT (pid=4351) appears out of nowhere; there is no "to restore RIT"
line for it.
I wonder if it could be a side effect of the region being offline, or of the
retry above?
Regardless, it cannot get the lock.
{noformat}
2019-01-25 10:57:46,255 INFO [PEWorker-15] procedure.MasterProcedureScheduler:
Waiting on xlock for pid=4351,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false;
TransitRegionStateProcedure table=table,
region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN held by pid=1738
{noformat}
However, when the server reports that the region is opened, the new RIT
(pid=4351) receives the notification and discards it.
{noformat}
2019-01-25 10:58:23,263 WARN
[RpcServer.default.FPBQ.Fifo.handler=19,queue=4,port=17000]
assignment.TransitRegionStateProcedure: Received report OPENED transition from
server3,17020,1548442571056 for rit=OPENING,
location=server3,17020,1548442571056, table=table,
region=27f7ab2a05d9d730b2ab2339d1531b8e, pid=4351, but the TRSP is not in
REGION_STATE_TRANSITION_CONFIRM_OPENED state, should be a retry, ignore
{noformat}
As a result, the region is stuck in OPENING forever.
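
To make the sequence easier to reason about, here is a toy model in Java. It is
not HBase code: RitRaceSketch, Proc, RegionNode, attach() and reportOpened()
are hypothetical names. It also assumes, which the logs above do not confirm,
that the second procedure replaces the first as the one attached to the region
without any check of who holds the lock. Under that assumption the OPENED
report is routed to pid=4351, which is not in the CONFIRM_OPENED state, so the
report is dropped and pid=1738 never sees it.

```java
/** Toy model only; none of these types are real HBase classes. */
public class RitRaceSketch {

    enum ProcState { GET_ASSIGN_CANDIDATE, CONFIRM_OPENED }

    /** Hypothetical stand-in for a TransitRegionStateProcedure. */
    static final class Proc {
        final int pid;
        final boolean hasLock;
        ProcState state;

        Proc(int pid, ProcState state, boolean hasLock) {
            this.pid = pid;
            this.state = state;
            this.hasLock = hasLock;
        }
    }

    /** Hypothetical stand-in for the master's per-region bookkeeping. */
    static final class RegionNode {
        String regionState = "OFFLINE";
        Proc attached; // the single procedure that receives server reports

        /** Assumed flaw: attaching does not check who holds the region lock. */
        void attach(Proc p) {
            attached = p;
        }

        /** Returns true if the report was consumed, false if it was ignored. */
        boolean reportOpened() {
            if (attached.state != ProcState.CONFIRM_OPENED) {
                return false; // mirrors the "should be a retry, ignore" warning
            }
            regionState = "OPEN";
            return true;
        }
    }

    /** Replays the observed order of events and returns the final region state. */
    public static String simulate() {
        RegionNode node = new RegionNode();

        // pid=1738 holds the lock, picks a server, and starts waiting for OPENED
        // (the move to CONFIRM_OPENED is an assumption of this sketch).
        Proc p1738 = new Proc(1738, ProcState.GET_ASSIGN_CANDIDATE, true);
        node.attach(p1738);
        node.regionState = "OPENING";
        p1738.state = ProcState.CONFIRM_OPENED;

        // pid=4351 is still waiting on the xlock, yet it becomes the attached procedure.
        Proc p4351 = new Proc(4351, ProcState.GET_ASSIGN_CANDIDATE, false);
        node.attach(p4351);

        // The server's OPENED report reaches pid=4351 and is ignored.
        node.reportOpened();
        return node.regionState;
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

In this model the final state stays OPENING because the only procedure able to
act on the report is no longer the one the report is delivered to. Refusing to
attach a procedure that does not hold the lock would avoid that in the sketch;
whether that is the right fix in HBase itself is a separate question.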