[
https://issues.apache.org/jira/browse/HBASE-21623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16768632#comment-16768632
]
Sergey Shelukhin commented on HBASE-21623:
------------------------------------------
Do you mean the language-level lock, or the HBase-level region lock?
The RIT had already retried on a different server, so the fact that the region
was at one time opening on the crashed server was irrelevant by that point. SCP
doesn't appear to consider the HBase-level locks when it decides to replace
(and it couldn't, because whether it is updating the right or the wrong RIT,
the RIT would still be holding that lock, so the situation is no different).
The language-level lock protects individual procedure assignments from racing,
assuming every piece of code under it does the correct checks; auditing them
all is out of scope for this issue. I was assuming you see some specific bug
with that.
The race proceeds as follows (each individual step looks safe from
lower-level races to me, given the region lock):
RIT: OPENING r1 on server1.
Server1: (silence)
SCP: server1 crashed, what's on server1? looks like r1
RIT: open failed, OPENING r1 on server2 now
Server2: opening...
SCP: looks like a RIT on r1; hey RIT for r1, your server crashed! (*)
RIT: oh well, OPENING r1 on server3 now
In this case, that also leads to:
Server3: opening...
Server2: hey I opened r1!
RIT: who cares, it's on server3 now (as a side note, I'm adding an RS kill here
in a separate JIRA; ignoring this is not safe)
Server3: hey I (also) opened r1!
The fix is for (*) to check which server has crashed, i.e. whether the RIT is
still targeting the crashed server, before updating it. I don't think SCP can
list regions and notify the RITs atomically without major changes, because
those are separate state machine states.
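To make the check at (*) concrete, here is a minimal sketch of the idea, not
the attached patch. All names (ScpRitSketch, Rit, notifyServerCrashed) are
hypothetical and are not HBase classes or methods. Locking and hbase:meta
updates are left out; the sketch assumes the caller already holds the region
lock.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the race; none of these names are HBase APIs.
final class ScpRitSketch {

  /** Minimal stand-in for a region-in-transition (RIT) record. */
  static final class Rit {
    final String region;
    String targetServer; // server the RIT is currently opening the region on
    final List<String> history = new ArrayList<>();

    Rit(String region, String targetServer) {
      this.region = region;
      this.targetServer = targetServer;
      history.add(targetServer);
    }

    /** The RIT moves the open attempt to another server. */
    void retryOn(String newServer) {
      targetServer = newServer;
      history.add(newServer);
    }
  }

  /**
   * Step (*) with the proposed check: the RIT is told that its server crashed
   * only if it is still targeting the crashed server. Returns true if the RIT
   * was redirected, false if the notification was stale and ignored.
   */
  static boolean notifyServerCrashed(Rit rit, String crashedServer, String fallbackServer) {
    if (!crashedServer.equals(rit.targetServer)) {
      return false; // the RIT has already moved on; do not stomp on it
    }
    rit.retryOn(fallbackServer);
    return true;
  }

  public static void main(String[] args) {
    Rit rit = new Rit("r1", "server1");
    // SCP for server1 has listed r1 by now; meanwhile the open fails
    // and the RIT retries on server2.
    rit.retryOn("server2");
    // SCP reaches (*) with its stale view of r1.
    boolean notified = notifyServerCrashed(rit, "server1", "server3");
    System.out.println("notified=" + notified + ", target=" + rit.targetServer);
    // prints "notified=false, target=server2"
  }
}
```

Without the equals check, the last call would move r1 to server3 while server2
is still opening it, which is the double open shown in the trace above.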
> ServerCrashProcedure can stomp on a RIT for a wrong server
> ----------------------------------------------------------
>
> Key: HBASE-21623
> URL: https://issues.apache.org/jira/browse/HBASE-21623
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Affects Versions: 3.0.0, 2.2.0
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Critical
> Attachments: HBASE-21623.patch
>
>
> A server died while some region was being opened on it; eventually the open
> failed, and the RIT procedure started retrying on a different server.
> However, by then SCP for the dying server had already obtained the region
> from the list of regions on the old server, and proceeded to overwrite
> whatever the RIT was doing with a new server.
> {noformat}
> 2018-12-18 23:06:03,160 INFO [PEWorker-14] procedure2.ProcedureExecutor:
> Initialized subprocedures=[{pid=151404, ppid=151104, state=RUNNABLE,
> hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> ...
> 2018-12-18 23:06:38,208 INFO [PEWorker-10] procedure.ServerCrashProcedure:
> Start pid=151632, state=RUNNABLE:SERVER_CRASH_START, hasLock=true;
> ServerCrashProcedure server=oldServer,17020,1545202098577, splitWal=true,
> meta=false
> ...
> 2018-12-18 23:06:41,953 WARN [RSProcedureDispatcher-pool4-t115]
> assignment.RegionRemoteProcedureBase: The remote operation pid=151404,
> ppid=151104, state=RUNNABLE, hasLock=false;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region
> {ENCODED => region1, ... } to server oldServer,17020,1545202098577 failed
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException:
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server
> oldServer,17020,1545202098577 aborting
> 2018-12-18 23:06:42,485 INFO [PEWorker-5] procedure2.ProcedureExecutor:
> Finished subprocedure(s) of pid=151104, ppid=150875,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true;
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; resume parent
> processing.
> 2018-12-18 23:06:42,485 INFO [PEWorker-13]
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647;
> pid=151104, ppid=150875,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true;
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING,
> location=oldServer,17020,1545202098577
> 2018-12-18 23:06:42,500 INFO [PEWorker-13]
> assignment.TransitRegionStateProcedure: Starting pid=151104, ppid=150875,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true;
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING,
> location=null; forceNewPlan=true, retain=false
> 2018-12-18 23:06:42,657 INFO [PEWorker-2] assignment.RegionStateStore:
> pid=151104 updating hbase:meta row=region1, regionState=OPENING,
> regionLocation=newServer,17020,1545202111238
> ...
> 2018-12-18 23:06:43,094 INFO [PEWorker-4] procedure.ServerCrashProcedure:
> pid=151632, state=RUNNABLE:SERVER_CRASH_ASSIGN, hasLock=true;
> ServerCrashProcedure server=oldServer,17020,1545202098577, splitWal=true,
> meta=false found RIT pid=151104, ppid=150875,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true;
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING,
> location=newServer,17020,1545202111238, table=t1, region=region1
> 2018-12-18 23:06:43,094 INFO [PEWorker-4] assignment.RegionStateStore:
> pid=151104 updating hbase:meta row=region1, regionState=ABNORMALLY_CLOSED
> {noformat}
> Later, the RIT overwrote the state again, it seems, and then the region got
> stuck in OPENING state forever, but I'm not sure yet if that's just due to
> this bug or if there was another bug after that. For now this can be
> addressed.