[
https://issues.apache.org/jira/browse/HBASE-21623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769318#comment-16769318
]
Wellington Chevreuil commented on HBASE-21623:
----------------------------------------------
Thanks for the explanation [~sershe], however I still think the locks should
have prevented it. Maybe my reading of this code path is mistaken, but here is
my interpretation of which pieces of code are involved:
{quote} SCP: server1 crashed, what's on server1? looks like r1
{quote}
So this would correspond to this part of the SCP's code:
{noformat}
for (RegionInfo region : regions) {
  RegionStateNode regionNode = am.getRegionStates().getOrCreateRegionStateNode(region);
  regionNode.lock();
  try {
    if (regionNode.getProcedure() != null) {
      LOG.info("{} found RIT {}; {}", this, regionNode.getProcedure(), regionNode);
      regionNode.getProcedure().serverCrashed(env, regionNode, getServerName());
    } else {
      if (env.getMasterServices().getTableStateManager().isTableState(regionNode.getTable(),
        TableState.State.DISABLING, TableState.State.DISABLED)) {
        continue;
      }
      TransitRegionStateProcedure proc = TransitRegionStateProcedure.assign(env, region, null);
      regionNode.setProcedure(proc);
      addChildProcedure(proc);
    }
  } finally {
    regionNode.unlock();
  }
}
{noformat}
For this step:
{quote}RIT: open failed, OPENING r1 on server2 now
{quote}
The related code would be in TRSP execute -> executeFromState -> openRegion,
where the execute method is enclosed by the region node lock:
{noformat}
protected Procedure[] execute(MasterProcedureEnv env)
    throws ProcedureSuspendedException, ProcedureYieldException, InterruptedException {
  RegionStateNode regionNode =
    env.getAssignmentManager().getRegionStates().getOrCreateRegionStateNode(getRegion());
  regionNode.lock();
  try {
    return super.execute(env);
  } finally {
    regionNode.unlock();
  }
}
{noformat}
So the point below could only happen if the TRSP had already finished its
execution and the regionNode lock had been released, couldn't it? But in that
case, the SCP's *regionNode.getProcedure()* call should return null, not the
previous RIT that had already completed, shouldn't it?
{quote}SCP: looks like a RIT on r1; hey RIT for r1, your server crashed! (*)
{quote}
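One reading that seems consistent with the log in the issue description (the SCP reports the RIT as state=WAITING with location=newServer) is that the TRSP had not finished at all: the lock would cover each execute() call, while the procedure stays attached to the region node between calls. The sketch below is a toy model of that interleaving, not HBase code. The names Node, ritRetryOn, crashHandler and the checkLocation guard are all hypothetical, and the guard is only one conceivable check, not necessarily what the attached patch does.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model only; none of these types are HBase classes. */
final class StaleSnapshotDemo {

    /** Hypothetical stand-in for a region state node. */
    static final class Node {
        final ReentrantLock lock = new ReentrantLock();
        String location;      // server the region is currently being opened on
        String state;         // e.g. OPENING, ABNORMALLY_CLOSED
        boolean hasProcedure; // true while a RIT is still attached

        Node(String location) {
            this.location = location;
            this.state = "OPENING";
            this.hasProcedure = true;
        }
    }

    /** One step of the RIT: retry the open on another server, under the lock. */
    static void ritRetryOn(Node node, String newServer) {
        node.lock.lock();
        try {
            node.location = newServer;
            node.state = "OPENING";
            // The RIT is not finished, so it stays attached after the unlock.
        } finally {
            node.lock.unlock();
        }
    }

    /** Crash handling for one region taken from a previously captured list. */
    static void crashHandler(Node node, String crashedServer, boolean checkLocation) {
        node.lock.lock();
        try {
            if (!node.hasProcedure) {
                return; // nothing in transition; the reassignment path is omitted
            }
            if (checkLocation && !crashedServer.equals(node.location)) {
                return; // hypothetical guard: the RIT already moved elsewhere
            }
            node.state = "ABNORMALLY_CLOSED";
            node.location = null;
        } finally {
            node.lock.unlock();
        }
    }

    /** Replays the interleaving and returns the final "state@location". */
    static String run(boolean checkLocation) {
        Node r1 = new Node("oldServer");
        List<Node> snapshot = List.of(r1); // regions on oldServer, captured first
        ritRetryOn(r1, "newServer");       // the RIT moves on before the handler runs
        for (Node n : snapshot) {
            crashHandler(n, "oldServer", checkLocation);
        }
        return r1.state + "@" + r1.location;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // ABNORMALLY_CLOSED@null
        System.out.println(run(true));  // OPENING@newServer
    }
}
```

In this model both critical sections are correctly serialized by the lock, yet the unguarded handler still overwrites the retry, because the decision to visit r1 was made from a list captured before the RIT moved.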
> ServerCrashProcedure can stomp on a RIT for a wrong server
> ----------------------------------------------------------
>
> Key: HBASE-21623
> URL: https://issues.apache.org/jira/browse/HBASE-21623
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Affects Versions: 3.0.0, 2.2.0
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Critical
> Attachments: HBASE-21623.patch
>
>
> A server died while some region was being opened on it; eventually the open
> failed, and the RIT procedure started retrying on a different server.
> However, by then SCP for the dying server had already obtained the region
> from the list of regions on the old server, and proceeded to overwrite
> whatever the RIT was doing with a new server.
> {noformat}
> 2018-12-18 23:06:03,160 INFO [PEWorker-14] procedure2.ProcedureExecutor:
> Initialized subprocedures=[{pid=151404, ppid=151104, state=RUNNABLE,
> hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> ...
> 2018-12-18 23:06:38,208 INFO [PEWorker-10] procedure.ServerCrashProcedure:
> Start pid=151632, state=RUNNABLE:SERVER_CRASH_START, hasLock=true;
> ServerCrashProcedure server=oldServer,17020,1545202098577, splitWal=true,
> meta=false
> ...
> 2018-12-18 23:06:41,953 WARN [RSProcedureDispatcher-pool4-t115]
> assignment.RegionRemoteProcedureBase: The remote operation pid=151404,
> ppid=151104, state=RUNNABLE, hasLock=false;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region
> {ENCODED => region1, ... } to server oldServer,17020,1545202098577 failed
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException:
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server
> oldServer,17020,1545202098577 aborting
> 2018-12-18 23:06:42,485 INFO [PEWorker-5] procedure2.ProcedureExecutor:
> Finished subprocedure(s) of pid=151104, ppid=150875,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true;
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; resume parent
> processing.
> 2018-12-18 23:06:42,485 INFO [PEWorker-13]
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647;
> pid=151104, ppid=150875,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true;
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING,
> location=oldServer,17020,1545202098577
> 2018-12-18 23:06:42,500 INFO [PEWorker-13]
> assignment.TransitRegionStateProcedure: Starting pid=151104, ppid=150875,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true;
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING,
> location=null; forceNewPlan=true, retain=false
> 2018-12-18 23:06:42,657 INFO [PEWorker-2] assignment.RegionStateStore:
> pid=151104 updating hbase:meta row=region1, regionState=OPENING,
> regionLocation=newServer,17020,1545202111238
> ...
> 2018-12-18 23:06:43,094 INFO [PEWorker-4] procedure.ServerCrashProcedure:
> pid=151632, state=RUNNABLE:SERVER_CRASH_ASSIGN, hasLock=true;
> ServerCrashProcedure server=oldServer,17020,1545202098577, splitWal=true,
> meta=false found RIT pid=151104, ppid=150875,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true;
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING,
> location=newServer,17020,1545202111238, table=t1, region=region1
> 2018-12-18 23:06:43,094 INFO [PEWorker-4] assignment.RegionStateStore:
> pid=151104 updating hbase:meta row=region1, regionState=ABNORMALLY_CLOSED
> {noformat}
> Later, the RIT overwrote the state again, it seems, and then the region got
> stuck in OPENING state forever, but I'm not sure yet if that's just due to
> this bug or if there was another bug after that. For now this can be
> addressed.