[ 
https://issues.apache.org/jira/browse/HBASE-28405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821214#comment-17821214
 ] 

Aman Poonia commented on HBASE-28405:
-------------------------------------

Looking at the code comments it says that it might be a mistake and master is 
checking on RS again and last response has not reached to HMaster yet. But our 
current issue seems to be little different. Here first call itself is returned 
without any reportRegionStateTransition
{code:java}
// code placeholder
String regionName = regionInfo.getRegionNameAsString();
Region onlineRegion = rs.getRegion(encodedName);
if (onlineRegion != null) {
  LOG.warn("Received OPEN for {} which is already online", regionName);
  // Just follow the old behavior, do we need to call 
reportRegionStateTransition? Maybe not?
  // For normal case, it could happen that the rpc call to schedule this 
handler is succeeded,
  // but before returning to master the connection is broken. And when master 
tries again, we
  // have already finished the opening. For this case we do not need to call
  // reportRegionStateTransition any more.
  return;
}
Boolean previous = 
rs.getRegionsInTransitionInRS().putIfAbsent(encodedNameBytes, Boolean.TRUE); 
{code}
[~apurtell] [~vjasani] FYI

> Region open procedure silently returns without notifying the parent proc
> ------------------------------------------------------------------------
>
>                 Key: HBASE-28405
>                 URL: https://issues.apache.org/jira/browse/HBASE-28405
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.5.7
>            Reporter: Aman Poonia
>            Assignee: Aman Poonia
>            Priority: Major
>
> We had a scenario in production where a merge operation had failed as below
> _2024-02-11 10:53:57,715 ERROR [PEWorker-31] 
> assignment.MergeTableRegionsProcedure - Error trying to merge 
> [a92008b76ccae47d55c590930b837036, f56752ae9f30fad9de5a80a8ba578e4b] in 
> table1 (in state=MERGE_TABLE_REGIONS_CLOSE_REGIONS)_
> _org.apache.hadoop.hbase.HBaseIOException: The parent region state=MERGING, 
> location=rs-229,60020,1707587658182, table=table1, 
> region=f56752ae9f30fad9de5a80a8ba578e4b is currently in transition, give up_
> _at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManagerUtil.createUnassignProceduresForSplitOrMerge(AssignmentManagerUtil.java:120)_
> _at 
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.createUnassignProcedures(MergeTableRegionsProcedure.java:648)_
> _at 
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:205)_
> _at 
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:79)_
> _at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)_
> _at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:922)_
> _at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)_
> _at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)_
> _at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)_
> _at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1964)_
> _at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)_
> _at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1991)_
> Now when we do rollback of failed merge operation we see a issue where region 
> is in state opened until the RS holding it stopped.
> Rollback create a TRSP as below
> _2024-02-11 10:53:57,719 DEBUG [PEWorker-31] procedure2.ProcedureExecutor - 
> Stored [pid=26674602, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1, 
> region=a92008b76ccae47d55c590930b837036, ASSIGN]_
> and rollback finished successfully
> _2024-02-11 10:53:57,721 INFO [PEWorker-31] procedure2.ProcedureExecutor - 
> Rolled back pid=26673594, state=ROLLEDBACK, 
> exception=org.apache.hadoop.hbase.HBaseIOException via 
> master-merge-regions:org.apache.hadoop.hbase.HBaseIOException: The parent 
> region state=MERGING, location=rs-229,60020,1707587658182, table=table1, 
> region=f56752ae9f30fad9de5a80a8ba578e4b is currently in transition, give up; 
> MergeTableRegionsProcedure table=table1, 
> regions=[a92008b76ccae47d55c590930b837036, f56752ae9f30fad9de5a80a8ba578e4b], 
> force=false exec-time=1.4820 sec_
> We create a procedure to open the region a92008b76ccae47d55c590930b837036
> Intrestingly we didnt close the region as creation of procedure to close 
> regions had thrown exception and not execution of procedure.
> Now when we run TRSP it sends a OpenRegionProcedure which is handled by 
> AssignRegionHandler
> This handlers on execution suggests that region is already online
> Sequence of events are as follow
> _2024-02-11 10:53:58,919 INFO [PEWorker-58] assignment.RegionStateStore - 
> pid=26674602 updating hbase:meta row=a92008b76ccae47d55c590930b837036, 
> regionState=OPENING, regionLocation=rs-210,60020,1707596461539_
> _2024-02-11 10:53:58,920 INFO [PEWorker-58] procedure2.ProcedureExecutor - 
> Initialized subprocedures=[\{pid=26675798, ppid=26674602, state=RUNNABLE; 
> OpenRegionProcedure a92008b76ccae47d55c590930b837036, 
> server=rs-210,60020,1707596461539}]_
> _2024-02-11 10:53:59,074 WARN [REGION-regionserver/rs-210:60020-10] 
> handler.AssignRegionHandler - Received OPEN for 
> table1,r1,1685436252488.a92008b76ccae47d55c590930b837036. which is already 
> online_



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to