[
https://issues.apache.org/jira/browse/HBASE-28405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834599#comment-17834599
]
Viraj Jasani commented on HBASE-28405:
--------------------------------------
{quote}Also, there is a related and interesting finding.
Here:
{noformat}
2024-02-11 10:53:59,074 WARN [REGION-regionserver/rs-210:60020-10]
handler.AssignRegionHandler -
Received OPEN for table1,r1,1685436252488.a92008b76ccae47d55c590930b837036.
which is already online
{noformat}
One option here is the RS can tell the master the assign succeeded, because the
OPEN request is idempotent when the region is already open on the RS.
{quote}
Yes, assign should have been idempotent, but it is not, so we need to fix this.
I came across 3 similar incidents this week, 2 of which were similar to this
Jira, i.e. the merge transition was rolled back because one of the parent
regions was in transition. However, the rollback does not complete successfully
because assign is not treated as idempotent.
{quote}It would be ideal if the master does not make redundant requests, but if
it does make one, the RS should handle the request and return success to the
master because the request to open an already open region on the same server is
idempotent with the earlier request that caused the region to be opened there
in the first place.
{quote}
+1, if assign were idempotent, we would not have run into the merge rollback
getting stuck indefinitely.
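To make the idea concrete, here is a rough sketch of the difference between returning silently and treating a redundant OPEN as idempotent. It is a toy model under stated assumptions, not HBase code: ToyMaster, ToyRegionServer, handleOpen and reportOpened are made-up names for illustration, and the real AssignRegionHandler and its reporting path are not reproduced here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only. Every class and method name here is hypothetical. */
class IdempotentOpenSketch {

  /** Stand-in for the master side: records the OPENED reports it receives. */
  static final class ToyMaster {
    final List<String> openedReports = new ArrayList<>();

    void reportOpened(String encodedRegionName) {
      openedReports.add(encodedRegionName);
    }
  }

  /** Stand-in for a region server that receives OPEN requests. */
  static final class ToyRegionServer {
    private final Set<String> onlineRegions = new HashSet<>();
    private final ToyMaster master;
    private final boolean idempotentOpen;

    ToyRegionServer(ToyMaster master, boolean idempotentOpen) {
      this.master = master;
      this.idempotentOpen = idempotentOpen;
    }

    void handleOpen(String encodedRegionName) {
      if (onlineRegions.contains(encodedRegionName)) {
        if (idempotentOpen) {
          // Already online here: answer as if the open had just succeeded,
          // so the waiting parent procedure can make progress.
          master.reportOpened(encodedRegionName);
        }
        // Otherwise return silently: the master is never notified.
        return;
      }
      onlineRegions.add(encodedRegionName);
      master.reportOpened(encodedRegionName);
    }
  }

  public static void main(String[] args) {
    String region = "a92008b76ccae47d55c590930b837036";
    for (boolean idempotent : new boolean[] {false, true}) {
      ToyMaster master = new ToyMaster();
      ToyRegionServer rs = new ToyRegionServer(master, idempotent);
      rs.handleOpen(region); // initial OPEN
      rs.handleOpen(region); // redundant OPEN, e.g. from a merge rollback
      System.out.println("idempotentOpen=" + idempotent
          + " openedReports=" + master.openedReports.size());
    }
  }
}
```

With idempotentOpen=false the second OPEN produces no report, which mirrors the stuck rollback described in this Jira. With idempotentOpen=true the master hears back on every request.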
> Region open procedure silently returns without notifying the parent proc
> ------------------------------------------------------------------------
>
> Key: HBASE-28405
> URL: https://issues.apache.org/jira/browse/HBASE-28405
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.5.7
> Reporter: Aman Poonia
> Assignee: Aman Poonia
> Priority: Major
>
> *We had a scenario in production where a merge operation had failed as below*
> _2024-02-11 10:53:57,715 ERROR [PEWorker-31]
> assignment.MergeTableRegionsProcedure - Error trying to merge
> [a92008b76ccae47d55c590930b837036, f56752ae9f30fad9de5a80a8ba578e4b] in
> table1 (in state=MERGE_TABLE_REGIONS_CLOSE_REGIONS)_
> _org.apache.hadoop.hbase.HBaseIOException: The parent region state=MERGING,
> location=rs-229,60020,1707587658182, table=table1,
> region=f56752ae9f30fad9de5a80a8ba578e4b is currently in transition, give up_
> _at
> org.apache.hadoop.hbase.master.assignment.AssignmentManagerUtil.createUnassignProceduresForSplitOrMerge(AssignmentManagerUtil.java:120)_
> _at
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.createUnassignProcedures(MergeTableRegionsProcedure.java:648)_
> _at
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:205)_
> _at
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:79)_
> _at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)_
> _at
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:922)_
> _at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)_
> _at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)_
> _at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)_
> _at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1964)_
> _at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)_
> _at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1991)_
> *Now when we roll back the failed merge operation, we see an issue where the
> region remains in state opened until the RS holding it is stopped.*
> The rollback creates a TRSP as below
> _2024-02-11 10:53:57,719 DEBUG [PEWorker-31] procedure2.ProcedureExecutor -
> Stored [pid=26674602,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=table1,
> region=a92008b76ccae47d55c590930b837036, ASSIGN]_
> *and rollback finished successfully*
> _2024-02-11 10:53:57,721 INFO [PEWorker-31] procedure2.ProcedureExecutor -
> Rolled back pid=26673594, state=ROLLEDBACK,
> exception=org.apache.hadoop.hbase.HBaseIOException via
> master-merge-regions:org.apache.hadoop.hbase.HBaseIOException: The parent
> region state=MERGING, location=rs-229,60020,1707587658182, table=table1,
> region=f56752ae9f30fad9de5a80a8ba578e4b is currently in transition, give up;
> MergeTableRegionsProcedure table=table1,
> regions=[a92008b76ccae47d55c590930b837036, f56752ae9f30fad9de5a80a8ba578e4b],
> force=false exec-time=1.4820 sec_
> *We create a procedure to open the region a92008b76ccae47d55c590930b837036.
> Interestingly, we didn't close the region, because the exception was thrown
> while creating the procedures to close the regions, not while executing them.
> When we run the TRSP, it sends an OpenRegionProcedure, which is handled by
> AssignRegionHandler. On execution, this handler reports that the region is
> already online*
> The sequence of events is as follows
> _2024-02-11 10:53:58,919 INFO [PEWorker-58] assignment.RegionStateStore -
> pid=26674602 updating hbase:meta row=a92008b76ccae47d55c590930b837036,
> regionState=OPENING, regionLocation=rs-210,60020,1707596461539_
> _2024-02-11 10:53:58,920 INFO [PEWorker-58] procedure2.ProcedureExecutor -
> Initialized subprocedures=[\\{pid=26675798, ppid=26674602, state=RUNNABLE;
> OpenRegionProcedure a92008b76ccae47d55c590930b837036,
> server=rs-210,60020,1707596461539}]_
> _2024-02-11 10:53:59,074 WARN [REGION-regionserver/rs-210:60020-10]
> handler.AssignRegionHandler - Received OPEN for
> table1,r1,1685436252488.a92008b76ccae47d55c590930b837036. which is already
> online_
--
This message was sent by Atlassian Jira
(v8.20.10#820010)