[
https://issues.apache.org/jira/browse/HBASE-28405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834611#comment-17834611
]
Duo Zhang commented on HBASE-28405:
-----------------------------------
As I explained above, the current logic depends on both the master and the
region server (RS) side. While a region is in transition, the master is free
to issue redundant open region requests to the same region server many times;
the region server ignores the redundant requests and makes sure it reports
the result to the master at least once.
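
To illustrate the contract described above, here is a minimal sketch of a
region server that tolerates duplicate open requests. It is only a model of
the intended behaviour, not HBase code: the class IdempotentOpenSketch, its
methods, and the report strings are all hypothetical names made up for this
example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model (not HBase code) of a region server that accepts
// redundant open requests for a region without opening it twice.
class IdempotentOpenSketch {
    // Encoded names of the regions currently online on this server.
    private final Set<String> online = new HashSet<>();

    // Stand-in for the reports sent back to the master.
    final List<String> reportsToMaster = new ArrayList<>();

    /** Handles an open request; a duplicate does not reopen the region. */
    void handleOpen(String encodedRegionName) {
        // Set.add returns false if the region was already online.
        boolean newlyOpened = online.add(encodedRegionName);
        // Report in both cases, so a procedure waiting on the master side
        // is never left without an answer.
        reportsToMaster.add(
            encodedRegionName + (newlyOpened ? ":OPENED" : ":ALREADY_ONLINE"));
    }

    int onlineCount() {
        return online.size();
    }
}
```

In this sketch, sending the same open request twice leaves exactly one
online copy of the region, and the master hears back both times.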
But if a region is not in transition at all, the master should not issue any
requests to a region server. As I also explained above, it is fine if we
issue the open region request to the same RS, since it just reports to the
master that the region is already online. But what if we issue the open
region request to a different RS? That would cause the region to be online
on two region servers…
So the key point here is that, while rolling back the
MergeTableRegionsProcedure, we need to figure out whether we have actually
issued a TRSP to bring the regions offline. If not, we should not issue a
TRSP to bring the regions online…
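
A rough sketch of that rollback guard is below. It is not the actual
MergeTableRegionsProcedure code: the class MergeRollbackSketch, the
unassignIssued flag, and the method names are assumptions made for
illustration, and a real procedure would presumably have to persist such a
flag along with its other state so that it survives a master restart.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model (not HBase code) of a merge procedure that only
// reassigns regions on rollback if it actually issued the unassigns.
class MergeRollbackSketch {
    // True only once the unassign procedures have really been created.
    private boolean unassignIssued = false;

    // Stand-in for the child procedures submitted to the executor.
    final List<String> issuedProcedures = new ArrayList<>();

    /** Close step: fails before issuing anything if a parent is in transition. */
    void closeRegions(List<String> regions, boolean parentInTransition) {
        if (parentInTransition) {
            // Mirrors the failure in the report: creation of the unassign
            // procedures throws, so nothing has been taken offline.
            throw new IllegalStateException(
                "parent region is currently in transition, give up");
        }
        for (String region : regions) {
            issuedProcedures.add("UNASSIGN " + region);
        }
        unassignIssued = true;
    }

    /** Rollback of the close step: reopen only what was actually closed. */
    void rollbackCloseRegions(List<String> regions) {
        if (!unassignIssued) {
            // The regions were never taken offline, so there is nothing to undo.
            return;
        }
        for (String region : regions) {
            issuedProcedures.add("ASSIGN " + region);
        }
    }
}
```

With this guard, the failure path from the report (an exception while
creating the unassign procedures) rolls back without scheduling any assign,
so no open request can reach a second region server.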
One more thing seems a bit strange to me: I think we should only change the
region’s state to MERGING after we take it offline. Why do we set it to
MERGING while it is still online?
> Region open procedure silently returns without notifying the parent proc
> ------------------------------------------------------------------------
>
> Key: HBASE-28405
> URL: https://issues.apache.org/jira/browse/HBASE-28405
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.5.7
> Reporter: Aman Poonia
> Assignee: Aman Poonia
> Priority: Major
>
> *We had a scenario in production where a merge operation had failed as below*
> _2024-02-11 10:53:57,715 ERROR [PEWorker-31]
> assignment.MergeTableRegionsProcedure - Error trying to merge
> [a92008b76ccae47d55c590930b837036, f56752ae9f30fad9de5a80a8ba578e4b] in
> table1 (in state=MERGE_TABLE_REGIONS_CLOSE_REGIONS)_
> _org.apache.hadoop.hbase.HBaseIOException: The parent region state=MERGING,
> location=rs-229,60020,1707587658182, table=table1,
> region=f56752ae9f30fad9de5a80a8ba578e4b is currently in transition, give up_
> _at
> org.apache.hadoop.hbase.master.assignment.AssignmentManagerUtil.createUnassignProceduresForSplitOrMerge(AssignmentManagerUtil.java:120)_
> _at
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.createUnassignProcedures(MergeTableRegionsProcedure.java:648)_
> _at
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:205)_
> _at
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:79)_
> _at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)_
> _at
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:922)_
> _at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)_
> _at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)_
> _at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)_
> _at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1964)_
> _at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)_
> _at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1991)_
> *Now, when we roll back the failed merge operation, we see an issue where
> the region stays in state opened until the RS holding it is stopped.*
> The rollback creates a TRSP as below
> _2024-02-11 10:53:57,719 DEBUG [PEWorker-31] procedure2.ProcedureExecutor -
> Stored [pid=26674602,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=table1,
> region=a92008b76ccae47d55c590930b837036, ASSIGN]_
> *and rollback finished successfully*
> _2024-02-11 10:53:57,721 INFO [PEWorker-31] procedure2.ProcedureExecutor -
> Rolled back pid=26673594, state=ROLLEDBACK,
> exception=org.apache.hadoop.hbase.HBaseIOException via
> master-merge-regions:org.apache.hadoop.hbase.HBaseIOException: The parent
> region state=MERGING, location=rs-229,60020,1707587658182, table=table1,
> region=f56752ae9f30fad9de5a80a8ba578e4b is currently in transition, give up;
> MergeTableRegionsProcedure table=table1,
> regions=[a92008b76ccae47d55c590930b837036, f56752ae9f30fad9de5a80a8ba578e4b],
> force=false exec-time=1.4820 sec_
> *We create a procedure to open the region a92008b76ccae47d55c590930b837036.
> Interestingly, we didn't close the region, because the exception was thrown
> while creating the procedure to close the regions, not while executing it.
> When we run the TRSP, it sends an OpenRegionProcedure, which is handled by
> AssignRegionHandler. On execution, this handler reports that the region is
> already online.*
> The sequence of events is as follows
> _2024-02-11 10:53:58,919 INFO [PEWorker-58] assignment.RegionStateStore -
> pid=26674602 updating hbase:meta row=a92008b76ccae47d55c590930b837036,
> regionState=OPENING, regionLocation=rs-210,60020,1707596461539_
> _2024-02-11 10:53:58,920 INFO [PEWorker-58] procedure2.ProcedureExecutor -
> Initialized subprocedures=[{pid=26675798, ppid=26674602, state=RUNNABLE;
> OpenRegionProcedure a92008b76ccae47d55c590930b837036,
> server=rs-210,60020,1707596461539}]_
> _2024-02-11 10:53:59,074 WARN [REGION-regionserver/rs-210:60020-10]
> handler.AssignRegionHandler - Received OPEN for
> table1,r1,1685436252488.a92008b76ccae47d55c590930b837036. which is already
> online_
--
This message was sent by Atlassian Jira
(v8.20.10#820010)