[ 
https://issues.apache.org/jira/browse/HBASE-28405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821361#comment-17821361
 ] 

Andrew Kyle Purtell edited comment on HBASE-28405 at 2/27/24 5:46 PM:
----------------------------------------------------------------------

{quote}Maybe since region is online just changing the state in region state 
node (meta) from MERGING to OPEN would have sufficed in such cases.
{quote}
The best solution is fixing the state in the master to reflect the region is 
still online on the RS after the failed merge.

Here:
{noformat}
2024-02-11 10:53:59,074 WARN [REGION-regionserver/rs-210:60020-10] 
handler.AssignRegionHandler -
Received OPEN for table1,r1,1685436252488.a92008b76ccae47d55c590930b837036. 
which is already online
{noformat}
One option here is the RS can tell the master the assign succeeded, because the 
OPEN request is idempotent when the region is already open on the RS.
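
A minimal sketch of that RS-side option, assuming a simplified handler. The class, the {{MasterReporter}} interface and the method names are all hypothetical stand-ins, not the actual AssignRegionHandler or any HBase API. The point is only that a duplicate OPEN still produces an ack instead of returning silently:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only; none of these names are real HBase classes. */
public class IdempotentOpenSketch {
  public enum OpenResult { OPENED, ALREADY_ONLINE }

  /** Assumed stand-in for however the RS reports a successful open to the master. */
  public interface MasterReporter {
    void reportOpened(String encodedRegionName);
  }

  // Encoded names of the regions this (simulated) region server has online.
  private final Set<String> onlineRegions = ConcurrentHashMap.newKeySet();
  private final MasterReporter reporter;

  public IdempotentOpenSketch(MasterReporter reporter) {
    this.reporter = reporter;
  }

  /** Handles an OPEN request; a repeated request for an online region is still acked. */
  public OpenResult handleOpen(String encodedRegionName) {
    // add() returns false when the region was already online on this server.
    boolean newlyOpened = onlineRegions.add(encodedRegionName);
    // Ack in both cases so the procedure waiting on the master side can finish.
    reporter.reportOpened(encodedRegionName);
    return newlyOpened ? OpenResult.OPENED : OpenResult.ALREADY_ONLINE;
  }
}
```

In this sketch the second OPEN for the same region returns ALREADY_ONLINE but still calls the reporter, which is the behavior change being suggested.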

[~zhangduo]
{quote}So the problem is that, we should not issue a TRSP if the region is 
already online, when rollbacking the MergeTableRegionsProcedure. If we assign 
the region to the same RS, it will hang the rollback
{quote}
It would be ideal if the master did not make redundant requests, but if it 
does make one, the RS should handle it and return success to the master, 
because a request to open an already open region on the same server is 
idempotent with the earlier request that opened the region there in the first 
place. So why would this hang the rollback? Maybe because today the RS does 
not ack the new request? If so, we can change the RS code to do that.
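
For the master-side alternative, here is a rough sketch of the rollback decision, under the assumption that the master can tell whether an unassign was ever dispatched for the region. The class, the enum and the {{unassignDispatched}} flag are hypothetical and only illustrate the idea; this is not how MergeTableRegionsProcedure is written:

```java
/** Hypothetical sketch only; these types are illustrative, not HBase classes. */
public class MergeRollbackSketch {
  public enum RollbackAction { RESTORE_OPEN_STATE, SCHEDULE_ASSIGN }

  /**
   * Decides what a merge rollback should do for one parent region.
   *
   * @param unassignDispatched assumed flag: true only if a close was actually sent to the RS
   */
  public static RollbackAction onRollback(boolean unassignDispatched) {
    if (!unassignDispatched) {
      // The region was never closed, so it is still online on its RS.
      // Flipping the recorded state back to OPEN avoids a redundant TRSP.
      return RollbackAction.RESTORE_OPEN_STATE;
    }
    // The region may really be closed, so an assign is needed.
    return RollbackAction.SCHEDULE_ASSIGN;
  }
}
```

In the scenario reported here the exception was thrown while creating the unassign procedures, so the flag would be false and no assign would be scheduled.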



> Region open procedure silently returns without notifying the parent proc
> ------------------------------------------------------------------------
>
>                 Key: HBASE-28405
>                 URL: https://issues.apache.org/jira/browse/HBASE-28405
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.5.7
>            Reporter: Aman Poonia
>            Assignee: Aman Poonia
>            Priority: Major
>
> *We had a scenario in production where a merge operation had failed as below*
> _2024-02-11 10:53:57,715 ERROR [PEWorker-31] 
> assignment.MergeTableRegionsProcedure - Error trying to merge 
> [a92008b76ccae47d55c590930b837036, f56752ae9f30fad9de5a80a8ba578e4b] in 
> table1 (in state=MERGE_TABLE_REGIONS_CLOSE_REGIONS)_
> _org.apache.hadoop.hbase.HBaseIOException: The parent region state=MERGING, 
> location=rs-229,60020,1707587658182, table=table1, 
> region=f56752ae9f30fad9de5a80a8ba578e4b is currently in transition, give up_
> _at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManagerUtil.createUnassignProceduresForSplitOrMerge(AssignmentManagerUtil.java:120)_
> _at 
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.createUnassignProcedures(MergeTableRegionsProcedure.java:648)_
> _at 
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:205)_
> _at 
> org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure.executeFromState(MergeTableRegionsProcedure.java:79)_
> _at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)_
> _at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:922)_
> _at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)_
> _at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)_
> _at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)_
> _at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1964)_
> _at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)_
> _at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1991)_
> *Now, when the failed merge operation is rolled back, we see an issue where 
> the region stays in state opened until the RS holding it is stopped.*
> Rollback creates a TRSP as below
> _2024-02-11 10:53:57,719 DEBUG [PEWorker-31] procedure2.ProcedureExecutor - 
> Stored [pid=26674602, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=table1, 
> region=a92008b76ccae47d55c590930b837036, ASSIGN]_
> *and the rollback finished successfully*
> _2024-02-11 10:53:57,721 INFO [PEWorker-31] procedure2.ProcedureExecutor - 
> Rolled back pid=26673594, state=ROLLEDBACK, 
> exception=org.apache.hadoop.hbase.HBaseIOException via 
> master-merge-regions:org.apache.hadoop.hbase.HBaseIOException: The parent 
> region state=MERGING, location=rs-229,60020,1707587658182, table=table1, 
> region=f56752ae9f30fad9de5a80a8ba578e4b is currently in transition, give up; 
> MergeTableRegionsProcedure table=table1, 
> regions=[a92008b76ccae47d55c590930b837036, f56752ae9f30fad9de5a80a8ba578e4b], 
> force=false exec-time=1.4820 sec_
> *We create a procedure to open the region a92008b76ccae47d55c590930b837036. 
> Interestingly, we did not close the region, because the exception was thrown 
> while creating the procedure to close the regions, not while executing it. 
> When the TRSP runs, it sends an OpenRegionProcedure, which is handled by 
> AssignRegionHandler. On execution, this handler reports that the region is 
> already online*
> The sequence of events is as follows
> _2024-02-11 10:53:58,919 INFO [PEWorker-58] assignment.RegionStateStore - 
> pid=26674602 updating hbase:meta row=a92008b76ccae47d55c590930b837036, 
> regionState=OPENING, regionLocation=rs-210,60020,1707596461539_
> _2024-02-11 10:53:58,920 INFO [PEWorker-58] procedure2.ProcedureExecutor - 
> Initialized subprocedures=[{pid=26675798, ppid=26674602, state=RUNNABLE; 
> OpenRegionProcedure a92008b76ccae47d55c590930b837036, 
> server=rs-210,60020,1707596461539}]_
> _2024-02-11 10:53:59,074 WARN [REGION-regionserver/rs-210:60020-10] 
> handler.AssignRegionHandler - Received OPEN for 
> table1,r1,1685436252488.a92008b76ccae47d55c590930b837036. which is already 
> online_



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
