[ 
https://issues.apache.org/jira/browse/HBASE-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085942#comment-18085942
 ] 

rstest commented on HBASE-30201:
--------------------------------

Potential fix direction:

- In `checkIfShouldMoveSystemRegionAsync()`, do not call the normal 
`moveAsync(...)` path for `hbase:meta` when the region is already in transition 
or when the source/target server is dead, queued-dead, or currently being 
handled by `ServerCrashProcedure`.
- If `hbase:meta` already has an active procedure, defer the compatibility move 
and retry after the procedure completes instead of treating the normal move 
failure as the recovery path.
- Add a regression test where an upgraded HMaster sees a newly registered 
RegionServer while the old RegionServer hosting `hbase:meta` has failed during 
`OPENING`; the test should assert that `hbase:meta` is eventually assigned and 
that the compatibility move path does not interfere with crash recovery.
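
A minimal sketch of the guard described in the first two bullets, under stated assumptions. `MetaMoveGuard`, `RegionSnapshot`, `Decision`, and `decide(...)` are hypothetical names used only for illustration; they are not HBase classes or methods, and the real check would read this state from the master's region and server tracking rather than from a plain snapshot object.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical sketch; none of these types are HBase classes. */
public class MetaMoveGuard {

  /** Outcome of the pre-check for a compatibility move. */
  public enum Decision { MOVE, DEFER, SKIP }

  /** Simplified, assumed view of the state the guard would inspect. */
  public static final class RegionSnapshot {
    final boolean inTransition;        // e.g. region state is OPENING
    final boolean hasActiveProcedure;  // e.g. a procedure such as pid=49 is attached
    final String sourceServer;
    final String targetServer;

    public RegionSnapshot(boolean inTransition, boolean hasActiveProcedure,
        String sourceServer, String targetServer) {
      this.inTransition = inTransition;
      this.hasActiveProcedure = hasActiveProcedure;
      this.sourceServer = sourceServer;
      this.targetServer = targetServer;
    }
  }

  /**
   * Decide whether the compatibility path may issue a normal move.
   * unavailableServers is assumed to hold servers that are dead, queued-dead,
   * or currently being handled by crash recovery.
   */
  public static Decision decide(RegionSnapshot region, Set<String> unavailableServers) {
    // Crash recovery owns assignment when either endpoint is unavailable.
    if (unavailableServers.contains(region.sourceServer)
        || unavailableServers.contains(region.targetServer)) {
      return Decision.SKIP;
    }
    // An in-flight procedure must finish first; retry the check afterwards.
    if (region.inTransition || region.hasActiveProcedure) {
      return Decision.DEFER;
    }
    return Decision.MOVE;
  }

  public static void main(String[] args) {
    // State modeled on the report: meta is OPENING with an active procedure
    // on a server that has been killed. The target name is made up.
    String killed = "hregion2,16020,1772859875787";
    RegionSnapshot meta = new RegionSnapshot(true, true, killed, "hypothetical-target");
    System.out.println(decide(meta, Collections.singleton(killed)));
  }
}
```

The ordering of the two checks is a design choice in this sketch: a dead source or target is checked first so that the compatibility path stays out of the way entirely, and deferral applies only when the servers are healthy but a procedure is still running.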

> HBase rolling upgrade buggy on `hbase:meta` crash recovery
> ----------------------------------------------------------
>
>                 Key: HBASE-30201
>                 URL: https://issues.apache.org/jira/browse/HBASE-30201
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver
>    Affects Versions: 4.0.0-alpha-1, 2.6.4
>            Reporter: rstest
>            Priority: Major
>
> h1. Summary
> HBase rolling upgrade can race system-region compatibility movement with meta 
> crash recovery, leaving `hbase:meta` stuck in `OPENING`
> h1. Bug Symptom
> During a rolling upgrade from HBase 2.6.4 to HBase 4.0.0-alpha-1-SNAPSHOT, 
> the upgraded HMaster can try to move `hbase:meta` through the normal 
> system-region compatibility path while the RegionServer hosting `hbase:meta` 
> has just failed and meta is already being opened/recovered by another 
> assignment procedure.
> The observed sequence is:
>  - A 3-node HBase 2.6.4 cluster starts normally.
>  - Node0, the HMaster, is upgraded to HBase 4.0.0-alpha-1-SNAPSHOT.
>  - Node2, `hregion2`, is killed shortly after the Node0 upgrade.
>  - The raw error identifies `hregion2` as the location of `hbase:meta`.
>  - The upgraded HMaster runs 
> `AssignmentManager.checkIfShouldMoveSystemRegionAsync()`, which moves system 
> regions toward higher-version RegionServers during mixed-version operation.
>  - That path calls the normal `moveAsync(...)` path for `hbase:meta`.
>  - `hbase:meta` is already `OPENING` and already has an active assignment 
> procedure.
>  - `AssignmentManager.preTransitCheck(...)` rejects the normal move attempt 
> because the region has an active procedure.
>  
> Expected behavior:
>  - If the server hosting `hbase:meta` dies during rolling upgrade, meta 
> recovery should be handled by `ServerCrashProcedure` or the active meta 
> assignment procedure.
>  - The system-region compatibility move path should defer, skip, or retry 
> later when `hbase:meta` is already in transition.
>  - The master should eventually assign/recover `hbase:meta` without getting 
> stuck behind a rejected normal move.
> Actual behavior:
>  - The upgraded HMaster logs an `ERROR` from `AssignmentManager` reporting 
> `hbase:meta` in `OPENING`.
>  - The normal move path is rejected while `hbase:meta` still has an active 
> procedure.
>  - The failure was observed only in the mixed-version (old-to-new) rolling 
> upgrade run of this test plan, not in the old-old or new-new baseline runs.
> Representative exception:
> {code:java}
> 2026-03-07T05:07:45,259 ERROR [Thread-38] assignment.AssignmentManager:
> org.apache.hadoop.hbase.HBaseIOException: state=OPENING,
> location=hregion2,16020,1772859875787, table=hbase:meta,
> region=1588230740 is currently in transition; pid=49
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.preTransitCheck(AssignmentManager.java:766)
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:879)
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:896)
>  {code}
> Relevant code path in the upgraded version:
>  - `AssignmentManager.checkIfShouldMoveSystemRegionAsync()` starts a 
> background compatibility check that moves system table regions to newer 
> RegionServers.
>  - The method comments already call out the killed-server race: if a server 
> is killed and a new one starts, this thread can think it should move system 
> tables while `ServerCrashProcedure` is responsible for assignment recovery.
>  - For `hbase:meta`, the method calls `moveAsync(plan)` immediately.
>  - `moveAsync(...)` reaches `createMoveRegionProcedure(...)`.
>  - `createMoveRegionProcedure(...)` calls `preTransitCheck(...)`.
>  - `preTransitCheck(...)` throws if `regionNode.getProcedure() != null`, 
> which is exactly the observed `pid=49` state.
> This is not just a harmless log-message mismatch. `hbase:meta` is the 
> metadata table used to locate user regions, so leaving it in an unresolved 
> `OPENING`/in-transition state can make client and admin operations unable to 
> locate regions even while cluster processes are still running.
> h2. How To Reproduce
> One way to reproduce is to trigger the compatibility-move and crash-recovery 
> race for `hbase:meta` during a mixed-version rolling upgrade.
> 1. Start a 3-node HBase 2.6.4 cluster with one HMaster and two RegionServers.
> 2. Run a workload that creates normal table/namespace state so the cluster 
> has active assignment and meta activity.
> 3. Start a rolling upgrade from HBase 2.6.4 to HBase 4.0.0-alpha-1-SNAPSHOT.
> 4. Upgrade the HMaster node first.
> 5. Shortly after the upgraded HMaster is running, kill the RegionServer 
> currently hosting or opening `hbase:meta`.
> In the observed run this was Node2, `hregion2`.
> 6. Continue the rolling upgrade so other RegionServers register at the new 
> version.
> 7. Observe that the upgraded HMaster's system-region compatibility check 
> tries to move `hbase:meta` through the normal `moveAsync(...)` path while 
> `hbase:meta` is already `OPENING` with an active procedure.
> 8. Check the HMaster logs for:
> {code:java}
> state=OPENING ... table=hbase:meta ... is currently in transition; pid=...
> {code}
>   



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
