rstest created HBASE-30201:
------------------------------

             Summary: HBase rolling upgrade buggy on `hbase:meta` crash 
recovery
                 Key: HBASE-30201
                 URL: https://issues.apache.org/jira/browse/HBASE-30201
             Project: HBase
          Issue Type: Bug
          Components: master, regionserver
    Affects Versions: 2.6.4, 4.0.0-alpha-1
            Reporter: rstest


h1. Summary
During an HBase rolling upgrade, the system-region compatibility move can race 
with meta crash recovery, leaving `hbase:meta` stuck in `OPENING`.
h1. Bug Symptom
During a rolling upgrade from HBase 2.6.4 to HBase 4.0.0-alpha-1-SNAPSHOT, the 
upgraded HMaster can try to move `hbase:meta` through the normal system-region 
compatibility path while the RegionServer hosting `hbase:meta` has just failed 
and meta is already being opened/recovered by another assignment procedure.
The observed sequence is:
- A 3-node HBase 2.6.4 cluster starts normally.
- Node0, the HMaster, is upgraded to HBase 4.0.0-alpha-1-SNAPSHOT.
- Node2, `hregion2`, is killed shortly after the Node0 upgrade.
- The raw error identifies `hregion2` as the location of `hbase:meta`.
- The upgraded HMaster runs 
`AssignmentManager.checkIfShouldMoveSystemRegionAsync()`, which moves system 
regions toward higher-version RegionServers during mixed-version operation.
- That path calls the normal `moveAsync(...)` path for `hbase:meta`.
- `hbase:meta` is already `OPENING` and already has an active assignment 
procedure.
- `AssignmentManager.preTransitCheck(...)` rejects the normal move attempt 
because the region has an active procedure.
 
Expected behavior:
- If the server hosting `hbase:meta` dies during rolling upgrade, meta recovery 
should be handled by `ServerCrashProcedure` or the active meta assignment 
procedure.
- The system-region compatibility move path should defer, skip, or retry later 
when `hbase:meta` is already in transition.
- The master should eventually assign/recover `hbase:meta` without getting 
stuck behind a rejected normal move.
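The "defer, skip, or retry later" expectation can be sketched as a small guard. This is a simplified, self-contained model, not HBase code: `SystemRegionMoveGuard`, `RegionSnapshot`, and `Decision` are hypothetical names introduced only for illustration, and the real fix may look different.

```java
// Hypothetical model of the expected behavior; none of these types are HBase classes.
final class SystemRegionMoveGuard {
  enum State { OFFLINE, OPENING, OPEN, CLOSING, CLOSED }

  enum Decision { MOVE_NOW, DEFER }

  /** Minimal stand-in for the master's view of one region. */
  static final class RegionSnapshot {
    final String table;
    final State state;
    final Long activePid; // null when no procedure currently owns the region

    RegionSnapshot(String table, State state, Long activePid) {
      this.table = table;
      this.state = state;
      this.activePid = activePid;
    }
  }

  /** Defer the compatibility move unless the region is settled and unowned. */
  static Decision decide(RegionSnapshot region) {
    if (region.activePid != null) {
      return Decision.DEFER; // another procedure (e.g. crash recovery) is in charge
    }
    if (region.state != State.OPEN) {
      return Decision.DEFER; // region is still in transition
    }
    return Decision.MOVE_NOW;
  }

  public static void main(String[] args) {
    RegionSnapshot recovering = new RegionSnapshot("hbase:meta", State.OPENING, 49L);
    RegionSnapshot settled = new RegionSnapshot("hbase:meta", State.OPEN, null);
    System.out.println(decide(recovering)); // prints DEFER
    System.out.println(decide(settled));    // prints MOVE_NOW
  }
}
```

With a guard of this shape, the observed state (`OPENING`, `pid=49`) would lead to a deferred move instead of a rejected one.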


Actual behavior:

- The upgraded HMaster logs an `HBaseIOException` at ERROR level reporting 
`hbase:meta` as `OPENING` and currently in transition.
- The normal move path is rejected while `hbase:meta` still has an active 
procedure.
- The failure is observed only in the mixed-version (old-to-new) rolling-upgrade 
lane of this test plan, not in the old-old or new-new baseline lanes.
Representative exception:
{code:java}
2026-03-07T05:07:45,259 ERROR [Thread-38] assignment.AssignmentManager:
org.apache.hadoop.hbase.HBaseIOException: state=OPENING,
location=hregion2,16020,1772859875787, table=hbase:meta,
region=1588230740 is currently in transition; pid=49
  at org.apache.hadoop.hbase.master.assignment.AssignmentManager.preTransitCheck(AssignmentManager.java:766)
  at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:879)
  at org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:896)
 {code}
Relevant code path in the upgraded version:
- `AssignmentManager.checkIfShouldMoveSystemRegionAsync()` starts a background 
compatibility check that moves system table regions to newer RegionServers.
- The method comments already call out the killed-server race: if a server is 
killed and a new one starts, this thread can think it should move system tables 
while `ServerCrashProcedure` is responsible for assignment recovery.
- For `hbase:meta`, the method calls `moveAsync(plan)` immediately.
- `moveAsync(...)` reaches `createMoveRegionProcedure(...)`.
- `createMoveRegionProcedure(...)` calls `preTransitCheck(...)`.
- `preTransitCheck(...)` throws if `regionNode.getProcedure() != null`, which 
is exactly the observed `pid=49` state.
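The call chain above can be modeled in a few lines to show why the move is rejected. This is a toy model only: the method names come from the stack trace, but the bodies, the `RegionNode` fields, and the use of plain `IOException` in place of `HBaseIOException` are assumptions for illustration, not HBase source.

```java
import java.io.IOException;

// Toy model of the rejected move; method bodies are assumptions, not HBase code.
final class MoveCallChainModel {
  /** Minimal stand-in for the master's region node. */
  static final class RegionNode {
    final String table;
    final String state;
    final Long pid; // id of the procedure that owns the region, or null

    RegionNode(String table, String state, Long pid) {
      this.table = table;
      this.state = state;
      this.pid = pid;
    }
  }

  /** Rejects any transition while another procedure owns the region. */
  static void preTransitCheck(RegionNode node) throws IOException {
    if (node.pid != null) {
      throw new IOException("state=" + node.state + ", table=" + node.table
          + " is currently in transition; pid=" + node.pid);
    }
  }

  static String createMoveRegionProcedure(RegionNode node) throws IOException {
    preTransitCheck(node); // throws before any move procedure is created
    return "move procedure for " + node.table;
  }

  static String moveAsync(RegionNode node) throws IOException {
    return createMoveRegionProcedure(node);
  }

  public static void main(String[] args) {
    RegionNode meta = new RegionNode("hbase:meta", "OPENING", 49L);
    try {
      System.out.println("scheduled: " + moveAsync(meta));
    } catch (IOException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

In this model the caller is the only place that can decide what to do after the rejection, which is why the compatibility check itself would need to defer or retry.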


This is not just a harmless log-message mismatch. `hbase:meta` is the metadata 
table used to locate user regions, so leaving it in an unresolved 
`OPENING`/in-transition state can make client and admin operations unable to 
locate regions even while cluster processes are still running.
h2. How To Reproduce
One way to reproduce is to trigger the compatibility-move and crash-recovery 
race for `hbase:meta` during a mixed-version rolling upgrade.

1. Start a 3-node HBase 2.6.4 cluster with one HMaster and two RegionServers.

2. Run a workload that creates normal table/namespace state so the cluster has 
active assignment and meta activity.

3. Start a rolling upgrade from HBase 2.6.4 to HBase 4.0.0-alpha-1-SNAPSHOT.

4. Upgrade the HMaster node first.

5. Shortly after the upgraded HMaster is running, kill the RegionServer 
currently hosting or opening `hbase:meta`. In the observed run this was Node2, 
`hregion2`.

6. Continue the rolling upgrade so other RegionServers register at the new 
version.

7. Observe that the upgraded HMaster's system-region compatibility check tries 
to move `hbase:meta` through the normal `moveAsync(...)` path while 
`hbase:meta` is already `OPENING` with an active procedure.
8. Check the HMaster logs for:

 
{code:java}
table=hbase:meta ... state=OPENING ... is currently in transition; pid=...{code}
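For step 8, a small scanner along these lines could pull the state and procedure id out of HMaster log text. It is a sketch under assumptions: `MetaTransitionLogScan` is a hypothetical helper, and the pattern is derived only from the single exception message quoted in this report, so other HBase versions or log layouts may need a different expression.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is based only on the message quoted in this report.
final class MetaTransitionLogScan {
  // DOTALL lets the match span the line wraps seen in the quoted log entry.
  private static final Pattern META_IN_TRANSITION = Pattern.compile(
      "state=(\\w+),.*?table=hbase:meta,.*?is currently in transition; pid=(\\d+)",
      Pattern.DOTALL);

  /** Returns "state=<STATE> pid=<N>" for the first matching entry, if any. */
  static Optional<String> find(String logText) {
    Matcher m = META_IN_TRANSITION.matcher(logText);
    if (!m.find()) {
      return Optional.empty();
    }
    return Optional.of("state=" + m.group(1) + " pid=" + m.group(2));
  }

  public static void main(String[] args) {
    String sample = "org.apache.hadoop.hbase.HBaseIOException: state=OPENING,\n"
        + "location=hregion2,16020,1772859875787, table=hbase:meta,\n"
        + "region=1588230740 is currently in transition; pid=49";
    System.out.println(find(sample).orElse("no match")); // prints state=OPENING pid=49
  }
}
```

Because the match is lazy and spans newlines, it is best applied to one log entry at a time rather than to a whole file.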

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
