rstest created HBASE-30201:
------------------------------
Summary: HBase rolling upgrade buggy on `hbase:meta` crash recovery
Key: HBASE-30201
URL: https://issues.apache.org/jira/browse/HBASE-30201
Project: HBase
Issue Type: Bug
Components: master, regionserver
Affects Versions: 2.6.4, 4.0.0-alpha-1
Reporter: rstest
h1. Summary
During an HBase rolling upgrade, the system-region compatibility move can race
with meta crash recovery, leaving `hbase:meta` stuck in `OPENING`.
h1. Bug Symptom
During a rolling upgrade from HBase 2.6.4 to HBase 4.0.0-alpha-1-SNAPSHOT, the
upgraded HMaster can try to move `hbase:meta` through the normal system-region
compatibility path while the RegionServer hosting `hbase:meta` has just failed
and meta is already being opened/recovered by another assignment procedure.
The observed sequence is:
- A 3-node HBase 2.6.4 cluster starts normally.
- Node0, the HMaster, is upgraded to HBase 4.0.0-alpha-1-SNAPSHOT.
- Node2, `hregion2`, is killed shortly after the Node0 upgrade.
- The raw error identifies `hregion2` as the location of `hbase:meta`.
- The upgraded HMaster runs
`AssignmentManager.checkIfShouldMoveSystemRegionAsync()`, which moves system
regions toward higher-version RegionServers during mixed-version operation.
- That path calls the normal `moveAsync(...)` path for `hbase:meta`.
- `hbase:meta` is already `OPENING` and already has an active assignment
procedure.
- `AssignmentManager.preTransitCheck(...)` rejects the normal move attempt
because the region has an active procedure.
Expected behavior:
- If the server hosting `hbase:meta` dies during rolling upgrade, meta recovery
should be handled by `ServerCrashProcedure` or the active meta assignment
procedure.
- The system-region compatibility move path should defer, skip, or retry later
when `hbase:meta` is already in transition.
- The master should eventually assign/recover `hbase:meta` without getting
stuck behind a rejected normal move.
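The "defer, skip, or retry later" behavior above can be sketched as a guard that runs before the compatibility move is submitted. This is a minimal illustration under stated assumptions, not the HBase implementation: `SystemRegionMoveGuard`, `RegionSnapshot`, and `Decision` are hypothetical names introduced here, and the in-transition test is a simplification of whatever the real `AssignmentManager` tracks.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified model; none of these types are HBase classes.
final class SystemRegionMoveGuard {
    enum Decision { SUBMIT_MOVE, DEFER }

    /** Minimal stand-in for the master's per-region state. */
    static final class RegionSnapshot {
        final String regionName;
        final String state;            // for example "OPEN" or "OPENING"
        final Long activeProcedureId;  // null when no procedure owns the region

        RegionSnapshot(String regionName, String state, Long activeProcedureId) {
            this.regionName = regionName;
            this.state = state;
            this.activeProcedureId = activeProcedureId;
        }

        /** Assumption: a region is "in transition" if a procedure owns it or it is not OPEN. */
        boolean isInTransition() {
            return activeProcedureId != null || !"OPEN".equals(state);
        }
    }

    // Regions whose compatibility move was postponed to a later pass.
    private final Deque<RegionSnapshot> deferred = new ArrayDeque<>();

    /** Defers the move when another procedure (for example crash recovery) owns the region. */
    Decision decide(RegionSnapshot region) {
        if (region.isInTransition()) {
            deferred.addLast(region);
            return Decision.DEFER;
        }
        return Decision.SUBMIT_MOVE;
    }

    int deferredCount() {
        return deferred.size();
    }
}
```

With this guard, a snapshot such as `new RegionSnapshot("hbase:meta", "OPENING", 49L)` is deferred instead of being handed to the normal move path, while a region that is `OPEN` with no owning procedure is still moved.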
Actual behavior:
- The upgraded HMaster logs an `ERROR` from `AssignmentManager` reporting
`hbase:meta` as in transition in state `OPENING`.
- The normal move path is rejected while `hbase:meta` still has an active
procedure.
- The failure was observed only in the mixed-version (old-new) rolling-upgrade
lane of this test plan, not in the old-old or new-new baseline lanes.
Representative exception:
{code:java}
2026-03-07T05:07:45,259 ERROR [Thread-38] assignment.AssignmentManager:
org.apache.hadoop.hbase.HBaseIOException: state=OPENING,
location=hregion2,16020,1772859875787, table=hbase:meta,
region=1588230740 is currently in transition; pid=49
at org.apache.hadoop.hbase.master.assignment.AssignmentManager.preTransitCheck(AssignmentManager.java:766)
at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:879)
at org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:896)
{code}
Relevant code path in the upgraded version:
- `AssignmentManager.checkIfShouldMoveSystemRegionAsync()` starts a background
compatibility check that moves system table regions to newer RegionServers.
- The method comments already call out the killed-server race: if a server is
killed and a new one starts, this thread can think it should move system tables
while `ServerCrashProcedure` is responsible for assignment recovery.
- For `hbase:meta`, the method calls `moveAsync(plan)` immediately.
- `moveAsync(...)` reaches `createMoveRegionProcedure(...)`.
- `createMoveRegionProcedure(...)` calls `preTransitCheck(...)`.
- `preTransitCheck(...)` throws if `regionNode.getProcedure() != null`, which
is exactly the observed `pid=49` state.
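The call chain above can be modeled in a few lines to show why the move is rejected. This is a simplified sketch, not the HBase source: `MoveRejectionModel` and `RegionNodeModel` are hypothetical names, the method bodies only mirror the behavior described in this report, and a plain `IOException` stands in for `HBaseIOException`.

```java
import java.io.IOException;

// Hypothetical model of the described call chain; not the real AssignmentManager.
final class MoveRejectionModel {

    /** Stand-in for a region node with an optional attached procedure id. */
    static final class RegionNodeModel {
        final String table;
        final String encodedName;
        final String state;
        final Long procedureId; // null when no procedure is attached

        RegionNodeModel(String table, String encodedName, String state, Long procedureId) {
            this.table = table;
            this.encodedName = encodedName;
            this.state = state;
            this.procedureId = procedureId;
        }
    }

    /** Rejects the transition when a procedure is already attached to the region. */
    static void preTransitCheck(RegionNodeModel node) throws IOException {
        if (node.procedureId != null) {
            throw new IOException("state=" + node.state + ", table=" + node.table
                + ", region=" + node.encodedName
                + " is currently in transition; pid=" + node.procedureId);
        }
    }

    /** Runs the pre-transit check, then returns a description of the move it would create. */
    static String createMoveRegionProcedure(RegionNodeModel node) throws IOException {
        preTransitCheck(node);
        return "move " + node.table + "/" + node.encodedName;
    }

    /** Entry point used by the compatibility check in this model. */
    static String moveAsync(RegionNodeModel node) throws IOException {
        return createMoveRegionProcedure(node);
    }
}
```

Calling `moveAsync` with a node for `hbase:meta` in `OPENING` and `procedureId` 49 throws from `preTransitCheck`, which matches the order of frames in the stack trace reported here.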
This is not just a harmless log-message mismatch. `hbase:meta` is the metadata
table used to locate user regions, so leaving it in an unresolved
`OPENING`/in-transition state can make client and admin operations unable to
locate regions even while cluster processes are still running.
h2. How To Reproduce
One way to reproduce is to trigger the compatibility-move and crash-recovery
race for `hbase:meta` during a mixed-version rolling upgrade.
1. Start a 3-node HBase 2.6.4 cluster with one HMaster and two RegionServers.
2. Run a workload that creates normal table/namespace state so the cluster has
active assignment and meta activity.
3. Start a rolling upgrade from HBase 2.6.4 to HBase 4.0.0-alpha-1-SNAPSHOT.
4. Upgrade the HMaster node first.
5. Shortly after the upgraded HMaster is running, kill the RegionServer
currently hosting or opening `hbase:meta`.
In the observed run this was Node2, `hregion2`.
6. Continue the rolling upgrade so other RegionServers register at the new
version.
7. Observe that the upgraded HMaster's system-region compatibility check tries
to move `hbase:meta` through the normal `moveAsync(...)` path while
`hbase:meta` is already `OPENING` with an active procedure.
8. Check the HMaster logs for:
{code:java}
state=OPENING ... table=hbase:meta ... is currently in transition; pid=...
{code}
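The log check in step 8 can be automated with a small matcher. The sketch below assumes the exception message appears on a single line with the fields in the order shown in the representative exception (`state`, `location`, `table`, `region`); `MetaTransitionLogMatcher` and `stuckMetaPid` are hypothetical names, and the pattern may need adjusting if the message format differs.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for scanning HMaster log lines; not part of HBase.
final class MetaTransitionLogMatcher {
    // Assumes the field order state, ..., table, ..., pid within one line.
    private static final Pattern META_OPENING = Pattern.compile(
        "state=OPENING,.*table=hbase:meta,.*is currently in transition; pid=(\\d+)");

    /** Returns the procedure id if the line reports hbase:meta in transition in OPENING. */
    static OptionalLong stuckMetaPid(String logLine) {
        Matcher m = META_OPENING.matcher(logLine);
        if (m.find()) {
            return OptionalLong.of(Long.parseLong(m.group(1)));
        }
        return OptionalLong.empty();
    }
}
```

Feeding each HMaster log line to `stuckMetaPid` yields the `pid` of the procedure that currently owns `hbase:meta`, which can then be looked up among the master's procedures.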