Grigore Lupescu created HBASE-27773:
---------------------------------------
Summary: STUCK Region-In-Transition state
Key: HBASE-27773
URL: https://issues.apache.org/jira/browse/HBASE-27773
Project: HBase
Issue Type: Bug
Components: regionserver
Affects Versions: 2.4.11
Environment: HBase: 2.4.11
Hadoop: 3.2.4
ZooKeeper: 3.7.1
Reporter: Grigore Lupescu
Attachments: config.txt
One problem we see customers encounter in the field with some regularity is
`STUCK Region-In-Transition state=OPENING`.
We have a three-server cluster that runs a full HBase stack: 3 ZooKeeper nodes,
an active and a standby HBase master, 3 region servers, and 3 HDFS data nodes.
We've managed to reproduce the stuck region-in-transition state by rebooting
one of the 3 nodes at random. This is not necessarily the only way customers may
end up in this state, but it is a way we have found to reproduce it fairly
reliably. The chances of reaching the stuck state increase if (a) data is being
written to HBase while the node reboots, or (b) the rebooted node is also the
active HBase master.
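The reproduction step can be sketched as a small dry-run helper. This is only an illustration: the node names, the `plan_reboot` function, and the `ssh <node> sudo reboot` command are assumptions and not part of our actual tooling. It picks one node at random and reports whether that node is the active master (condition (b) above). It does not execute anything.

```python
import random

# Hypothetical node names; substitute the real hosts of the 3-node cluster.
NODES = ["node1", "node2", "node3"]


def plan_reboot(nodes, active_master=None, rng=None):
    """Pick one node at random and describe the reboot without running it."""
    rng = rng or random.Random()
    target = rng.choice(nodes)
    return {
        "target": target,
        # Assumed reboot command; adjust to however nodes are restarted.
        "command": ["ssh", target, "sudo", "reboot"],
        # Rebooting the active master makes the stuck state more likely.
        "hits_active_master": target == active_master,
    }


if __name__ == "__main__":
    plan = plan_reboot(NODES, active_master="node2")
    print("would run:", " ".join(plan["command"]))
    print("active master rebooted:", plan["hits_active_master"])
```

Running a write workload against HBase while the chosen node reboots corresponds to condition (a).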
Sample logs:
{code:java}
[7745.457s][info][gc] GC(12) Pause Young (Normal) (G1 Evacuation Pause)
523M->44M(818M) 12.736ms
[10505.454s][info][gc] GC(13) Pause Young (Normal) (G1 Evacuation Pause)
523M->44M(818M) 11.066ms
2023-04-03 11:26:53,208 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition state=OPENING,
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
region=b732898573f935b72fb1876c6ff944b3
2023-04-03 11:27:53,208 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition state=OPENING,
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
region=78be037bae2fc201707fa511e90dfbbf
2023-04-03 11:27:53,208 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition state=OPENING,
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
region=b732898573f935b72fb1876c6ff944b3
2023-04-03 11:28:53,145 INFO [master/cvp504:16000.Chore.1] master.HMaster: Not
running balancer (force=false, metaRIT=false) because 2 region(s) in
transition: [state=OPENING,
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
region=78be037bae2fc201707fa511e90dfbbf, state=OPENING,
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
region=b732898573f935b72fb1876c6ff944b3]
2023-04-03 11:28:53,168 WARN [master/cvp504:16000.Chore.1]
janitor.CatalogJanitor:
unknown_server=cvp503.sjc.aristanetworks.com,16201,1680499899167/aeris_v2,\x09,1680499940070.78be037bae2fc201707fa511e90dfbbf.,
unknown_server=cvp503.sjc.aristanetworks.com,16201,1680499899167/aeris_v2,\x12,1680499940070.b732898573f935b72fb1876c6ff944b3.
2023-04-03 11:28:53,208 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition state=OPENING,
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
region=78be037bae2fc201707fa511e90dfbbf
2023-04-03 11:28:53,208 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition state=OPENING,
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
region=b732898573f935b72fb1876c6ff944b3
2023-04-03 11:29:53,209 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition state=OPENING,
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
region=78be037bae2fc201707fa511e90dfbbf
2023-04-03 11:29:53,209 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition state=OPENING,
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
region=b732898573f935b72fb1876c6ff944b3{code}
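To see which regions are stuck and for how long, the `STUCK Region-In-Transition` warnings above can be summarized per region. The following is a minimal sketch, assuming the master log keeps the format shown in the sample. The `summarize_stuck_regions` function and the `StuckRegion` type are hypothetical helpers, not part of HBase. The pattern tolerates the line wrapping seen in the pasted logs.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Matches one STUCK warning; \s+ lets a record span wrapped lines.
STUCK_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+WARN\s+"
    r"\[ProcExecTimeout\]\s+assignment\.AssignmentManager:\s+"
    r"STUCK Region-In-Transition\s+state=(\w+),\s+location=(\S+),"
    r"\s+table=(\S+),\s+region=([0-9a-f]+)"
)


@dataclass
class StuckRegion:
    region: str
    table: str
    state: str
    location: str
    first_seen: datetime
    last_seen: datetime
    count: int = 1


def summarize_stuck_regions(log_text):
    """Group STUCK warnings by encoded region name."""
    regions = {}
    for ts, state, location, table, region in STUCK_RE.findall(log_text):
        seen = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        entry = regions.get(region)
        if entry is None:
            regions[region] = StuckRegion(
                region, table, state, location, seen, seen
            )
        else:
            entry.count += 1
            entry.first_seen = min(entry.first_seen, seen)
            entry.last_seen = max(entry.last_seen, seen)
    return regions


# Three records copied from the sample logs, wrapped as in the report.
SAMPLE_LOG = """\
2023-04-03 11:26:53,208 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition state=OPENING,
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
region=b732898573f935b72fb1876c6ff944b3
2023-04-03 11:27:53,208 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition state=OPENING,
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
region=78be037bae2fc201707fa511e90dfbbf
2023-04-03 11:27:53,208 WARN [ProcExecTimeout] assignment.AssignmentManager:
STUCK Region-In-Transition state=OPENING,
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
region=b732898573f935b72fb1876c6ff944b3
"""

if __name__ == "__main__":
    for info in summarize_stuck_regions(SAMPLE_LOG).values():
        stuck_for = info.last_seen - info.first_seen
        print(info.region, info.state, info.location, info.count, stuck_for)
```

Each entry records the first and last time a region was reported, so the difference gives a lower bound on how long it has been stuck.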
The stuck state also gets _fixed_ if we kill the pod running the regionserver
that hosts the region stuck in transition.