[
https://issues.apache.org/jira/browse/HBASE-27773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707935#comment-17707935
]
Grigore Lupescu commented on HBASE-27773:
-----------------------------------------
Regarding the _symptom_ _fix_ in which we kill the regionserver identified by
location in the STUCK Region-In-Transition warning, would this be a reasonable
way to advance/unblock the state?
We might add some logic that, as a last resort, unblocks the state when a
stuck region in transition is detected.
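
To make the detection half of that idea concrete, here is a minimal sketch that
scans master log text for the STUCK warning shown in the issue description and
returns the regionserver location for each region that has been reported
repeatedly. It is not HBase code: the function name find_stuck_regions, the
min_reports threshold and its default of 3 are assumptions made for
illustration, and the pattern is derived only from the sample log lines quoted
below.

```python
import re
from collections import Counter

# Pattern derived from the AssignmentManager warning quoted in this issue.
# The location is "host,port,startcode", as in the sample logs.
STUCK_RE = re.compile(
    r"STUCK Region-In-Transition\s+state=(?P<state>\w+),\s*"
    r"location=(?P<location>[^\s,]+,\d+,\d+),\s*"
    r"table=(?P<table>[^\s,]+),\s*"
    r"region=(?P<region>[0-9a-f]+)"
)


def find_stuck_regions(log_text, min_reports=3):
    """Return {region: location} for regions reported STUCK at least
    min_reports times. min_reports is a hypothetical threshold."""
    # Log lines may arrive wrapped, so collapse all whitespace first and
    # let the pattern match across the original line breaks.
    flat = " ".join(log_text.split())
    counts = Counter()
    last_location = {}
    for match in STUCK_RE.finditer(flat):
        region = match["region"]
        counts[region] += 1
        last_location[region] = match["location"]  # keep the latest location
    return {r: last_location[r] for r, n in counts.items() if n >= min_reports}
```

The returned location only names a candidate regionserver. Whether to restart
it, and after how many reports, would still be the last-resort decision
discussed above.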
> STUCK Region-In-Transition state
> --------------------------------
>
> Key: HBASE-27773
> URL: https://issues.apache.org/jira/browse/HBASE-27773
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 2.4.11
> Environment: HBase: 2.4.11
> Hadoop: 3.2.4
> ZooKeeper: 3.7.1
> Reporter: Grigore Lupescu
> Priority: Major
> Attachments: config.txt
>
>
> One problem we see customers encounter in the field with some regularity is
> the `STUCK Region-In-Transition state=OPENING`.
> We have a three-server cluster that runs a full HBase stack: 3 ZooKeeper
> nodes, an active and a standby HBase master, 3 region servers, and 3 HDFS
> data nodes.
> We've managed to reproduce the stuck region-in-transition state by randomly
> rebooting one of the 3 nodes. This is not necessarily the only way customers
> may end up in this state, but it is a way we managed to reproduce it fairly
> reliably. The chances of reaching the stuck state increase if (a) data is
> being written to HBase while the node reboots, or (b) the rebooted node also
> hosts the active HBase master.
> Sample logs:
>
> {code:java}
> [7745.457s][info][gc] GC(12) Pause Young (Normal) (G1 Evacuation Pause)
> 523M->44M(818M) 12.736ms
> [10505.454s][info][gc] GC(13) Pause Young (Normal) (G1 Evacuation Pause)
> 523M->44M(818M) 11.066ms
> 2023-04-03 11:26:53,208 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition state=OPENING,
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
> region=b732898573f935b72fb1876c6ff944b3
> 2023-04-03 11:27:53,208 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition state=OPENING,
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
> region=78be037bae2fc201707fa511e90dfbbf
> 2023-04-03 11:27:53,208 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition state=OPENING,
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
> region=b732898573f935b72fb1876c6ff944b3
> 2023-04-03 11:28:53,145 INFO [master/cvp504:16000.Chore.1] master.HMaster:
> Not running balancer (force=false, metaRIT=false) because 2 region(s) in
> transition: [state=OPENING,
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
> region=78be037bae2fc201707fa511e90dfbbf, state=OPENING,
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
> region=b732898573f935b72fb1876c6ff944b3]
> 2023-04-03 11:28:53,168 WARN [master/cvp504:16000.Chore.1]
> janitor.CatalogJanitor:
> unknown_server=cvp503.sjc.aristanetworks.com,16201,1680499899167/aeris_v2,\x09,1680499940070.78be037bae2fc201707fa511e90dfbbf.,
>
> unknown_server=cvp503.sjc.aristanetworks.com,16201,1680499899167/aeris_v2,\x12,1680499940070.b732898573f935b72fb1876c6ff944b3.
> 2023-04-03 11:28:53,208 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition state=OPENING,
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
> region=78be037bae2fc201707fa511e90dfbbf
> 2023-04-03 11:28:53,208 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition state=OPENING,
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
> region=b732898573f935b72fb1876c6ff944b3
> 2023-04-03 11:29:53,209 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition state=OPENING,
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
> region=78be037bae2fc201707fa511e90dfbbf
> 2023-04-03 11:29:53,209 WARN [ProcExecTimeout] assignment.AssignmentManager:
> STUCK Region-In-Transition state=OPENING,
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2,
> region=b732898573f935b72fb1876c6ff944b3{code}
>
> The stuck state also gets _fixed_ if we kill the pod running the regionserver
> that hosts the region stuck in transition.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)