[ 
https://issues.apache.org/jira/browse/HBASE-27773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17708557#comment-17708557
 ] 

Aaron Beitch commented on HBASE-27773:
--------------------------------------

Adding to what [~grigore] has posted, the cause of the STUCK RIT seems to be an 
open region procedure that is never communicated to the region server. The 
master's debug HTML page shows the procedure waiting to be completed, but the 
region server logs (even with trace/debug logging enabled) show no evidence 
that the region server ever received the request.
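
To see at a glance which regions the master keeps reporting as stuck, the 
warnings can be tallied with a small script. This is only a sketch and not part 
of HBase: the function name count_stuck_regions and the regular expression are 
assumptions based on the AssignmentManager WARN lines in the report quoted 
below, and the message format may differ in other HBase versions.

```python
import re
from collections import Counter

# Pattern inferred from the "STUCK Region-In-Transition" WARN lines in the
# report; this is an assumption, not a format guaranteed by HBase.
STUCK_RE = re.compile(
    r"STUCK\s+Region-In-Transition\s+state=(?P<state>\w+),\s*"
    r"location=(?P<location>[^,\s]+,\d+,\d+),\s*"
    r"table=(?P<table>[^,\s]+),\s*"
    r"region=(?P<region>[0-9a-f]+)"
)


def count_stuck_regions(log_text):
    """Count STUCK RIT warnings per (table, region, location)."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = re.sub(r"\s+", " ", log_text)
    counts = Counter()
    for m in STUCK_RE.finditer(flat):
        counts[(m.group("table"), m.group("region"), m.group("location"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_stuck.py < hbase-master.log
    for (table, region, location), n in count_stuck_regions(sys.stdin.read()).most_common():
        print(f"{n:5d}  {table}  {region}  {location}")
```

A region whose count keeps growing across successive ProcExecTimeout runs is a 
candidate for checking whether the target region server ever logged the open 
request.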

> STUCK Region-In-Transition state
> --------------------------------
>
>                 Key: HBASE-27773
>                 URL: https://issues.apache.org/jira/browse/HBASE-27773
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 2.4.11
>         Environment: HBase: 2.4.11
> Hadoop: 3.2.4
> ZooKeeper: 3.7.1
>            Reporter: Grigore Lupescu
>            Priority: Major
>         Attachments: config.txt
>
>
> One problem we encounter with some regularity is the `STUCK 
> Region-In-Transition state=OPENING`.
> We have a three-server cluster that runs a full HBase stack: 3 ZooKeeper 
> nodes, an active and a standby HBase master, 3 region servers, and 3 HDFS 
> data nodes.
> We've managed to reproduce the stuck region-in-transition state by rebooting 
> one of the 3 nodes at random. This is not necessarily the only way to reach 
> this state, but it is the most reliable way we have found to reproduce it. 
> The stuck state becomes more likely if (a) data is being written to HBase 
> while the node reboots, or (b) the rebooted node also hosts the active HBase 
> master.
> Sample logs:
>  
> {code:java}
> [7745.457s][info][gc] GC(12) Pause Young (Normal) (G1 Evacuation Pause) 
> 523M->44M(818M) 12.736ms
> [10505.454s][info][gc] GC(13) Pause Young (Normal) (G1 Evacuation Pause) 
> 523M->44M(818M) 11.066ms
> 2023-04-03 11:26:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
> STUCK Region-In-Transition state=OPENING, 
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
> region=b732898573f935b72fb1876c6ff944b3
> 2023-04-03 11:27:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
> STUCK Region-In-Transition state=OPENING, 
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
> region=78be037bae2fc201707fa511e90dfbbf
> 2023-04-03 11:27:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
> STUCK Region-In-Transition state=OPENING, 
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
> region=b732898573f935b72fb1876c6ff944b3
> 2023-04-03 11:28:53,145 INFO  [master/cvp504:16000.Chore.1] master.HMaster: 
> Not running balancer (force=false, metaRIT=false) because 2 region(s) in 
> transition: [state=OPENING, 
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
> region=78be037bae2fc201707fa511e90dfbbf, state=OPENING, 
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
> region=b732898573f935b72fb1876c6ff944b3]
> 2023-04-03 11:28:53,168 WARN  [master/cvp504:16000.Chore.1] 
> janitor.CatalogJanitor: 
> unknown_server=cvp503.sjc.aristanetworks.com,16201,1680499899167/aeris_v2,\x09,1680499940070.78be037bae2fc201707fa511e90dfbbf.,
>  
> unknown_server=cvp503.sjc.aristanetworks.com,16201,1680499899167/aeris_v2,\x12,1680499940070.b732898573f935b72fb1876c6ff944b3.
> 2023-04-03 11:28:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
> STUCK Region-In-Transition state=OPENING, 
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
> region=78be037bae2fc201707fa511e90dfbbf
> 2023-04-03 11:28:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
> STUCK Region-In-Transition state=OPENING, 
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
> region=b732898573f935b72fb1876c6ff944b3
> 2023-04-03 11:29:53,209 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
> STUCK Region-In-Transition state=OPENING, 
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
> region=78be037bae2fc201707fa511e90dfbbf
> 2023-04-03 11:29:53,209 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
> STUCK Region-In-Transition state=OPENING, 
> location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
> region=b732898573f935b72fb1876c6ff944b3{code}
>  
> The stuck state also gets _fixed_ if we kill the pod running the regionserver 
> that hosts the region stuck in transition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
