[ https://issues.apache.org/jira/browse/HBASE-27773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grigore Lupescu updated HBASE-27773:
------------------------------------
    Description: 
One problem we encounter with some regularity is a region getting stuck in 
transition: `STUCK Region-In-Transition state=OPENING`.

We have a three-server cluster that runs a full HBase stack: 3 ZooKeeper nodes, 
an active and a standby HBase master, 3 region servers, and 3 HDFS data nodes.

We've managed to reproduce the stuck region-in-transition state by rebooting 
one of the 3 nodes at random. This is not necessarily the only way a cluster 
may end up in this state, but it is the most deterministic way we found to 
reproduce it. The chances of reaching the stuck state increase if (a) data is 
being written to HBase while the node reboots, and (b) the rebooted node is 
also the active HBase master.
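While reproducing, the stuck state can be observed from the HBase 2.x shell; a minimal sketch (run against a live cluster, so output will vary):

```shell
# List regions currently in transition (HBase 2.x shell command)
echo "rit" | hbase shell -n

# Full cluster status, which also reports regions in transition
echo "status 'detailed'" | hbase shell -n
```

In our reproduction the `rit` output keeps showing the same encoded region names in state=OPENING across successive runs, matching the ProcExecTimeout warnings below.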

Sample logs:

 
{code:java}
[7745.457s][info][gc] GC(12) Pause Young (Normal) (G1 Evacuation Pause) 
523M->44M(818M) 12.736ms
[10505.454s][info][gc] GC(13) Pause Young (Normal) (G1 Evacuation Pause) 
523M->44M(818M) 11.066ms
2023-04-03 11:26:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=b732898573f935b72fb1876c6ff944b3
2023-04-03 11:27:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=78be037bae2fc201707fa511e90dfbbf
2023-04-03 11:27:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=b732898573f935b72fb1876c6ff944b3
2023-04-03 11:28:53,145 INFO  [master/cvp504:16000.Chore.1] master.HMaster: Not 
running balancer (force=false, metaRIT=false) because 2 region(s) in 
transition: [state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=78be037bae2fc201707fa511e90dfbbf, state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=b732898573f935b72fb1876c6ff944b3]
2023-04-03 11:28:53,168 WARN  [master/cvp504:16000.Chore.1] 
janitor.CatalogJanitor: 
unknown_server=cvp503.sjc.aristanetworks.com,16201,1680499899167/aeris_v2,\x09,1680499940070.78be037bae2fc201707fa511e90dfbbf.,
 
unknown_server=cvp503.sjc.aristanetworks.com,16201,1680499899167/aeris_v2,\x12,1680499940070.b732898573f935b72fb1876c6ff944b3.
2023-04-03 11:28:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=78be037bae2fc201707fa511e90dfbbf
2023-04-03 11:28:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=b732898573f935b72fb1876c6ff944b3
2023-04-03 11:29:53,209 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=78be037bae2fc201707fa511e90dfbbf
2023-04-03 11:29:53,209 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=b732898573f935b72fb1876c6ff944b3{code}
 

The stuck state also gets _fixed_ if we kill the pod running the region server 
that hosts the region stuck in transition.
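A sketch of the workaround, assuming a Kubernetes deployment; the pod name and the HBCK2 jar path are placeholders for our setup, not part of a stock HBase install:

```shell
# 1. Kill the pod running the region server that hosts the stuck region
#    (placeholder pod name; substitute the actual pod for cvp504's region server)
kubectl delete pod regionserver-cvp504

# 2. Alternatively, re-drive the assignment with the HBCK2 tool
#    (jar path is a placeholder; the encoded region name is taken from the logs above)
hbase hbck -j hbase-hbck2.jar assigns 78be037bae2fc201707fa511e90dfbbf
```

Killing the pod forces a fresh ServerCrashProcedure, which re-queues the region's assignment; HBCK2 `assigns` achieves a similar reassignment without restarting the process.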

  was:
One problem we see customers encounter in the field with some regularity is the 
`STUCK Region-In-Transition state=OPENING`.

We have a three server cluster that runs a full HBASE stack: 3 zookeeper nodes, 
an HBASE master active and standby, 3 region servers, 3 HDFS data nodes.

We've managed to reproduce the stuck region in transition state, by rebooting 
randomly one of the 3 nodes. This is not necessarily the only way customers may 
end up in this state, rather a deterministic way we managed to reproduce it to 
a certain extent. Also (a) writing data to hbase while the node reboot happens 
increases the chances of the stuck state being reached as well as (b) if the 
rebooted node is also the active hbasemaster.

Sample logs:

 
{code:java}
[7745.457s][info][gc] GC(12) Pause Young (Normal) (G1 Evacuation Pause) 
523M->44M(818M) 12.736ms
[10505.454s][info][gc] GC(13) Pause Young (Normal) (G1 Evacuation Pause) 
523M->44M(818M) 11.066ms
2023-04-03 11:26:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=b732898573f935b72fb1876c6ff944b3
2023-04-03 11:27:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=78be037bae2fc201707fa511e90dfbbf
2023-04-03 11:27:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=b732898573f935b72fb1876c6ff944b3
2023-04-03 11:28:53,145 INFO  [master/cvp504:16000.Chore.1] master.HMaster: Not 
running balancer (force=false, metaRIT=false) because 2 region(s) in 
transition: [state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=78be037bae2fc201707fa511e90dfbbf, state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=b732898573f935b72fb1876c6ff944b3]
2023-04-03 11:28:53,168 WARN  [master/cvp504:16000.Chore.1] 
janitor.CatalogJanitor: 
unknown_server=cvp503.sjc.aristanetworks.com,16201,1680499899167/aeris_v2,\x09,1680499940070.78be037bae2fc201707fa511e90dfbbf.,
 
unknown_server=cvp503.sjc.aristanetworks.com,16201,1680499899167/aeris_v2,\x12,1680499940070.b732898573f935b72fb1876c6ff944b3.
2023-04-03 11:28:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=78be037bae2fc201707fa511e90dfbbf
2023-04-03 11:28:53,208 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=b732898573f935b72fb1876c6ff944b3
2023-04-03 11:29:53,209 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=78be037bae2fc201707fa511e90dfbbf
2023-04-03 11:29:53,209 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition state=OPENING, 
location=cvp504.sjc.aristanetworks.com,16201,1680509017771, table=aeris_v2, 
region=b732898573f935b72fb1876c6ff944b3{code}
 

The stuck state also gets _fixed_ if we kill the pod with the regionserver 
which has the region with stuck in transition.


> STUCK Region-In-Transition state
> --------------------------------
>
>                 Key: HBASE-27773
>                 URL: https://issues.apache.org/jira/browse/HBASE-27773
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 2.4.11
>         Environment: HBase: 2.4.11
> Hadoop: 3.2.4
> ZooKeeper: 3.7.1
>            Reporter: Grigore Lupescu
>            Priority: Major
>         Attachments: config.txt
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
