[
https://issues.apache.org/jira/browse/HBASE-27711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aaron Beitch updated HBASE-27711:
---------------------------------
Description:
We see the following log message, and the regions listed are never put back
into service without manual intervention:
{code:java}
NodeC hbasemaster-0 hbasemaster 2023-02-15 14:15:56,149 WARN
[master/NodeC:16000.Chore.1] janitor.CatalogJanitor:
unknown_server=NodeA,16201,1676468874221/__test-table_NodeA__,,1672786676251.a3cac9159205d7611c85dd5c4feeded7.,
unknown_server=NodeA,16201,1676468874221/__test-table_NodeB__,,1672786676579.50e948f0a5bc962aabfe27e9ea4227a5.,
unknown_server=NodeA,16201,1676468874221/aeris_v2,,1672786736251.6ab0292cca294784bce8415cc69c30d4.,
unknown_server=NodeA,16201,1676468874221/aeris_v2,\x06,1672786736251.15d958805892370907a47f31a6e08db1.,
unknown_server=NodeA,16201,1676468874221/aeris_v2,\x12,1672786736251.ac3c78ff6903f52d9e2acf80b8436085.{code}
Normally when we see these unknown_server logs, they do get resolved by
reassigning the regions; however, we have a reproducible case where this
doesn't happen.
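For context, the manual intervention is simply reassigning the listed regions,
either with the shell's assign command or through the client Admin API. A
minimal sketch of the latter (assuming an hbase-site.xml on the classpath; the
class name is ours, and the encoded region names are copied from the log
above):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class ReassignStuckRegions {
  public static void main(String[] args) throws Exception {
    // Encoded region names taken from the unknown_server log lines above.
    String[] encodedRegions = {
      "6ab0292cca294784bce8415cc69c30d4",
      "15d958805892370907a47f31a6e08db1",
      "ac3c78ff6903f52d9e2acf80b8436085"
    };
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      for (String region : encodedRegions) {
        // Admin.assign takes the encoded region name, like the shell's
        // assign command.
        admin.assign(Bytes.toBytes(region));
      }
    }
  }
}
{code}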
When this occurs, we also see the following log messages related to the
regions:
{code:java}
NodeC hbasemaster-0 hbasemaster 2023-02-15 14:10:59,810 WARN
[RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000]
assignment.AssignmentManager: Reporting NodeC,16201,1676469549542 server does
not match state=OPEN, location=NodeA,16201,1676468874221, table=aeris_v2,
region=6ab0292cca294784bce8415cc69c30d4 (time since last update=3749ms);
closing...
NodeC hbasemaster-0 hbasemaster 2023-02-15 14:11:00,323 WARN
[RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000]
assignment.AssignmentManager: No matching procedure found for
NodeC,16201,1676469549542 transition on state=OPEN,
location=NodeA,16201,1676468874221, table=aeris_v2,
region=6ab0292cca294784bce8415cc69c30d4 to CLOSED
{code}
This suggests that the master's mapping of region to region server differs
from what the region server is reporting, so the master closes the region on
the reporting server. We would expect the region to then be assigned somewhere
else and reopened, but we are not seeing that.
This log message comes from here:
[https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java#L1292]
The next thing the code does is call AssignmentManager's closeRegionSilently
method.
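Our simplified reading of that code path, written out as a paraphrase rather
than the actual HBase source (method and variable names are approximate):
{code:java}
// Paraphrase of the branch-2.4 AssignmentManager region-report check linked
// above; approximate, not the real source.
void onRegionReport(ServerName reportingServer, RegionStateNode regionNode) {
  ServerName expected = regionNode.getRegionLocation();
  if (!reportingServer.equals(expected)) {
    // The master believes the region lives on 'expected' (NodeA in our logs),
    // while 'reportingServer' (NodeC) claims to be serving it.
    LOG.warn("Reporting {} server does not match {}; closing...",
      reportingServer, regionNode);
    // The region is closed on the reporting server without scheduling an
    // assign procedure -- consistent with the later "No matching procedure
    // found ... to CLOSED" message, after which nothing reopens the region.
    closeRegionSilently(reportingServer,
      regionNode.getRegionInfo().getRegionName());
  }
}
{code}
If that reading is right, the silent close leaves no procedure behind to drive
the region back to OPEN, which would explain why it stays in unknown_server
permanently.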
Our setup:
We have a three-server cluster running a full HBase stack: 3 ZooKeeper nodes,
an active and a standby HBase master, 3 region servers, and 3 HDFS data nodes.
For reliability testing we run a script that restarts one of the three
servers; each server hosts a region server, a ZooKeeper node, and HDFS
processes, and may also host the active or standby HBase master.
In this test we saw the issue after NodeB, which had been running the active
master, was killed at 14:08:33; the master failed over to NodeC. Then at
14:12:56 we saw a "STUCK Region-In-Transition" log for a region on NodeA (this
is another commonly reproducible issue we plan to open a separate ticket for),
and we restarted just the region server process on NodeA to get that region
reassigned.
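The restart script is essentially a loop of the following shape (a
hypothetical sketch, not the actual script; the node names match our cluster,
but the restart command and interval are placeholders):
{code:java}
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;

public class NodeRestartTest {
  private static final List<String> NODES = List.of("NodeA", "NodeB", "NodeC");

  public static void main(String[] args) throws Exception {
    Random rnd = new Random();
    while (true) {
      String node = NODES.get(rnd.nextInt(NODES.size()));
      // Restarting the host takes down its region server, ZooKeeper and HDFS
      // processes, and the HBase master if one is running there.
      new ProcessBuilder("ssh", node, "sudo", "reboot")
          .inheritIO().start().waitFor();
      // Placeholder settle time before the next restart.
      TimeUnit.MINUTES.sleep(10);
    }
  }
}
{code}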
> Regions permanently stuck in unknown_server state
> -------------------------------------------------
>
> Key: HBASE-27711
> URL: https://issues.apache.org/jira/browse/HBASE-27711
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 2.4.11
> Environment: HBase: 2.4.11
> Hadoop: 3.2.4
> ZooKeeper: 3.7.1
> Reporter: Aaron Beitch
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)