Aaron Beitch created HBASE-27711:
------------------------------------

             Summary: Regions permanently stuck in unknown_server state
                 Key: HBASE-27711
                 URL: https://issues.apache.org/jira/browse/HBASE-27711
             Project: HBase
          Issue Type: Bug
          Components: Region Assignment
    Affects Versions: 2.4.11
         Environment: HBase: 2.4.11
                      Hadoop: 3.2.4
                      ZooKeeper: 3.7.1
            Reporter: Aaron Beitch


We see the following log message, and the regions it lists are never put back 
into service without manual intervention:

{code:java}
NodeC hbasemaster-0 hbasemaster 2023-02-15 14:15:56,149 WARN  
[master/NodeC:16000.Chore.1] janitor.CatalogJanitor: 
unknown_server=NodeA,16201,1676468874221/__test-table_NodeA__,,1672786676251.a3cac9159205d7611c85dd5c4feeded7.,
unknown_server=NodeA,16201,1676468874221/__test-table_NodeB__,,1672786676579.50e948f0a5bc962aabfe27e9ea4227a5.,
unknown_server=NodeA,16201,1676468874221/aeris_v2,,1672786736251.6ab0292cca294784bce8415cc69c30d4.,
unknown_server=NodeA,16201,1676468874221/aeris_v2,\x06,1672786736251.15d958805892370907a47f31a6e08db1.,
unknown_server=NodeA,16201,1676468874221/aeris_v2,\x12,1672786736251.ac3c78ff6903f52d9e2acf80b8436085.
{code}
 

Normally, when we see these unknown_server logs, the master resolves them by 
reassigning the regions. However, we have a reproducible case where this does 
not happen.
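
For anyone triaging the same symptom, the unknown_server entries can be split 
into server, table and encoded region name mechanically. The following is a 
minimal sketch and not part of HBase: the class UnknownServerLogParser and its 
Entry type are hypothetical names, and the pattern assumes every entry looks 
like the sample in this report, that is 
unknown_server=<server>/<table>,<start key>,<region id>.<32 hex chars>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading CatalogJanitor output; not part of HBase.
final class UnknownServerLogParser {

  // One parsed unknown_server entry.
  static final class Entry {
    final String server;            // e.g. "NodeA,16201,1676468874221"
    final String table;             // e.g. "aeris_v2"
    final String encodedRegionName; // 32 hex characters

    Entry(String server, String table, String encodedRegionName) {
      this.server = server;
      this.table = table;
      this.encodedRegionName = encodedRegionName;
    }
  }

  // Assumed layout, taken from the sample log in this report:
  //   unknown_server=<server>/<table>,<startKey>,<regionId>.<encodedName>.
  private static final Pattern ENTRY = Pattern.compile(
      "unknown_server=([^/\\s]+)/([^,\\s]+),(.*?),(\\d+)\\.([0-9a-f]{32})\\.");

  // Returns every entry found in the given log text, in order of appearance.
  static List<Entry> parse(String logText) {
    List<Entry> entries = new ArrayList<>();
    Matcher m = ENTRY.matcher(logText);
    while (m.find()) {
      entries.add(new Entry(m.group(1), m.group(2), m.group(5)));
    }
    return entries;
  }
}
```

The encoded region names extracted this way are the identifiers that the two 
AssignmentManager warnings in this report print as region=....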

 

When this occurs we also see the following log messages related to the regions:

 
{code:java}
NodeC hbasemaster-0 hbasemaster 2023-02-15 14:10:59,810 WARN  
[RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000] 
assignment.AssignmentManager: Reporting NodeC,16201,1676469549542 server does 
not match state=OPEN, location=NodeA,16201,1676468874221, table=aeris_v2, 
region=6ab0292cca294784bce8415cc69c30d4 (time since last update=3749ms); 
closing…
NodeC hbasemaster-0 hbasemaster 2023-02-15 14:11:00,323 WARN  
[RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000] 
assignment.AssignmentManager: No matching procedure found for 
C,16201,1676469549542 transition on state=OPEN, 
location=NodeA,16201,1676468874221, table=aeris_v2, 
region=6ab0292cca294784bce8415cc69c30d4 to CLOSED
{code}
 

 

This suggests that the master's mapping of region to region server differs 
from what the region server reports, so the master closes the region. We would 
expect the regions to then be assigned somewhere else and reopened, but we are 
not seeing that.

 

The "Reporting ... server does not match" message comes from here: 
[https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java#L1292]

Immediately after logging it, the master calls AssignmentManager's 
closeRegionSilently method.
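
To make the suspected sequence concrete, here is a simplified model of what 
the two warnings above suggest. It is a sketch under stated assumptions and 
not the HBase source: RegionReportModel, Action and all method names are 
hypothetical, and the key assumption (that closing the region on the reporting 
server leaves the master's recorded location unchanged) is inferred from the 
second warning still showing state=OPEN, location=NodeA.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model; NOT the HBase AssignmentManager source.
final class RegionReportModel {

  enum Action { NONE, CLOSE_ON_REPORTER }

  // Encoded region name -> server the master believes hosts the region.
  private final Map<String, String> recordedLocation = new HashMap<>();

  // Records that the master considers the region OPEN on the given server.
  void recordOpen(String encodedRegion, String server) {
    recordedLocation.put(encodedRegion, server);
  }

  String locationOf(String encodedRegion) {
    return recordedLocation.get(encodedRegion);
  }

  // Called when a region server reports that it hosts a region.
  Action onRegionServerReport(String reportingServer, String encodedRegion) {
    String recorded = recordedLocation.get(encodedRegion);
    // Regions this model has never seen are simply ignored here.
    if (recorded == null || recorded.equals(reportingServer)) {
      return Action.NONE;
    }
    // Mismatch: ask the reporter to close the region. Assumption: the
    // recorded location is left untouched and no reassignment is scheduled.
    return Action.CLOSE_ON_REPORTER;
  }
}
```

In this model, a report from NodeC for a region recorded on NodeA returns 
CLOSE_ON_REPORTER while locationOf still returns the NodeA server name. If 
that NodeA server instance no longer exists, the region would stay pointed at 
an unknown server, which matches the symptom described in this report.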

 

Our setup:

We have a three-node cluster that runs a full HBase stack: 3 ZooKeeper nodes, 
an active and a standby HBase master, 3 region servers, and 3 HDFS data nodes. 
For reliability testing we run a script that restarts one of the three nodes. 
Each node runs a region server, a ZooKeeper process, and an HDFS process, and 
possibly also the active or standby HBase master.

 

In this test we saw the issue after NodeB, which had been running the active 
master, was killed at 14:08:33, so the active master role switched over to 
NodeC. Then at 14:12:56 we saw a "STUCK Region-In-Transition" log for a region 
on NodeA (this is another common, reproducible issue that we plan to open a 
ticket for), and we then restarted just the region server process on NodeA to 
get that region reassigned.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
