Aaron Beitch created HBASE-27711:
------------------------------------
Summary: Regions permanently stuck in unknown_server state
Key: HBASE-27711
URL: https://issues.apache.org/jira/browse/HBASE-27711
Project: HBase
Issue Type: Bug
Components: Region Assignment
Affects Versions: 2.4.11
Environment: HBase: 2.4.11
Hadoop: 3.2.4
ZooKeeper: 3.7.1
Reporter: Aaron Beitch
We see this log message and the regions listed are never put back into service
without manual intervention:
{code:java}
NodeC hbasemaster-0 hbasemaster 2023-02-15 14:15:56,149 WARN
[master/NodeC:16000.Chore.1] janitor.CatalogJanitor:
unknown_server=NodeA,16201,1676468874221/__test-table_NodeA__,,1672786676251.a3cac9159205d7611c85dd5c4feeded7.,
unknown_server=NodeA,16201,1676468874221/__test-table_NodeB__,,1672786676579.50e948f0a5bc962aabfe27e9ea4227a5.,
unknown_server=NodeA,16201,1676468874221/aeris_v2,,1672786736251.6ab0292cca294784bce8415cc69c30d4.,
unknown_server=NodeA,16201,1676468874221/aeris_v2,\x06,1672786736251.15d958805892370907a47f31a6e08db1.,
unknown_server=NodeA,16201,1676468874221/aeris_v2,\x12,1672786736251.ac3c78ff6903f52d9e2acf80b8436085.{code}
Normally when we see these unknown_server logs, they do get resolved by
reassigning the regions; however, we have a reproducible case where this doesn't
happen.
When this occurs we also see the following log messages related to the regions:
{code:java}
NodeC hbasemaster-0 hbasemaster 2023-02-15 14:10:59,810 WARN
[RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000]
assignment.AssignmentManager: Reporting NodeC,16201,1676469549542 server does
not match state=OPEN, location=NodeA,16201,1676468874221, table=aeris_v2,
region=6ab0292cca294784bce8415cc69c30d4 (time since last update=3749ms);
closing…
NodeC hbasemaster-0 hbasemaster 2023-02-15 14:11:00,323 WARN
[RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000]
assignment.AssignmentManager: No matching procedure found for
C,16201,1676469549542 transition on state=OPEN,
location=NodeA,16201,1676468874221, table=aeris_v2,
region=6ab0292cca294784bce8415cc69c30d4 to CLOSED
{code}
This suggests that the master's mapping of region to region server differs from
what the reporting server believes, so the master closes the region. We would
expect the regions to then be assigned elsewhere and reopened, but that is not
happening.
This log message comes from here:
[https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java#L1292]
The code then calls AssignmentManager's closeRegionServerSilently method.
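As a rough illustration of the failure mode (not the actual HBase 2.4 code: the class, field, and method names below are simplified placeholders), the check boils down to comparing the server reporting a transition against the location the master has recorded for the region. When they differ, the region is closed on the reporting server, and if no procedure then reassigns it, it stays out of service:

{code:java}
import java.util.Objects;

public class ServerMismatchSketch {
  // Simplified stand-in for the master's per-region state record.
  static class RegionStateNode {
    final String encodedName;
    final String state;    // e.g. "OPEN"
    final String location; // server the master believes hosts the region

    RegionStateNode(String encodedName, String state, String location) {
      this.encodedName = encodedName;
      this.state = state;
      this.location = location;
    }
  }

  /**
   * True when the reporting server differs from the recorded location --
   * the case where the master logs "server does not match ... closing"
   * and asks the reporting server to close the region silently.
   */
  static boolean shouldCloseSilently(RegionStateNode node, String reportingServer) {
    return !Objects.equals(node.location, reportingServer);
  }

  public static void main(String[] args) {
    // Mirrors the log above: state=OPEN, location=NodeA,... but NodeC reports.
    RegionStateNode node = new RegionStateNode(
        "6ab0292cca294784bce8415cc69c30d4", "OPEN", "NodeA,16201,1676468874221");
    String reportingServer = "NodeC,16201,1676469549542";

    if (shouldCloseSilently(node, reportingServer)) {
      System.out.println("server does not match; closing region "
          + node.encodedName + " on " + reportingServer);
      // In the behavior we observe, nothing reassigns the region afterward,
      // leaving it permanently in unknown_server.
    }
  }
}
{code}

The sketch only shows why the close fires; the open question in this ticket is why no reassignment follows it.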
Our setup:
We have a three-node cluster that runs a full HBase stack: 3 ZooKeeper nodes,
an active and a standby HBase master, 3 region servers, and 3 HDFS data nodes.
For reliability testing we run a script that restarts one of the three nodes;
each node runs a region server, a ZooKeeper process, and an HDFS process, and
possibly also the active or standby HBase master.
In this test we saw the issue after NodeB, which had been running the active
master, was killed at 14:08:33, so the master failed over to NodeC. Then at
14:12:56 we saw a "STUCK Region-In-Transition" log for a region on NodeA
(another commonly reproducible issue, for which we plan to open a separate
ticket) and then restarted just the region server process on NodeA to get that
region reassigned.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)