[
https://issues.apache.org/jira/browse/HBASE-27711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aaron Beitch updated HBASE-27711:
---------------------------------
Attachment: config.txt
> Regions permanently stuck in unknown_server state
> -------------------------------------------------
>
> Key: HBASE-27711
> URL: https://issues.apache.org/jira/browse/HBASE-27711
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 2.4.11
> Environment: HBase: 2.4.11
> Hadoop: 3.2.4
> ZooKeeper: 3.7.1
> Reporter: Aaron Beitch
> Priority: Major
> Attachments: config.txt
>
>
> We see this log message and the regions listed are never put back into
> service without manual intervention:
> {code:java}
> NodeC hbasemaster-0 hbasemaster 2023-02-15 14:15:56,149 WARN
> [master/NodeC:16000.Chore.1] janitor.CatalogJanitor:
> unknown_server=NodeA,16201,1676468874221/__test-table_NodeA__,,1672786676251.a3cac9159205d7611c85dd5c4feeded7.,
>
> unknown_server=NodeA,16201,1676468874221/__test-table_NodeB__,,1672786676579.50e948f0a5bc962aabfe27e9ea4227a5.,
>
> unknown_server=NodeA,16201,1676468874221/aeris_v2,,1672786736251.6ab0292cca294784bce8415cc69c30d4.,
>
> unknown_server=NodeA,16201,1676468874221/aeris_v2,\x06,1672786736251.15d958805892370907a47f31a6e08db1.,
>
> unknown_server=NodeA,16201,1676468874221/aeris_v2,\x12,1672786736251.ac3c78ff6903f52d9e2acf80b8436085.{code}
>
> Normally when we see these unknown_server logs, they are resolved by the
> regions being reassigned. However, we have a reproducible case where this
> doesn't happen.
> When this occurs we also see the following log messages related to the
> regions:
> {code:java}
> NodeC hbasemaster-0 hbasemaster 2023-02-15 14:10:59,810 WARN
> [RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000]
> assignment.AssignmentManager: Reporting NodeC,16201,1676469549542 server does
> not match state=OPEN, location=NodeA,16201,1676468874221, table=aeris_v2,
> region=6ab0292cca294784bce8415cc69c30d4 (time since last update=3749ms);
> closing…
> NodeC hbasemaster-0 hbasemaster 2023-02-15 14:11:00,323 WARN
> [RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000]
> assignment.AssignmentManager: No matching procedure found for
> C,16201,1676469549542 transition on state=OPEN,
> location=NodeA,16201,1676468874221, table=aeris_v2,
> region=6ab0292cca294784bce8415cc69c30d4 to CLOSED
> {code}
>
> This suggests that the master has a different region-to-region-server
> mapping than expected, so it closes the region. We would expect the regions
> to be assigned somewhere else and then reopened, but we are not seeing
> that.
> This log message comes from here:
> [https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java#L1292]
> The next step after this log message is a call to AssignmentManager's
> closeRegionSilently method.
> Our setup:
> We have a three-server cluster that runs a full HBase stack: 3 ZooKeeper
> nodes, an active and a standby HBase master, 3 region servers, and 3 HDFS
> data nodes. For reliability testing we run a script that restarts one of
> the three servers. Each server runs a region server, a ZooKeeper process
> and an HDFS process, and possibly also the active or standby HBase master.
> In this test we saw the issue after NodeB, which had been running the
> active master, was killed at 14:08:33, so the active master role switched
> over to NodeC. Then at 14:12:56 we saw a "STUCK Region-In-Transition" log
> for a region on NodeA (this is another common, reproducible issue we plan
> to open a ticket for), and we then restarted just the region server process
> on NodeA to get that region reassigned.
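
For anyone triaging the CatalogJanitor warning quoted above, here is a minimal sketch that pulls the affected regions out of the log text and groups them by the server reported as unknown. It is not part of HBase. The function name `unknown_server_regions` and the regular expression are illustrative, and the entry layout (`<host>,<port>,<startcode>/<table>,<start key>,<region id>.<encoded name>.`) is assumed from the sample lines in this report only. Start keys that contain commas or whitespace would not match.

```python
import re
from collections import defaultdict

# One "unknown_server=" entry as it appears in the warning quoted in this
# report. The layout is an assumption based on those sample lines:
#   unknown_server=<host>,<port>,<startcode>/<table>,<start key>,<id>.<encoded>.
ENTRY = re.compile(
    r"unknown_server=(?P<server>[^,\s]+,\d+,\d+)/"
    r"(?P<table>[^,\s]+),(?P<start_key>[^,\s]*),\d+\.(?P<encoded>[0-9a-f]{32})\."
)


def unknown_server_regions(log_text):
    """Return {server name: [(table, encoded region name), ...]} for every
    unknown_server entry found in log_text (hypothetical helper)."""
    by_server = defaultdict(list)
    for match in ENTRY.finditer(log_text):
        by_server[match.group("server")].append(
            (match.group("table"), match.group("encoded"))
        )
    return dict(by_server)


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < hbase-master.log
    for server, regions in unknown_server_regions(sys.stdin.read()).items():
        print(server)
        for table, encoded in regions:
            print(f"  {table}  {encoded}")
```

The encoded region names are the 32-character hex strings at the end of each entry. As far as I know, these are the identifiers that manual reassignment tools such as HBCK2's `assigns` command take as arguments.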