[ 
https://issues.apache.org/jira/browse/HBASE-27711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Beitch updated HBASE-27711:
---------------------------------
    Environment: 
HBase: 2.4.11
Hadoop: 3.2.4
ZooKeeper: 3.7.1

  was:
HBase: 2.4.11

Hadoop: 3.2.4

ZooKeeper: 3.7.1


> Regions permanently stuck in unknown_server state
> -------------------------------------------------
>
>                 Key: HBASE-27711
>                 URL: https://issues.apache.org/jira/browse/HBASE-27711
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.4.11
>         Environment: HBase: 2.4.11
> Hadoop: 3.2.4
> ZooKeeper: 3.7.1
>            Reporter: Aaron Beitch
>            Priority: Major
>
> We see this log message and the regions listed are never put back into 
> service without manual intervention:
> {code:java}
> NodeC hbasemaster-0 hbasemaster 2023-02-15 14:15:56,149 WARN  
> [master/NodeC:16000.Chore.1] janitor.CatalogJanitor: 
> unknown_server=NodeA,16201,1676468874221/__test-table_NodeA__,,1672786676251.a3cac9159205d7611c85dd5c4feeded7.,
>  
> unknown_server=NodeA,16201,1676468874221/__test-table_NodeB__,,1672786676579.50e948f0a5bc962aabfe27e9ea4227a5.,
>  
> unknown_server=NodeA,16201,1676468874221/aeris_v2,,1672786736251.6ab0292cca294784bce8415cc69c30d4.,
>  
> unknown_server=NodeA,16201,1676468874221/aeris_v2,\x06,1672786736251.15d958805892370907a47f31a6e08db1.,
>  
> unknown_server=NodeA,16201,1676468874221/aeris_v2,\x12,1672786736251.ac3c78ff6903f52d9e2acf80b8436085.{code}
>  
> Normally when we see these unknown_server logs, they are resolved by the 
> regions being reassigned. However, we have a reproducible case where this 
> doesn't happen. 
> When this occurs we also see the following log messages related to the 
> regions:
> {code:java}
> NodeC hbasemaster-0 hbasemaster 2023-02-15 14:10:59,810 WARN  
> [RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000] 
> assignment.AssignmentManager: Reporting NodeC,16201,1676469549542 server does 
> not match state=OPEN, location=NodeA,16201,1676468874221, table=aeris_v2, 
> region=6ab0292cca294784bce8415cc69c30d4 (time since last update=3749ms); 
> closing…
> NodeC hbasemaster-0 hbasemaster 2023-02-15 14:11:00,323 WARN  
> [RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000] 
> assignment.AssignmentManager: No matching procedure found for 
> C,16201,1676469549542 transition on state=OPEN, 
> location=NodeA,16201,1676468874221, table=aeris_v2, 
> region=6ab0292cca294784bce8415cc69c30d4 to CLOSED
> {code}
>  
> This suggests that the master's region-to-region-server mapping differs 
> from what the reporting region server has, so the master closes the region. 
> We would expect the regions to be assigned somewhere else and then reopened, 
> but we are not seeing that.
> This log message comes from here: 
> [https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java#L1292]
> The next thing that is done is a call to AssignmentManager's 
> closeRegionSilently method.
> Our setup:
> We have a three-server cluster that runs a full HBase stack: 3 ZooKeeper 
> nodes, an active and a standby HBase master, 3 region servers, and 3 HDFS 
> data nodes. For reliability testing we run a script that restarts one of 
> the three nodes. Each node runs a region server, a ZooKeeper process, and 
> an HDFS process, and possibly also the active or standby HBase master.
> In this test we saw the issue after NodeB, which had been running the 
> active master, was killed at 14:08:33, so the master role switched over to 
> NodeC. Then at 14:12:56 we saw a "STUCK Region-In-Transition" log for a 
> region on NodeA (another common, reproducible issue that we plan to open a 
> ticket for), and we restarted just the region server process on NodeA to 
> get that region reassigned.
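
The regions named in a CatalogJanitor warning like the one quoted above can be extracted mechanically before any manual intervention. The following is a minimal Python sketch and is not part of the original report. The regular expression assumes the entry layout seen in that log excerpt (unknown_server=<server name>/<table>,<start key>,<timestamp>.<encoded name>.) and that wrapped log lines have already been rejoined. The names parse_unknown_servers and SAMPLE are hypothetical, introduced only for illustration.

```python
import re
from collections import defaultdict

# One entry, as it appears in the quoted CatalogJanitor warning:
#   unknown_server=<host>,<port>,<startcode>/<table>,<start key>,<ts>.<encoded>.
# This layout is inferred from the log excerpt, not from HBase documentation.
ENTRY = re.compile(
    r"unknown_server=(?P<server>[^/\s]+)/"
    r"(?P<table>[^,]+),(?P<start_key>.*?),(?P<ts>\d+)\."
    r"(?P<encoded>[0-9a-f]{32})\."
)


def parse_unknown_servers(log_text):
    """Group (table, encoded region name) pairs by the recorded server name."""
    by_server = defaultdict(list)
    for match in ENTRY.finditer(log_text):
        by_server[match.group("server")].append(
            (match.group("table"), match.group("encoded"))
        )
    return dict(by_server)


# Three entries copied from the warning in the report, joined onto one line.
SAMPLE = (
    r"unknown_server=NodeA,16201,1676468874221/aeris_v2,,"
    r"1672786736251.6ab0292cca294784bce8415cc69c30d4., "
    r"unknown_server=NodeA,16201,1676468874221/aeris_v2,\x06,"
    r"1672786736251.15d958805892370907a47f31a6e08db1., "
    r"unknown_server=NodeA,16201,1676468874221/aeris_v2,\x12,"
    r"1672786736251.ac3c78ff6903f52d9e2acf80b8436085."
)

if __name__ == "__main__":
    for server, regions in parse_unknown_servers(SAMPLE).items():
        print(server)
        for table, encoded in regions:
            print("  ", table, encoded)
```

The encoded region names are what an operator would typically hand to a manual reassignment tool. The report does not say which tool was used.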



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
