[ https://issues.apache.org/jira/browse/HBASE-29802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050412#comment-18050412 ]

Duo Zhang commented on HBASE-29802:
-----------------------------------

Ah, I found the root cause when debugging the failure of 
TestRegionServerReportForDuty in HBASE-29144...

The problem is that, in SingleProcessHBaseCluster, we have a 
SingleFileSystemShutdownThread hook: when a region server exits, we invoke 
this hook, it closes the file system instance, and closing the file system in 
turn closes all the files that were opened through it.

But within the same process we share the same file system instance across 
different services, such as the master and the region servers, so stopping a 
region server causes the master's file system to be closed as well. Then, when 
closing MasterRegion with abort = false, we still try to close gracefully and 
write to a WAL file which has already been closed, and that causes the 
trouble...
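The sharing comes from the FileSystem cache. A small sketch of the idea, 
assuming the default cache is enabled (i.e. fs.&lt;scheme&gt;.impl.disable.cache is 
not set): FileSystem.get() returns the same cached object for the same 
scheme/authority/user, so the "master" and the "region server" in one JVM end 
up holding the same instance.

{code:java}
// Sketch only; the variable names are purely illustrative.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SharedFileSystemSketch {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    FileSystem masterFs = FileSystem.get(conf);       // what the master would use
    FileSystem regionServerFs = FileSystem.get(conf); // what a region server would use

    // Same cached instance, not just an equal configuration.
    System.out.println("shared instance: " + (masterFs == regionServerFs));

    // Closing it on behalf of the region server closes it for the master too,
    // so a later graceful WAL write by the master hits a closed stream.
    regionServerFs.close();
  }
}
{code}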

Let me check the history of why we need this shutdown hook; it does not seem 
very useful, because by the time we get here the region server's main loop has 
already quit, so we do not need to close the file system to interrupt any 
in-flight operations...

> NPE when shutting down mini cluster causes tests to hang
> --------------------------------------------------------
>
>                 Key: HBASE-29802
>                 URL: https://issues.apache.org/jira/browse/HBASE-29802
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce, test
>            Reporter: Duo Zhang
>            Priority: Major
>
> Found this in a timeout run for TestSecureExportSnapshot
> {noformat}
> 2026-01-02T22:53:58,988 INFO  [RegionServerTracker-0] master.HMaster(3360): 
> ***** STOPPING master '59fad5da8b71,38027,1767393931926' *****
> 2026-01-02T22:53:58,988 INFO  [RegionServerTracker-0] master.HMaster(3362): 
> STOPPED: Cluster shutdown set; onlineServer=0
> 2026-01-02T22:53:58,988 INFO  [M:0;59fad5da8b71:38027] 
> hbase.HBaseServerBase(457): Close async cluster connection
> 2026-01-02T22:53:58,988 INFO  [M:0;59fad5da8b71:38027] 
> hbase.HBaseServerBase(440): Shutdown chores and chore service
> 2026-01-02T22:53:58,988 DEBUG [M:0;59fad5da8b71:38027] 
> cleaner.LogCleaner(212): Cancelling LogCleaner
> 2026-01-02T22:53:58,989 DEBUG [M:0;59fad5da8b71:38027] 
> cleaner.HFileCleaner(335): Stopping file delete threads
> 2026-01-02T22:53:58,989 WARN  [OldWALsCleaner-0] cleaner.LogCleaner(200): 
> Interrupted while cleaning old WALs, will try to clean it next round. Exiting.
> 2026-01-02T22:53:58,989 DEBUG 
> [master/59fad5da8b71:0:becomeActiveMaster-HFileCleaner.large.0-1767393935168] 
> cleaner.HFileCleaner(306): Exit 
> Thread[master/59fad5da8b71:0:becomeActiveMaster-HFileCleaner.large.0-1767393935168,5,FailOnTimeoutGroup]
> 2026-01-02T22:53:58,989 DEBUG 
> [master/59fad5da8b71:0:becomeActiveMaster-HFileCleaner.small.0-1767393935174] 
> cleaner.HFileCleaner(306): Exit 
> Thread[master/59fad5da8b71:0:becomeActiveMaster-HFileCleaner.small.0-1767393935174,5,FailOnTimeoutGroup]
> 2026-01-02T22:53:58,989 INFO  [M:0;59fad5da8b71:38027] 
> hbase.ChoreService(370): Chore service for: master/59fad5da8b71:0 had 
> [ScheduledChore name=FlushedSequenceIdFlusher, period=10800000, 
> unit=MILLISECONDS] on shutdown
> 2026-01-02T22:53:58,989 INFO  [M:0;59fad5da8b71:38027] 
> hbase.HBaseServerBase(450): Shutdown executor service
> 2026-01-02T22:53:58,989 DEBUG [M:0;59fad5da8b71:38027] master.HMaster(1813): 
> Stopping service threads
> 2026-01-02T22:53:58,989 INFO  [M:0;59fad5da8b71:38027] 
> procedure2.RemoteProcedureDispatcher(119): Stopping procedure remote 
> dispatcher
> 2026-01-02T22:53:58,989 INFO  [M:0;59fad5da8b71:38027] 
> procedure2.ProcedureExecutor(723): Stopping
> 2026-01-02T22:53:58,990 INFO  [M:0;59fad5da8b71:38027] 
> region.RegionProcedureStore(110): Stopping the Region Procedure Store, 
> isAbort=false
> 2026-01-02T22:53:58,991 DEBUG [Time-limited test-EventThread] 
> zookeeper.ZKWatcher(609): master:38027-0x100734ba5eb0000, 
> quorum=127.0.0.1:59100, baseZNode=/hbase Received ZooKeeper Event, 
> type=NodeDeleted, state=SyncConnected, path=/hbase/master
> 2026-01-02T22:53:58,992 DEBUG [Time-limited test-EventThread] 
> zookeeper.ZKWatcher(609): master:38027-0x100734ba5eb0000, 
> quorum=127.0.0.1:59100, baseZNode=/hbase Received ZooKeeper Event, 
> type=NodeChildrenChanged, state=SyncConnected, path=/hbase
> 2026-01-02T22:53:58,992 DEBUG [M:0;59fad5da8b71:38027] zookeeper.ZKUtil(347): 
> master:38027-0x100734ba5eb0000, quorum=127.0.0.1:59100, baseZNode=/hbase 
> Unable to get data of znode /hbase/master because node does not exist (not an 
> error)
> 2026-01-02T22:53:58,992 WARN  [M:0;59fad5da8b71:38027] 
> master.ActiveMasterManager(344): Failed get of master address: 
> java.io.IOException: Can't get master address from ZooKeeper; znode data == 
> null
> 2026-01-02T22:53:58,993 INFO  [M:0;59fad5da8b71:38027] 
> master.ServerManager(1148): Writing .lastflushedseqids file at: 
> hdfs://localhost:44733/user/jenkins/test-data/eb9fcf01-d40e-546d-7f81-2198b3d62cff/.lastflushedseqids
> 2026-01-02T22:53:58,994 DEBUG [normalizer-worker-0] 
> normalizer.RegionNormalizerWorker(193): interrupt detected. terminating.
> 2026-01-02T22:53:59,019 INFO  [Block report processor] 
> blockmanagement.BlockManager(3777): BLOCK* addStoredBlock: 127.0.0.1:37189 is 
> added to blk_1073742422_1598 (size=307)
> 2026-01-02T22:53:59,019 INFO  [Block report processor] 
> blockmanagement.BlockManager(3777): BLOCK* addStoredBlock: 127.0.0.1:38631 is 
> added to blk_1073742422_1598 (size=307)
> 2026-01-02T22:53:59,020 INFO  [Block report processor] 
> blockmanagement.BlockManager(3777): BLOCK* addStoredBlock: 127.0.0.1:40115 is 
> added to blk_1073742422_1598 (size=307)
> 2026-01-02T22:53:59,020 INFO  [M:0;59fad5da8b71:38027] 
> assignment.AssignmentManager(409): Stopping assignment manager
> 2026-01-02T22:53:59,021 INFO  [M:0;59fad5da8b71:38027] 
> region.MasterRegion(264): Closing local region {ENCODED => 
> 1595e783b53d99cd5eef43b6debb2682, NAME => 
> 'master:store,,1.1595e783b53d99cd5eef43b6debb2682.', STARTKEY => '', ENDKEY 
> => ''}, isAbort=false
> 2026-01-02T22:53:59,021 DEBUG [M:0;59fad5da8b71:38027] 
> regionserver.HRegion(1725): Closing 1595e783b53d99cd5eef43b6debb2682, 
> disabling compactions & flushes
> 2026-01-02T22:53:59,042 INFO  [M:0;59fad5da8b71:38027] 
> regionserver.HRegion(1758): Closing region 
> master:store,,1.1595e783b53d99cd5eef43b6debb2682.
> 2026-01-02T22:53:59,042 DEBUG [M:0;59fad5da8b71:38027] 
> regionserver.HRegion(1779): Time limited wait for close lock on 
> master:store,,1.1595e783b53d99cd5eef43b6debb2682.
> 2026-01-02T22:53:59,042 DEBUG [M:0;59fad5da8b71:38027] 
> regionserver.HRegion(1846): Acquired close lock on 
> master:store,,1.1595e783b53d99cd5eef43b6debb2682. after waiting 0 ms
> 2026-01-02T22:53:59,043 DEBUG [M:0;59fad5da8b71:38027] 
> regionserver.HRegion(1856): Updates disabled for region 
> master:store,,1.1595e783b53d99cd5eef43b6debb2682.
> 2026-01-02T22:53:59,043 INFO  [M:0;59fad5da8b71:38027] 
> regionserver.HRegion(2969): Flushing 1595e783b53d99cd5eef43b6debb2682 4/4 
> column families, dataSize=1013.53 KB heapSize=1.19 MB
> 2026-01-02T22:53:59,043 ERROR 
> [AsyncFSWAL-0-hdfs://localhost:44733/user/jenkins/test-data/eb9fcf01-d40e-546d-7f81-2198b3d62cff/MasterData-prefix:59fad5da8b71,38027,1767393931926]
>  server.NIOServerCnxnFactory(85): Thread 
> Thread[AsyncFSWAL-0-hdfs://localhost:44733/user/jenkins/test-data/eb9fcf01-d40e-546d-7f81-2198b3d62cff/MasterData-prefix:59fad5da8b71,38027,1767393931926,5,FailOnTimeoutGroup]
>  died
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.hbase.thirdparty.io.netty.buffer.ByteBuf.readableBytes()" because 
> "this.buf" is null
>       at 
> org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.buffered(FanOutOneBlockAsyncDFSOutput.java:414)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.append(AsyncProtobufLogWriter.java:134)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doAppend(AsyncFSWAL.java:181)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doAppend(AsyncFSWAL.java:100)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.appendEntry(AbstractFSWAL.java:1333)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.appendAndSync(AbstractFSWAL.java:1724)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.consume(AbstractFSWAL.java:1832)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>       at java.base/java.lang.Thread.run(Thread.java:840)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
