[ 
https://issues.apache.org/jira/browse/HBASE-28167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17777272#comment-17777272
 ] 

Ke Han commented on HBASE-28167:
--------------------------------

Thanks for the reply!

[~subrat.mishra] Unfortunately I didn't record the NN and DN logs properly in 
the previous buggy run. I'll try to reproduce it again with hdfs logging to 
provide more information.

> HMaster crashes due to NPE at AsyncFSWAL.closeWriter
> ----------------------------------------------------
>
>                 Key: HBASE-28167
>                 URL: https://issues.apache.org/jira/browse/HBASE-28167
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.4.17
>            Reporter: Ke Han
>            Priority: Major
>         Attachments: hbase--master-a82083cf5d18.log, persistent.tar.gz
>
>
> I am testing the upgrade process of HMaster, when starting up the new version 
> HMaster 2.4.17, it crashed immediately with the following exception.
> {code:java}
> 2023-10-17 21:03:35,892 ERROR [master/hmaster:16000:becomeActiveMaster] 
> master.HMaster: ***** ABORTING master hmaster,16000,1697576576301: Unhandled 
> exception. Starting shutdown. *****
> org.apache.hadoop.hbase.FailedCloseWALAfterInitializedErrorException: Failed 
> close after init wal failed.
>         at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:167)
>         at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:62)
>         at org.apache.hadoop.hbase.wal.WALFactory.getWAL(WALFactory.java:295)
>         at 
> org.apache.hadoop.hbase.master.region.MasterRegion.createWAL(MasterRegion.java:200)
>         at 
> org.apache.hadoop.hbase.master.region.MasterRegion.bootstrap(MasterRegion.java:220)
>         at 
> org.apache.hadoop.hbase.master.region.MasterRegion.create(MasterRegion.java:348)
>         at 
> org.apache.hadoop.hbase.master.region.MasterRegionFactory.create(MasterRegionFactory.java:104)
>         at 
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:855)
>         at 
> org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2193)
>         at 
> org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:528)
>         at java.lang.Thread.run(Thread.java:750)
> Caused by: java.io.IOException: java.lang.NullPointerException
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:979)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.close(AbstractFSWAL.java:1006)
>         at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:165)
>         ... 10 more
> Caused by: java.lang.NullPointerException
>         at 
> java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1011)
>         at 
> java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:1006)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.closeWriter(AsyncFSWAL.java:743)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doShutdown(AsyncFSWAL.java:800)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL$2.call(AbstractFSWAL.java:951)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL$2.call(AbstractFSWAL.java:946)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         ... 1 more {code}
> h1. Reproduce
> It happens non-deterministically, around 2 out of 1802 tests. It might 
> require an exception when HMaster interacts with the HDFS cluster since I 
> noticed the following warning before the NPE exception
> {code:java}
> 2023-10-17 21:03:35,857 WARN  [master/hmaster:16000:becomeActiveMaster] 
> asyncfs.FanOutOneBlockAsyncDFSOutputHelper: create fan-out dfs output 
> /hbase/MasterData/WALs/hmaster,16000,1697576576301/hmaster%2C16000%2C1697576576301.1697576615700
>  failed, retry = 0
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /hbase/MasterData/WALs/hmaster,16000,1697576576301/hmaster%2C16000%2C1697576576301.1697576615700
>  could only be replicated to 0 nodes instead of minReplication (=1).  There 
> are 1 datanode(s) running and no node(s) are excluded in this operation.
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1832)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2586)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:889)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:517)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:498)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1038)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
>  {code}
> This remote excpetion might lead to a NULL writer and it further causes NPE 
> to crash the HMaster.
> h1. Root Cause
> When invoking *inflightWALClosures.put(path.getName(), writer);* the writer 
> is still null, this leads to an NPE which is not handled by HMaster and 
> causes the crash.
> {code:java}
> hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java
>   private void closeWriter(AsyncWriter writer, Path path) {
>     inflightWALClosures.put(path.getName(), writer); // NPE
>     closeExecutor.execute(() -> {
>       try {
>         writer.close();
>       } catch (IOException e) {
>         LOG.warn("close old writer failed", e);
>       } finally {
>         // call this even if the above close fails, as there is no other 
> chance we can set closed to
>         // true, it will not cause big problems.
>         markClosedAndClean(path);
>         inflightWALClosures.remove(path.getName());
>       }
>     });
>   }{code}
> I haven't figured out the exact reason why writer is NULL. Will keep digging 
> the root cause. Any help is appreciated!
>  
> I have attached the full logs from HMaster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to