[ https://issues.apache.org/jira/browse/HBASE-28167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17777057#comment-17777057 ]

Duo Zhang commented on HBASE-28167:
-----------------------------------

I guess the problem is here

https://github.com/apache/hbase/blob/abdc8cfca58cddf1cf18d0ef96db97e3b3345f0a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L800

In the doShutdown method, we call closeWriter(this.writer, getOldPath()) 
directly without testing whether the writer is null. It is possible that we 
fail to create the WAL writer when initializing, and then when invoking 
WAL.close, the writer is null and we hit the NPE.

I think we can fix the NPE by adding a null check, as we have already done on 
the master branch. But that does not fully solve the problem here, as the 
master will eventually crash because of another exception...
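A minimal, self-contained sketch of the proposed guard. The names below are illustrative stand-ins, not HBase's actual internals: the real AsyncWriter is modeled here as a plain AutoCloseable, and the real fix lives in AsyncFSWAL.doShutdown/closeWriter.

```java
import java.util.concurrent.ConcurrentHashMap;

// Standalone sketch of the proposed null guard (illustrative names only).
public class CloseWriterSketch {

  // Stands in for AsyncFSWAL.inflightWALClosures.
  static final ConcurrentHashMap<String, AutoCloseable> inflightWALClosures =
      new ConcurrentHashMap<>();

  static void closeWriter(AutoCloseable writer, String pathName) {
    if (writer == null) {
      // WAL init failed before a writer was created; nothing to close, and
      // putting null into a ConcurrentHashMap would itself throw NPE.
      return;
    }
    inflightWALClosures.put(pathName, writer);
    try {
      writer.close();
    } catch (Exception e) {
      System.err.println("close old writer failed: " + e);
    } finally {
      inflightWALClosures.remove(pathName);
    }
  }

  public static void main(String[] args) {
    closeWriter(null, "wal-1");     // previously this path threw an NPE
    closeWriter(() -> { }, "wal-2"); // a normal close still runs
    System.out.println("done without NPE");
  }
}
```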

Thanks.

> HMaster crashes due to NPE at AsyncFSWAL.closeWriter
> ----------------------------------------------------
>
>                 Key: HBASE-28167
>                 URL: https://issues.apache.org/jira/browse/HBASE-28167
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.4.17
>            Reporter: Ke Han
>            Priority: Major
>         Attachments: hbase--master-a82083cf5d18.log, persistent.tar.gz
>
>
> I am testing the upgrade process of HMaster. When starting up the new 
> version HMaster 2.4.17, it crashed immediately with the following exception.
> {code:java}
> 2023-10-17 21:03:35,892 ERROR [master/hmaster:16000:becomeActiveMaster] 
> master.HMaster: ***** ABORTING master hmaster,16000,1697576576301: Unhandled 
> exception. Starting shutdown. *****
> org.apache.hadoop.hbase.FailedCloseWALAfterInitializedErrorException: Failed 
> close after init wal failed.
>         at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:167)
>         at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:62)
>         at org.apache.hadoop.hbase.wal.WALFactory.getWAL(WALFactory.java:295)
>         at 
> org.apache.hadoop.hbase.master.region.MasterRegion.createWAL(MasterRegion.java:200)
>         at 
> org.apache.hadoop.hbase.master.region.MasterRegion.bootstrap(MasterRegion.java:220)
>         at 
> org.apache.hadoop.hbase.master.region.MasterRegion.create(MasterRegion.java:348)
>         at 
> org.apache.hadoop.hbase.master.region.MasterRegionFactory.create(MasterRegionFactory.java:104)
>         at 
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:855)
>         at 
> org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2193)
>         at 
> org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:528)
>         at java.lang.Thread.run(Thread.java:750)
> Caused by: java.io.IOException: java.lang.NullPointerException
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:979)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.close(AbstractFSWAL.java:1006)
>         at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:165)
>         ... 10 more
> Caused by: java.lang.NullPointerException
>         at 
> java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1011)
>         at 
> java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:1006)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.closeWriter(AsyncFSWAL.java:743)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doShutdown(AsyncFSWAL.java:800)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL$2.call(AbstractFSWAL.java:951)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL$2.call(AbstractFSWAL.java:946)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         ... 1 more {code}
> h1. Reproduce
> It happens non-deterministically, in about 2 out of 1802 test runs. It might 
> require an exception when HMaster interacts with the HDFS cluster, since I 
> noticed the following warning before the NPE:
> {code:java}
> 2023-10-17 21:03:35,857 WARN  [master/hmaster:16000:becomeActiveMaster] 
> asyncfs.FanOutOneBlockAsyncDFSOutputHelper: create fan-out dfs output 
> /hbase/MasterData/WALs/hmaster,16000,1697576576301/hmaster%2C16000%2C1697576576301.1697576615700
>  failed, retry = 0
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /hbase/MasterData/WALs/hmaster,16000,1697576576301/hmaster%2C16000%2C1697576576301.1697576615700
>  could only be replicated to 0 nodes instead of minReplication (=1).  There 
> are 1 datanode(s) running and no node(s) are excluded in this operation.
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1832)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2586)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:889)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:517)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:498)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1038)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
>  {code}
> This remote exception might leave the writer null, which then causes the NPE 
> that crashes the HMaster.
> h1. Root Cause
> When invoking *inflightWALClosures.put(path.getName(), writer);* the writer 
> is still null; this leads to an NPE which is not handled by HMaster and 
> causes the crash.
> {code:java}
> hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java
>   private void closeWriter(AsyncWriter writer, Path path) {
>     inflightWALClosures.put(path.getName(), writer); // NPE
>     closeExecutor.execute(() -> {
>       try {
>         writer.close();
>       } catch (IOException e) {
>         LOG.warn("close old writer failed", e);
>       } finally {
>         // call this even if the above close fails, as there is no other 
>         // chance we can set closed to true, it will not cause big problems.
>         markClosedAndClean(path);
>         inflightWALClosures.remove(path.getName());
>       }
>     });
>   }{code}
> I haven't figured out the exact reason why the writer is null. I will keep 
> digging into the root cause. Any help is appreciated!
>  
> I have attached the full logs from HMaster.
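For context on the NPE itself: java.util.concurrent.ConcurrentHashMap documents that it permits neither null keys nor null values, so put() throws NullPointerException as soon as a null writer is passed in. A minimal standalone demonstration:

```java
import java.util.concurrent.ConcurrentHashMap;

public class ChmNullDemo {
  public static void main(String[] args) {
    ConcurrentHashMap<String, Object> map = new ConcurrentHashMap<>();
    try {
      map.put("key", null); // ConcurrentHashMap rejects null values
      System.out.println("unreachable");
    } catch (NullPointerException e) {
      System.out.println("NPE: ConcurrentHashMap does not permit null values");
    }
  }
}
```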



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
