[
https://issues.apache.org/jira/browse/HDFS-15366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112422#comment-17112422
]
Wei-Chiu Chuang commented on HDFS-15366:
----------------------------------------
To add more color,
I dug out this old stack trace from a unit test 4 years ago. Posting it here for
future reference:
{noformat}
java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.INode.getParent(INode.java:660)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.getStoragePolicyID(INodeFile.java:392)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:4011)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$300(BlockManager.java:3976)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1478)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1384)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3947)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3818)
{noformat}
It looks similar but not exactly the same. Notice that computeReplicationWork
holds the namesystem lock but not the fsdirectory lock. My hunch is that a
parallel thread, holding the fsdirectory lock but not the namesystem lock,
deleted the inode at the same time.
It has only occurred once for me in four years, so it looks like a very rare race
condition bug.
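To make the suspected interleaving concrete, here is a minimal, self-contained
sketch (plain Java, not actual HDFS code; the Inode class and the two lock
objects are hypothetical stand-ins for the namesystem and fsdirectory locks).
Because the two threads never take the same lock, nothing orders the path
resolution after the delete, so the reader can observe a null parent and hit the
same kind of NPE as INode.getParent() in the traces above.
{code:java}
// Hypothetical sketch of the suspected race, not real HDFS code:
// the "replication" thread reads the parent link under one lock (namesystem lock),
// while a "delete" thread clears it under a different lock (fsdirectory lock).
public class InodeRaceSketch {

    // Stand-in for an HDFS INode: a parent pointer and a path lookup that
    // dereferences it, loosely mirroring INode.getParent()/getFullPathName().
    static class Inode {
        volatile Inode parent;
        Inode(Inode parent) { this.parent = parent; }
        String fullPathName() { return parent.toString(); } // NPE once parent is nulled
    }

    public static void main(String[] args) throws InterruptedException {
        final Object namesystemLock = new Object();  // held by the replication work
        final Object fsdirectoryLock = new Object(); // held by the delete path
        final Inode root = new Inode(null);
        final Inode file = new Inode(root);

        Thread replicationMonitor = new Thread(() -> {
            synchronized (namesystemLock) {          // namesystem lock only
                try {
                    System.out.println("resolved: " + file.fullPathName());
                } catch (NullPointerException e) {
                    System.out.println("NPE - inode was detached concurrently");
                }
            }
        });

        Thread delete = new Thread(() -> {
            synchronized (fsdirectoryLock) {         // fsdirectory lock only
                file.parent = null;                  // "delete" detaches the inode
            }
        });

        delete.start();
        replicationMonitor.start();
        delete.join();
        replicationMonitor.join();
    }
}
{code}
Depending on which thread wins, it prints either the resolved parent or the NPE
message; the point is only that neither lock excludes the other, which matches
the hunch above.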
> Active NameNode went down with NPE
> ----------------------------------
>
> Key: HDFS-15366
> URL: https://issues.apache.org/jira/browse/HDFS-15366
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.7.3
> Reporter: sarun singla
> Priority: Major
>
> {code:java}
> 2020-05-12 00:31:54,565 ERROR blockmanagement.BlockManager (BlockManager.java:run(3816)) - ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.namenode.INode.getParent(INode.java:629)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getRelativePathINodes(FSDirectory.java:1009)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFullPathINodes(FSDirectory.java:1015)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFullPathName(FSDirectory.java:1020)
> at org.apache.hadoop.hdfs.server.namenode.INode.getFullPathName(INode.java:591)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.getName(INodeFile.java:550)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:3912)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$200(BlockManager.java:3875)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1560)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1452)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3847)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3799)
> at java.lang.Thread.run(Thread.java:748)
> 2020-05-12 00:31:54,567 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
> 2020-05-12 00:31:54,621 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at xyz.com/xxx
> ************************************************************/{code}