[
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211413#comment-16211413
]
Daryn Sharp commented on HDFS-12638:
------------------------------------
bq. Yes, I think our code should bear with such orphan blocks, instead of
failing the NN with NPE like this. At least.
See below, they aren't really orphaned. I think it's correct for the NN to
crash if the namesystem data structures are corrupted.
bq. I assume when the snapshot gets deleted, these blocks will be also removed
from the blocks map. But before that, we need to live with such orphaned blocks
To the block manager, replication monitor, etc., these copy-on-truncate blocks
are not (supposed to be) special. My prior point, stated another way, is that a
block is not orphaned if it's in a snapshot diff. INodes are not orphaned when
only referenced via a snapshot diff. A block in the blocks map should not
reference an inode that is not in the inodes map. Direct namespace
accessibility is irrelevant to the block/inode/map linkages being correct.
We need to fix the bug, not mask it.
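As a rough illustration of the invariant described above, here is a toy model.
It is not HDFS code: the two collections are hypothetical stand-ins for the
blocks map and the inodes map, and the name findDanglingBlocks is made up for
this sketch. It only shows that what matters is whether the owning inode is
still present in the inodes map, not whether the file is reachable by path.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the blocks-map / inodes-map linkage; not the real HDFS classes. */
public class LinkageCheck {

    /**
     * Returns the IDs of blocks whose owning inode ID is missing from the inode set.
     *
     * @param blockToInode hypothetical stand-in for the blocks map (block ID -> owning inode ID)
     * @param inodeIds     hypothetical stand-in for the key set of the inodes map
     */
    public static List<Long> findDanglingBlocks(Map<Long, Long> blockToInode,
                                                Set<Long> inodeIds) {
        List<Long> dangling = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : blockToInode.entrySet()) {
            // A block is only a problem if its inode is gone from the inode set.
            if (!inodeIds.contains(entry.getValue())) {
                dangling.add(entry.getKey());
            }
        }
        Collections.sort(dangling); // deterministic order for reporting
        return dangling;
    }

    public static void main(String[] args) {
        Set<Long> inodeIds = new HashSet<>();
        inodeIds.add(100L); // file reachable through the live namespace
        inodeIds.add(101L); // file referenced only via a snapshot diff, still a valid inode

        Map<Long, Long> blockToInode = new HashMap<>();
        blockToInode.put(1L, 100L);
        blockToInode.put(2L, 101L); // copy-on-truncate block kept by a snapshot: linkage is fine
        blockToInode.put(3L, 102L); // inode 102 is missing: this is the corrupted linkage

        System.out.println("Blocks with a broken inode linkage: "
                + findDanglingBlocks(blockToInode, inodeIds));
    }
}
```

In this model the block held only by a snapshot diff (block 2) is not reported,
while the block pointing at a missing inode (block 3) is. The second case is
the kind of inconsistency that should be traced to its cause rather than
skipped over.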
> NameNode exits due to ReplicationMonitor thread received Runtime exception in
> ReplicationWork#chooseTargets
> -----------------------------------------------------------------------------------------------------------
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 2.8.2
> Reporter: Jiandan Yang
> Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> The active NameNode exits due to an NPE. I can confirm that the BlockCollection
> passed in when creating ReplicationWork is null, but I do not know why it is
> null. Looking through the history, I found that
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check
> for whether BlockCollection is null.
> The NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
>     at java.lang.Thread.run(Thread.java:834)
> {code}