[
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200361#comment-16200361
]
Kihwal Lee commented on HDFS-12638:
-----------------------------------
We have also seen blocks with a "null" bc staying in the replication queue. They
were missing blocks, so the replication monitor didn't even try to schedule them
and didn't crash. But metaSave was listing them as orphaned (bc == null,
deleted). Other than failing over (to force queue reinitialization), there was no
way to clear them.
In your particular case, we can add a null check in {{scheduleReplication()}}
in addition to the existing deletion check. The missing-block case is a bit
trickier: the replication monitor will not touch these blocks, and nothing will
move them to a different priority level, since they are already deleted and
invalidated on the datanodes. We should prevent them from being added to the queue in the first place.
In any case, it is apparent that the new {{isDeleted()}} check cannot fully replace
the bc null check. [~jingzhao], any thoughts?
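The two guards proposed above (reject stale blocks when queueing, and re-check both bc and the deletion flag when scheduling) can be sketched with a small, self-contained Java model. Everything in it is hypothetical: {{FileStub}}, {{BlockStub}}, {{isStale}}, {{enqueue}} and {{schedule}} are illustrative names, not Hadoop's {{BlockManager}}, {{BlockCollection}} or {{scheduleReplication()}} API. It assumes, as described above, that a block can lose its owning file (bc == null) without being flagged deleted, so the deletion flag alone does not catch it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model only: these names are illustrative and are NOT
// Hadoop's BlockManager / BlockCollection / ReplicationWork API.
public class ReplicationGuardSketch {

    // Stand-in for the file that owns a block (the "bc").
    static final class FileStub {
        final String path;
        FileStub(String path) { this.path = path; }
    }

    // Stand-in for a block: owner may be null, and the deleted flag is tracked separately.
    static final class BlockStub {
        final long id;
        FileStub owner;   // null models the "bc == null" (orphaned) case
        boolean deleted;  // models a separate isDeleted() flag
        BlockStub(long id, FileStub owner) { this.id = id; this.owner = owner; }
        boolean isDeleted() { return deleted; }
    }

    // Both checks are needed: a block can lose its owner without being flagged deleted.
    static boolean isStale(BlockStub b) {
        return b.owner == null || b.isDeleted();
    }

    // Enqueue-side guard: keep stale blocks out of the queue, since nothing
    // would ever move or remove them afterwards.
    static boolean enqueue(Deque<BlockStub> queue, BlockStub b) {
        if (isStale(b)) {
            return false;
        }
        queue.add(b);
        return true;
    }

    // Schedule-side guard: an entry may become stale after it was queued,
    // so check again and drop it instead of dereferencing a null owner.
    static List<String> schedule(Deque<BlockStub> queue) {
        List<String> scheduled = new ArrayList<>();
        while (!queue.isEmpty()) {
            BlockStub b = queue.poll();
            if (isStale(b)) {
                continue; // dropped from the queue without an NPE
            }
            scheduled.add(b.owner.path + "#" + b.id);
        }
        return scheduled;
    }

    public static void main(String[] args) {
        FileStub file = new FileStub("/data/a");
        Deque<BlockStub> queue = new ArrayDeque<>();

        BlockStub live = new BlockStub(1, file);
        BlockStub orphan = new BlockStub(2, null); // owner gone, not flagged deleted
        BlockStub removed = new BlockStub(3, file);
        removed.deleted = true;                     // flagged deleted
        BlockStub later = new BlockStub(4, file);

        System.out.println(enqueue(queue, live));    // true
        System.out.println(enqueue(queue, orphan));  // false
        System.out.println(enqueue(queue, removed)); // false
        System.out.println(enqueue(queue, later));   // true

        later.owner = null; // file removed after the block was queued
        System.out.println(schedule(queue));         // [/data/a#1]
        System.out.println(queue.isEmpty());         // true
    }
}
```

The schedule-side check covers the race where bc becomes null after the block was queued, which the enqueue-side check alone cannot catch.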
> NameNode exits due to ReplicationMonitor thread received Runtime exception in
> ReplicationWork#chooseTargets
> -----------------------------------------------------------------------------------------------------------
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 2.8.2
> Reporter: Jiandan Yang
>
> The active NameNode exits due to an NPE. I can confirm that the BlockCollection passed
> in when creating ReplicationWork is null, but I do not know why
> it is null. Looking at the history, I found that
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the null
> check on BlockCollection.
> NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor]
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)