[jira] [Comment Edited] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

Konstantin Shvachko (JIRA) Sat, 21 Oct 2017 15:50:28 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214120#comment-16214120
 ]


Konstantin Shvachko edited comment on HDFS-12638 at 10/21/17 10:49 PM:
-----------------------------------------------------------------------

It looks to me that the main problem here is that {{unprotectedDelete()}}, while 
collecting blocks in {{INodeFile.destroyAndCollectBlocks()}}, invalidates the block 
collection id {{bcid}}, but leaves the block in the {{BlocksMap}}. Then 
{{FSNamesystem.delete()}} releases the lock and reacquires it for the actual 
block removal in {{FSNamesystem.removeBlocks()}}. So if {{ReplicationMonitor}} 
or {{NamenodeFsck}} kicks in after the lock is released, but before the blocks 
are deleted from {{BlocksMap}}, it can hit an NPE when accessing the invalid 
(id = -1) INode.
Incremental block deletion was introduced in HDFS-6618, so all major versions 
should be affected.
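
To make the window concrete, here is a minimal single-threaded sketch of the sequence described above. It is a toy model, not the HDFS code: {{DeleteRaceSketch}}, its {{Block}} class, {{collectBlocks()}}, {{removeBlock()}} and {{ownerOf()}} are hypothetical stand-ins, and the two plain maps only imitate {{BlocksMap}} and the inode map.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the delete window. All names are illustrative, not the real HDFS classes. */
public class DeleteRaceSketch {
    static final long INVALID_INODE_ID = -1L;

    /** Stand-in for a block record that remembers its block collection id. */
    static final class Block {
        final long blockId;
        long bcId;
        Block(long blockId, long bcId) { this.blockId = blockId; this.bcId = bcId; }
    }

    final Map<Long, Block> blocksMap = new HashMap<>();  // stand-in for BlocksMap
    final Map<Long, String> inodeMap = new HashMap<>();  // inode id -> file name

    /** Phase 1 (lock held): the inode goes away and bcId is invalidated, but the block stays mapped. */
    void collectBlocks(long inodeId) {
        inodeMap.remove(inodeId);
        for (Block b : blocksMap.values()) {
            if (b.bcId == inodeId) {
                b.bcId = INVALID_INODE_ID;
            }
        }
    }

    /** Phase 2 (lock reacquired): the block is finally removed from the map. */
    void removeBlock(long blockId) {
        blocksMap.remove(blockId);
    }

    /** What a monitor does: find the block, then resolve its owner. Returns null if there is none. */
    String ownerOf(long blockId) {
        Block b = blocksMap.get(blockId);
        return b == null ? null : inodeMap.get(b.bcId);
    }

    /** Returns true if, between the two phases, a mapped block has no resolvable owner. */
    static boolean reproducesStaleWindow() {
        DeleteRaceSketch ns = new DeleteRaceSketch();
        ns.inodeMap.put(100L, "/tmp/file");
        ns.blocksMap.put(1L, new Block(1L, 100L));

        ns.collectBlocks(100L);
        // The lock would be released here; a monitor thread may run now.
        boolean blockStillMapped = ns.blocksMap.containsKey(1L);
        String owner = ns.ownerOf(1L);
        ns.removeBlock(1L);

        return blockStillMapped && owner == null;
    }

    public static void main(String[] args) {
        System.out.println("stale window observed: " + reproducesStaleWindow());
    }
}
```

In this model a caller that dereferences the result of {{ownerOf()}} without a null check fails in the same way as the stack trace in the description.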

To fix this we should not invalidate {{bcid}} in 
{{INodeFile.destroyAndCollectBlocks()}}, but rather in 
{{BlockManager.removeBlockFromMap()}}, when the block is actually removed from 
{{BlocksMap}}.
I agree with [~daryn] that we should fix the bug (invalid blocks in the map), 
rather than mitigate its consequences (NPE).
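
A rough sketch of the ordering proposed here, again as a toy model rather than a patch: {{DeferredInvalidationSketch}} and its methods are hypothetical names, and only the method name {{removeBlockFromMap()}} is borrowed from the text above. The point it illustrates is the invariant that a block still present in the map never carries the invalid id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of deferring bcId invalidation to block removal. Names are illustrative only. */
public class DeferredInvalidationSketch {
    static final long INVALID_INODE_ID = -1L;

    /** Stand-in for a block record that remembers its block collection id. */
    static final class Block {
        final long blockId;
        long bcId;
        Block(long blockId, long bcId) { this.blockId = blockId; this.bcId = bcId; }
    }

    final Map<Long, Block> blocksMap = new HashMap<>();  // stand-in for BlocksMap

    /** Phase 1 (lock held): only collect the file's blocks; bcId is left untouched. */
    List<Block> collectBlocks(long inodeId) {
        List<Block> toRemove = new ArrayList<>();
        for (Block b : blocksMap.values()) {
            if (b.bcId == inodeId) {
                toRemove.add(b);
            }
        }
        return toRemove;
    }

    /** Phase 2 (lock reacquired): unmap the block and invalidate its bcId in the same step. */
    void removeBlockFromMap(Block b) {
        blocksMap.remove(b.blockId);
        b.bcId = INVALID_INODE_ID;
    }

    /** True if some block still in the map carries the invalid id. */
    boolean hasInvalidMappedBlock() {
        for (Block b : blocksMap.values()) {
            if (b.bcId == INVALID_INODE_ID) {
                return true;
            }
        }
        return false;
    }

    /** Checks the invariant in the window between phases and after each removal. */
    static boolean invariantHolds() {
        DeferredInvalidationSketch ns = new DeferredInvalidationSketch();
        ns.blocksMap.put(1L, new Block(1L, 100L));
        ns.blocksMap.put(2L, new Block(2L, 100L));

        List<Block> toRemove = ns.collectBlocks(100L);
        // The lock would be released here; a monitor thread may run now.
        boolean ok = !ns.hasInvalidMappedBlock();
        for (Block b : toRemove) {
            ns.removeBlockFromMap(b);
            ok = ok && !ns.hasInvalidMappedBlock();
        }
        return ok && ns.blocksMap.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("invariant holds: " + invariantHolds());
    }
}
```

This only models the ordering of the two steps; whether the owning INode is still resolvable during the window in the real NameNode is outside what the sketch covers.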


> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-12638
>                 URL: https://issues.apache.org/jira/browse/HDFS-12638
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.8.2
>            Reporter: Jiandan Yang 
>         Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> The active NameNode exited due to an NPE. I can confirm that the 
> BlockCollection passed in when creating ReplicationWork is null, but I do not 
> know why it is null. Looking at the history, I found that 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check 
> for whether BlockCollection is null.
> The NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
>         at java.lang.Thread.run(Thread.java:834)
> {code}


