Uma Maheswara Rao G created HDFS-4482:
-----------------------------------------

             Summary: ReplicationMonitor thread can exit with NPE due to the race between delete and replication of same file.
                 Key: HDFS-4482
                 URL: https://issues.apache.org/jira/browse/HDFS-4482
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Uma Maheswara Rao G
            Priority: Blocker


Trace:

{noformat}
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFullPathName(FSDirectory.java:1442)
        at org.apache.hadoop.hdfs.server.namenode.INode.getFullPathName(INode.java:269)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.getName(INodeFile.java:163)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:131)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1157)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1063)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3085)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3047)
        at java.lang.Thread.run(Thread.java:619)

{noformat}

What I am seeing here is:

1) Create a file and write it with 2 DNs.
2) Close the file.
3) Kill one DN.
4) Let replication start.
  Info:
{code}
      // choose replication targets: NOT HOLDING THE GLOBAL LOCK
      // It is costly to extract the filename for which chooseTargets is called,
      // so for now we pass in the block collection itself.
      rw.targets = blockplacement.chooseTarget(rw.bc,
          rw.additionalReplRequired, rw.srcNode, rw.liveReplicaNodes,
          excludedNodes, rw.block.getNumBytes());
{code}
Here we choose the targets outside the global lock. Inside chooseTarget we try
to get the src path from the blockCollection (which is nothing but an INodeFile
here).

See the code for FSDirectory#getFullPathName: it increments the depth for as
long as the inode has a parent, and later iterates again, accessing the parent
once more in the next loop.

If the client deletes the file between these two loops, the parent will have
been set to null. So accessing the parent here can cause an NPE, because it is
not done under the lock.
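
To make the race concrete, here is a small standalone sketch. It is not the
HDFS code: Node, unsafeFullPath and safeFullPath are hypothetical names, and the
betweenPasses hook only simulates a delete landing between the two loops. The
unsafe method follows the two-pass pattern described above; the safe one reads
each parent link once and returns null when the chain is broken, which is only
one possible way to handle it (holding the lock around the lookup would be
another).

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FullPathRaceSketch {
    /** Hypothetical stand-in for an inode: a name plus a parent link. */
    static final class Node {
        final String name;
        volatile Node parent; // a delete sets this to null
        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /** Two-pass walk: count the depth first, then re-read the parents. */
    static String unsafeFullPath(Node node, Runnable betweenPasses) {
        int depth = 0;
        for (Node n = node; n.parent != null; n = n.parent) {
            depth++;
        }
        betweenPasses.run(); // simulates a delete between the two loops
        String[] names = new String[depth];
        Node n = node;
        for (int i = depth - 1; i >= 0; i--) {
            names[i] = n.name; // NPE here if an earlier parent read returned null
            n = n.parent;
        }
        return "/" + String.join("/", names);
    }

    /** Single-pass walk: read each parent once, give up if the chain is broken. */
    static String safeFullPath(Node node, Node root) {
        Deque<String> names = new ArrayDeque<>();
        Node n = node;
        while (n != root) {
            Node p = n.parent; // read once into a local
            if (p == null) {
                return null; // node was detached (deleted) before reaching root
            }
            names.addFirst(n.name);
            n = p;
        }
        return "/" + String.join("/", names);
    }

    public static void main(String[] args) {
        Node root = new Node("", null);
        Node dir = new Node("dir", root);
        Node file = new Node("file", dir);

        System.out.println(unsafeFullPath(file, () -> { }));
        try {
            // The "delete" detaches the file after the depth has been counted.
            unsafeFullPath(file, () -> file.parent = null);
        } catch (NullPointerException e) {
            System.out.println("NPE from the two-pass walk");
        }
        System.out.println(safeFullPath(file, root));
    }
}
```

With the null return, the caller has to treat "no path" as "file already
deleted" and skip the replication work for that block.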


2) 

