[ 
https://issues.apache.org/jira/browse/HDFS-13115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354849#comment-16354849
 ] 

Yongjun Zhang commented on HDFS-13115:
--------------------------------------

Noticed the following two methods in LeaseManager.java:

1.
{code}
  Collection<Long> getINodeIdWithLeases() {return leasesById.keySet();}
{code}

and 

2.
{code}
  private synchronized INode[] getINodesWithLease() {
    List<INode> inodes = new ArrayList<>(leasesById.size());
    INode currentINode;
    for (long inodeId : leasesById.keySet()) {
      currentINode = fsnamesystem.getFSDirectory().getInode(inodeId);
      // A file with an active lease could get deleted, or its
      // parent directories could get recursively deleted.
      if (currentINode != null &&
          currentINode.isFile() &&
          !fsnamesystem.isFileDeleted(currentINode.asFile())) {
        inodes.add(currentINode);
      }
    }
    return inodes.toArray(new INode[0]);
  }
{code}

When the second method is called, the FSNamesystem read lock is assumed to be 
held; however, when the first one is called, it is not.

In the case where we hit this issue, the first method is used. The first method 
is used in quite a few other places as well.

I believe it's reasonable to add the check as I proposed, and I'm providing a 
patch.
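
For concreteness, here is a minimal sketch of what such a safeguard could look 
like inside {{getNumUnderConstructionBlocks()}}. This is only an illustration, 
not the actual patch; the exact placement of the null check and the warning 
message are my assumptions.

{code}
  synchronized long getNumUnderConstructionBlocks() {
    assert this.fsnamesystem.hasReadLock() : "The FSNamesystem read lock wasn't"
      + "acquired before counting under construction blocks";
    long numUCBlocks = 0;
    for (Long id : getINodeIdWithLeases()) {
      INode inode = fsnamesystem.getFSDirectory().getInode(id);
      if (inode == null) {
        // Sketch only: the file with this lease (or one of its ancestor
        // directories) may have been deleted after getINodeIdWithLeases()
        // returned the id; skip it instead of dereferencing null below.
        LOG.warn("Failed to find inode " + id
            + " while counting under construction blocks");
        continue;
      }
      final INodeFile cons = inode.asFile();
      Preconditions.checkState(cons.isUnderConstruction());
      BlockInfo[] blocks = cons.getBlocks();
      if (blocks == null) {
        continue;
      }
      for (BlockInfo b : blocks) {
        if (!b.isComplete()) {
          numUCBlocks++;
        }
      }
    }
    LOG.info("Number of blocks under construction: " + numUCBlocks);
    return numUCBlocks;
  }
{code}

The rest of the method stays the same; the only difference is that inodeIds 
whose inodes can no longer be found in the FSDirectory are skipped rather than 
dereferenced.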




> In getNumUnderConstructionBlocks(), ignore the inodeIds for which the inodes 
> have been deleted 
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13115
>                 URL: https://issues.apache.org/jira/browse/HDFS-13115
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>            Priority: Major
>
> In LeaseManager, 
> {code}
>  private synchronized INode[] getINodesWithLease() {
>     List<INode> inodes = new ArrayList<>(leasesById.size());
>     INode currentINode;
>     for (long inodeId : leasesById.keySet()) {
>       currentINode = fsnamesystem.getFSDirectory().getInode(inodeId);
>       // A file with an active lease could get deleted, or its
>       // parent directories could get recursively deleted.
>       if (currentINode != null &&
>           currentINode.isFile() &&
>           !fsnamesystem.isFileDeleted(currentINode.asFile())) {
>         inodes.add(currentINode);
>       }
>     }
>     return inodes.toArray(new INode[0]);
>   }
> {code}
> we can see that, given an {{inodeId}}, 
> {{fsnamesystem.getFSDirectory().getInode(inodeId)}} could return null. The 
> reason is explained in the code comment above.
> HDFS-12985 root-caused and fixed one such case, and we saw that it resolves 
> some occurrences, but we are still seeing a NullPointerException from FSNamesystem:
> {code}
>   public long getCompleteBlocksTotal() {
>     // Calculate number of blocks under construction
>     long numUCBlocks = 0;
>     readLock();
>     try {
>       numUCBlocks = leaseManager.getNumUnderConstructionBlocks(); <=== here
>       return getBlocksTotal() - numUCBlocks;
>     } finally {
>       readUnlock();
>     }
>   }
> {code}
> The exception happens when the inode for the given inodeId has been removed; 
> see the LeaseManager code below:
> {code}
>   synchronized long getNumUnderConstructionBlocks() {
>     assert this.fsnamesystem.hasReadLock() : "The FSNamesystem read lock wasn't"
>       + "acquired before counting under construction blocks";
>     long numUCBlocks = 0;
>     for (Long id : getINodeIdWithLeases()) {
>       final INodeFile cons =
>           fsnamesystem.getFSDirectory().getInode(id).asFile(); <=== here
>       Preconditions.checkState(cons.isUnderConstruction());
>       BlockInfo[] blocks = cons.getBlocks();
>       if(blocks == null)
>         continue;
>       for(BlockInfo b : blocks) {
>         if(!b.isComplete())
>           numUCBlocks++;
>       }
>     }
>     LOG.info("Number of blocks under construction: " + numUCBlocks);
>     return numUCBlocks;
>   }
> {code}
> Creating this jira to add a check for whether the inode has been removed, as a 
> safeguard, to avoid the NullPointerException.
> It looks like, after the inodeId is returned by {{getINodeIdWithLeases()}}, the 
> inode gets deleted from the FSDirectory map.
> Ideally we should find out who deleted it, like in HDFS-12985. 
> But it seems reasonable to me to have a safeguard here, like other code in the 
> code base that calls {{fsnamesystem.getFSDirectory().getInode(id)}}.


