[ https://issues.apache.org/jira/browse/HDFS-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053932#comment-14053932 ]

Jing Zhao commented on HDFS-6618:
---------------------------------

bq. If we're just trying to make a minimal change to put out the fire without 
fixing the (existing) leak issue, why not just move the call to 
dir.removeFromInodeMap(removedINodes) up inside the first try... catch block? 
Then we can file a JIRA for some refactoring which fixes the leak issue 

I agree with Colin. Maybe we can first fix the bug here by moving the 
removeFromInodeMap call inside the FSNamesystem lock, and fix the remaining 
issues in separate JIRAs.
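
Concretely, the minimal change could look something like this (a sketch 
simplified from the delete path; everything besides the removeFromInodeMap 
call itself is an assumption for illustration, not a reviewed patch):

{code:java}
writeLock();
try {
  filesRemoved = dir.delete(src, collectedBlocks, removedINodes, mtime);
  // Moved up, inside the FSNamesystem write lock: no other RPC handler
  // can now observe a deleted inode still reachable via the inodeMap.
  dir.removeFromInodeMap(removedINodes);
} finally {
  writeUnlock();
}
getEditLog().logSync();
// Block removal still has to happen after logSync (see below).
removeBlocks(collectedBlocks);
{code}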

bq. Since the block is not removed from blocksMap, but the block still has 
reference to the block collection (i.e. inode), the block will look valid to 
BlockManager. This will cause memory leak, which will disappear when namenode 
is restarted.

The block deletion has to happen after the logSync call here, so I guess this 
kind of leak is hard to avoid (unless we change the blocksMap structure, etc.). 
Since that leak goes away after a restart, maybe we do not need to worry about 
it too much right now and can focus on the leak in the inodeMap.
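
To spell out why those blocks look valid: as long as the stored {{BlockInfo}} 
still references its block collection, the BlockManager has no way to tell 
that it belongs to a deleted file. Roughly (illustration only, not actual 
BlockManager code):

{code:java}
// A stored block whose BlockInfo still references its block collection
// (the deleted inode) is indistinguishable from a block of a live file:
BlockInfo stored = blocksMap.getStoredBlock(block);
if (stored != null && stored.getBlockCollection() != null) {
  // Looks like a valid, referenced block, so neither the BlockInfo nor
  // the deleted inode it points to can be reclaimed until the namenode
  // restarts and rebuilds the blocksMap.
}
{code}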

> Remove deleted INodes from INodeMap right away
> ----------------------------------------------
>
>                 Key: HDFS-6618
>                 URL: https://issues.apache.org/jira/browse/HDFS-6618
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.5.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Blocker
>         Attachments: HDFS-6618.AbstractList.patch, 
> HDFS-6618.inodeRemover.patch, HDFS-6618.inodeRemover.v2.patch, HDFS-6618.patch
>
>
> After HDFS-6527, we had not seen the edit log corruption for weeks on 
> multiple clusters, until yesterday. Previously, we would see it within 30 
> minutes on a cluster.
> But the same condition was reproduced even with HDFS-6527. The only 
> explanation is that the RPC handler thread serving {{addBlock()}} was 
> accessing a stale parent value. Although nulling out parent is done inside 
> the {{FSNamesystem}} and {{FSDirectory}} write locks, there is no memory 
> barrier because no "synchronized" block is involved in the process.
> I suggest making {{parent}} volatile.
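
For reference, the suggested change would look roughly like this (the 
enclosing class and accessors are simplified assumptions):

{code:java}
// Sketch only. A write to a volatile field happens-before any subsequent
// read of that field, so an addBlock() handler reading parent is
// guaranteed to see the null written by the deleting thread, even though
// neither thread passes through a synchronized block.
abstract class INode {
  private volatile INode parent;  // was: private INode parent;

  final INode getParent() { return parent; }
  final void setParent(INode parent) { this.parent = parent; }
}
{code}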


