[ https://issues.apache.org/jira/browse/HDFS-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053932#comment-14053932 ]
Jing Zhao commented on HDFS-6618:
---------------------------------

bq. If we're just trying to make a minimal change to put out the fire without fixing the (existing) leak issue, why not just move the call to dir.removeFromInodeMap(removedINodes) up inside the first try... catch block? Then we can file a JIRA for some refactoring which fixes the leak issue

I agree with Colin. Maybe we can first fix the bug here by moving the removeFromInodeMap call into the FSNamesystem lock, and fix the remaining issues in separate JIRA(s).

bq. Since the block is not removed from blocksMap, but the block still has reference to the block collection (i.e. inode), the block will look valid to BlockManager. This will cause memory leak, which will disappear when namenode is restarted.

The block deletion has to be after the logSync call here, so I guess it's hard to avoid this kind of leak (or we have to change the blocksMap structure etc.). Since the memory leak can go away after restarting, maybe we do not need to worry about this part too much right now and focus on the leak on the inodeMap.

> Remove deleted INodes from INodeMap right away
> ----------------------------------------------
>
>                 Key: HDFS-6618
>                 URL: https://issues.apache.org/jira/browse/HDFS-6618
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.5.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Blocker
>         Attachments: HDFS-6618.AbstractList.patch,
> HDFS-6618.inodeRemover.patch, HDFS-6618.inodeRemover.v2.patch, HDFS-6618.patch
>
>
> After HDFS-6527, we have not seen the edit log corruption for weeks on
> multiple clusters until yesterday. Previously, we would see it within 30
> minutes on a cluster.
> But the same condition was reproduced even with HDFS-6527. The only
> explanation is that the RPC handler thread serving {{addBlock()}} was
> accessing a stale parent value. Although nulling out parent is done inside the
> {{FSNamesystem}} and {{FSDirectory}} write lock, there is no memory barrier
> because there is no "synchronized" block involved in the process.
> I suggest making parent volatile.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
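
For illustration, the ordering proposed in the comment above (remove the inode from the inode map while the namesystem write lock is still held, and keep block removal after logSync) might look roughly like the sketch below. This is not the actual HDFS code: NamesystemSketch, its maps, and every method name here are hypothetical stand-ins, and the real patch may be structured differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the namesystem; these are NOT the real HDFS classes.
class NamesystemSketch {
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
    private final Map<Long, String> inodeMap = new HashMap<>(); // inode id -> path
    private final Map<Long, Long> blocksMap = new HashMap<>();  // block id -> owning inode id
    final List<String> events = new ArrayList<>();              // records the order of steps

    void addFile(long inodeId, String path, long blockId) {
        fsLock.writeLock().lock();
        try {
            inodeMap.put(inodeId, path);
            blocksMap.put(blockId, inodeId);
        } finally {
            fsLock.writeLock().unlock();
        }
    }

    String lookup(long inodeId) {
        fsLock.readLock().lock();
        try {
            return inodeMap.get(inodeId);
        } finally {
            fsLock.readLock().unlock();
        }
    }

    int blockCount() {
        fsLock.readLock().lock();
        try {
            return blocksMap.size();
        } finally {
            fsLock.readLock().unlock();
        }
    }

    boolean delete(long inodeId) {
        List<Long> collectedBlocks = new ArrayList<>();
        fsLock.writeLock().lock();
        try {
            if (!inodeMap.containsKey(inodeId)) {
                return false;
            }
            for (Map.Entry<Long, Long> e : blocksMap.entrySet()) {
                if (e.getValue() == inodeId) {
                    collectedBlocks.add(e.getKey());
                }
            }
            // The point of the minimal fix: drop the inode from the map
            // before the write lock is released, so no other handler can
            // look up the deleted inode afterwards.
            inodeMap.remove(inodeId);
            events.add("removeFromInodeMap(locked="
                    + fsLock.isWriteLockedByCurrentThread() + ")");
        } finally {
            fsLock.writeLock().unlock();
        }

        // Stand-in for syncing the edit log, done outside the lock.
        events.add("logSync");

        // Block removal stays after logSync, as the comment requires.
        fsLock.writeLock().lock();
        try {
            for (Long blockId : collectedBlocks) {
                blocksMap.remove(blockId);
            }
            events.add("removeBlocks");
        } finally {
            fsLock.writeLock().unlock();
        }
        return true;
    }
}
```

In this sketch a failure between logSync and the block removal would still leave entries in blocksMap, which is the leak that the comment suggests deferring to a separate JIRA.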