[jira] [Commented] (HDFS-7707) Edit log corruption due to delayed block removal again

Yongjun Zhang (JIRA) Fri, 30 Jan 2015 12:00:01 -0800

    [ https://issues.apache.org/jira/browse/HDFS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299153#comment-14299153 ]


Yongjun Zhang commented on HDFS-7707:
-------------------------------------

Hi Kihwal,

Thanks a lot for your further comments. I did the analysis based on the edit 
log, and I assumed {{commitBlockSynchronization()}} is involved because of the 
delayed block removal. This is basically the same code path that HDFS-6825 
examined. I will take a look at the other paths too.

Assuming {{commitBlockSynchronization()}} is involved, the {{iNodeFile}} is 
obtained by the following code:
{code}
    BlockCollection blockCollection = storedBlock.getBlockCollection();
    INodeFile iFile = ((INode)blockCollection).asFile();
{code}
Do you mean that we could get a wrong iFile here?

BTW, your comment rang a bell: when we delete a dir, why doesn't 
{{tmpParent}} become null at {{dirX}} when the loop below tries to get the 
parent of {{dirX}} (if this is what happened)?
{code}
    while (true) {
      if (tmpParent == null ||
          tmpParent.searchChildren(tmpChild.getLocalNameBytes()) < 0) {
        return true;
      }
      if (tmpParent.isRoot()) {
        break;
      }
      tmpChild = tmpParent;
      tmpParent = tmpParent.getParent();
    }
{code}
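
To make the scenario concrete, here is a minimal, self-contained Java sketch. It is not the NameNode code: {{Node}}, {{deletedByName}} and {{deletedByIdentity}} are hypothetical names, and the sketch assumes that deleting a directory only detaches it from its parent's child list while the parent references inside the deleted subtree stay intact. Under that assumption, a name-based ancestor walk like the loop above is defeated once a new dir with the same name is created. Comparing node identity is shown only as one possible more robust check, not necessarily what the fix should be.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified namespace node; NOT the real HDFS INode classes.
class Node {
    final String name;
    Node parent; // assumption: left untouched when the node is detached
    final List<Node> children = new ArrayList<>();

    Node(String name) { this.name = name; }

    Node addChild(String childName) {
        Node c = new Node(childName);
        c.parent = this;
        children.add(c);
        return c;
    }

    // Detach the child from this dir only; its subtree keeps its parent links.
    void removeChild(Node c) { children.remove(c); }

    // Look a child up by name, like a name-based search of the child list.
    Node findChild(String childName) {
        for (Node c : children) {
            if (c.name.equals(childName)) return c;
        }
        return null;
    }
}

class DeletedCheckSketch {
    // Name-based walk: an ancestor "exists" if its parent has ANY child with that name.
    static boolean deletedByName(Node file) {
        Node child = file;
        Node parent = file.parent;
        while (true) {
            if (parent == null || parent.findChild(child.name) == null) return true;
            if (parent.parent == null) break; // reached the root
            child = parent;
            parent = parent.parent;
        }
        return false;
    }

    // Identity-based walk: the parent must hold this exact node, not just the same name.
    static boolean deletedByIdentity(Node file) {
        Node child = file;
        Node parent = file.parent;
        while (parent != null) {
            if (parent.findChild(child.name) != child) return true;
            child = parent;
            parent = parent.parent;
        }
        return false;
    }

    public static void main(String[] args) {
        Node root = new Node("");
        Node dirX = root.addChild("dirX");
        Node fileY = dirX.addChild("fileY");

        root.removeChild(dirX);  // delete dirX recursively
        root.addChild("dirX");   // recreate a dir with the same name right away

        System.out.println("name-based says deleted:     " + deletedByName(fileY));
        System.out.println("identity-based says deleted: " + deletedByIdentity(fileY));
    }
}
```

In this sketch the name-based walk stops treating {{fileY}} as deleted as soon as the new {{dirX}} appears, while the identity comparison still does, because the old {{dirX}} is no longer the node stored under root.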

Thanks.


> Edit log corruption due to delayed block removal again
> ------------------------------------------------------
>
>                 Key: HDFS-7707
>                 URL: https://issues.apache.org/jira/browse/HDFS-7707
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>
> Edit log corruption is seen again, even with the fix of HDFS-6825.
> Prior to the HDFS-6825 fix, if dirX is deleted recursively, an OP_CLOSE can get
> into the edit log for fileY under dirX, thus corrupting the edit log
> (restarting the NN with the edit log would fail).
> What HDFS-6825 does to fix this issue is to detect whether fileY is already
> deleted by checking the ancestor dirs on its path: if any of them doesn't
> exist, then fileY is already deleted, and OP_CLOSE is not put into the edit
> log for the file.
> For this new edit log corruption, what I found was that the client first
> deleted dirX recursively, then created another dir with exactly the same name
> as dirX right away. Because HDFS-6825 counts on the namespace check (whether
> dirX exists in its parent dir) to decide whether a file has been deleted, the
> newly created dirX defeats this check, so the OP_CLOSE for the already
> deleted file gets into the edit log, due to delayed block removal.
> What we need is a more robust way to detect whether a file has been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
