[ https://issues.apache.org/jira/browse/HDFS-9052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740108#comment-14740108 ]

Alex Ivanov commented on HDFS-9052:
-----------------------------------

I've confirmed that the CDH 5.0.0/5.0.5 code actually contains the Hadoop 2.4 change you're referring to:
[Grepcode FSImageFormatPBSnapshot.java|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-hdfs/2.3.0-cdh5.0.5/org/apache/hadoop/hdfs/server/namenode/snapshot/FSImageFormatPBSnapshot.java/]

Let me elaborate a little on my scenario. We have a log file in HDFS (useraction.log.crypto) which we overwrite, and we then take a snapshot of a parent folder. The snapshots look as follows in the XML translation of the FsImage (abridged):
{code}
SNAPSHOT ID=5063; NAME=s2015022606
<diff><inodeid>3432626</inodeid>                                // useractionlogs folder
  <dirdiff><snapshotId>5063</snapshotId><isSnapshotroot>false</isSnapshotroot><childrenSize>3</childrenSize><name>useractionlogs</name>
    <created><name>useraction.log.crypto</name></created>
    <deleted><inode>3564976</inode></deleted>
  </dirdiff>
  ...
</diff>
<diff><inodeid>3564976</inodeid>
  <filediff><snapshotId>5063</snapshotId><size>1388</size><name>useraction.log.crypto</name></filediff>
</diff>
...

SNAPSHOT ID=5065; NAME=s2015022607
<diff><inodeid>3432626</inodeid>                                // useractionlogs folder
  <dirdiff><snapshotId>5065</snapshotId><isSnapshotroot>false</isSnapshotroot><childrenSize>3</childrenSize><name>useractionlogs</name>
    <created><name>useraction.log.crypto</name></created>
    <deleted><inode>3565860</inode></deleted>
  </dirdiff>
  ...
</diff>
<diff><inodeid>3565860</inodeid>
  <filediff><snapshotId>5065</snapshotId><size>1388</size><name>useraction.log.crypto</name></filediff>
</diff>
...
{code}
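
For context, the cycle our client performs between those two snapshots is roughly the following; this is a hedged sketch against the public FileSystem API (the class name and payload are illustrative, and our real writer differs):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OverwriteThenSnapshot {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path snapRoot = new Path("/data/tenants/pdx-svt.baseline84/wddata");
    Path file = new Path(snapRoot, "useractionlogs/useraction.log.crypto");

    // Overwriting replaces the inode: the enclosing dirdiff records a
    // <deleted> entry for the old inode and a <created> entry for the name.
    byte[] payload = new byte[1388]; // illustrative; sized like the filediff above
    try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
      out.write(payload);
    }

    // Snapshotting the (already snapshottable) parent freezes that dirdiff.
    fs.createSnapshot(snapRoot, "s2015022606");
  }
}
{code}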

As the _dirdiff_ sections above show, each snapshot deletes and re-creates the _useraction.log.crypto_ file. There seem to be two ways for me to run into the AssertionError, and I don't think the fix for [HDFS-6908|https://issues.apache.org/jira/browse/HDFS-6908] addresses Case 2.

*Case 1 - Do a snapshot diff using the hdfs client*
{code}
[root@node1075]# hdfs snapshotDiff /data/tenants/pdx-svt.baseline84/wddata s2015022606 s2015022607
snapshotDiff: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
{code}

Here's the _tail_ from the Namenode logs:
{code}
2015-09-11 01:44:59,558 WARN  [IPC Server handler 89 on 8020] ipc.Server (Server.java:run(2002)) - IPC Server handler 89 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getSnapshotDiffReport from 10.52.209.75:60609 Call#0 Retry#1: error: java.lang.AssertionError: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
java.lang.AssertionError: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
        at org.apache.hadoop.hdfs.util.Diff.insert(Diff.java:193)
        at org.apache.hadoop.hdfs.util.Diff.delete(Diff.java:239)
        at org.apache.hadoop.hdfs.util.Diff.combinePosterior(Diff.java:462)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.initChildren(DirectoryWithSnapshotFeature.java:293)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.iterator(DirectoryWithSnapshotFeature.java:303)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.computeDiffRecursively(INodeDirectorySnapshottable.java:441)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.computeDiffRecursively(INodeDirectorySnapshottable.java:446)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.computeDiffRecursively(INodeDirectorySnapshottable.java:446)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.computeDiff(INodeDirectorySnapshottable.java:390)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.diff(SnapshotManager.java:378)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getSnapshotDiffReport(FSNamesystem.java:7123)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getSnapshotDiffReport(NameNodeRpcServer.java:1261)
...
{code}

It seems likely that the fix for the other Jira addresses this case. The next case, however, reaches the same assertion through a different call path.

*Case 2 - Deleting a snapshot*
{code}
[root@node1075]# hdfs dfs -deleteSnapshot /data/tenants/pdx-svt.baseline84/wddata s2015022607_maintainer_soft_del
deleteSnapshot: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
{code}

Here's the exception from the Namenode logs this time:
{code}
2015-09-11 03:25:16,403 WARN  [IPC Server handler 10 on 8020] ipc.Server (Server.java:run(2002)) - IPC Server handler 10 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.deleteSnapshot from 10.52.209.75:61662 Call#1 Retry#0: error: java.lang.AssertionError: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
java.lang.AssertionError: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
        at org.apache.hadoop.hdfs.util.Diff.insert(Diff.java:193)
        at org.apache.hadoop.hdfs.util.Diff.delete(Diff.java:239)
        at org.apache.hadoop.hdfs.util.Diff.combinePosterior(Diff.java:462)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.initChildren(DirectoryWithSnapshotFeature.java:293)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.iterator(DirectoryWithSnapshotFeature.java:303)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDeletedINode(DirectoryWithSnapshotFeature.java:531)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:823)
        at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:714)
        at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:684)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:830)
        at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:714)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.removeSnapshot(INodeDirectorySnapshottable.java:341)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.deleteSnapshot(SnapshotManager.java:238)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteSnapshot(FSNamesystem.java:7173)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.deleteSnapshot(NameNodeRpcServer.java:1222)
{code}

As you can see, the method the fix touches, _computeDiffBetweenSnapshots_, is never called on this path. Since _search_ matches elements by name, combining diffs that each recorded the deletion of a (different) inode named _useraction.log.crypto_ leaves _d_ non-negative in the ELSE clause below, at which point _insert_ throws. I'd assume some kind of check is needed in that ELSE clause of the following method in _Diff.java_ before _insert_ is called:
{code}
  public UndoInfo<E> delete(final E element) {
    final int c = search(created, element.getKey());
    E previous = null;
    Integer d = null;
    if (c >= 0) {
      // remove a newly created element
      previous = created.remove(c);
    } else {
      // not in c-list, it must be in previous
      d = search(deleted, element.getKey());
      // if the name is already in the d-list, d is non-negative here and
      // insert() throws the "Element already exists" AssertionError
      insert(ListType.DELETED, element, d);
    }
    return new UndoInfo<E>(c, previous, d);
  }
{code}
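
Purely for illustration, here is a sketch of the kind of guard I mean. The duplicate check is my own speculation, not a reviewed fix; skipping the insert avoids the AssertionError but may only paper over whichever earlier step put two same-named entries into the d-list:
{code}
  // Hypothetical variant of Diff.delete(); only the d < 0 guard is new.
  public UndoInfo<E> delete(final E element) {
    final int c = search(created, element.getKey());
    E previous = null;
    Integer d = null;
    if (c >= 0) {
      // remove a newly created element
      previous = created.remove(c);
    } else {
      // not in c-list, it must be in previous
      d = search(deleted, element.getKey());
      if (d < 0) {
        insert(ListType.DELETED, element, d);
      }
      // else: the name is already recorded as deleted; keeping the existing
      // entry avoids the AssertionError, but whether the resulting diff is
      // still consistent is exactly what needs review.
    }
    return new UndoInfo<E>(c, previous, d);
  }
{code}
Whether such a guard belongs here or upstream in _combinePosterior_ I'll leave to someone who knows this code better.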

> deleteSnapshot runs into AssertionError
> ---------------------------------------
>
>                 Key: HDFS-9052
>                 URL: https://issues.apache.org/jira/browse/HDFS-9052
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Alex Ivanov
>
> CDH 5.0.5 upgraded from CDH 5.0.0 (Hadoop 2.3)
> Upon deleting a snapshot, we run into the following assertion error. The scenario is as follows:
> 1. We have a program that deletes snapshots in reverse chronological order.
> 2. The program deletes a couple of hundred snapshots successfully but runs into the following exception:
> java.lang.AssertionError: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
> 3. There seems to be an issue with that snapshot, which causes a file that normally gets overwritten in every snapshot to be added to the SnapshotDiff delete queue twice.
> 4. Once deleteSnapshot is run on the problematic snapshot, if the Namenode is restarted, it cannot be started again until the transaction is removed from the EditLog.
> 5. Sometimes the bad snapshot can be deleted, but the prior snapshot seems to "inherit" the same issue.
> 6. The error below is from the Namenode starting while the DELETE_SNAPSHOT transaction is replayed from the EditLog.
> 2015-09-01 22:59:59,140 INFO  [IPC Server handler 0 on 8022] BlockStateChange (BlockManager.java:logAddStoredBlock(2342)) - BLOCK* addStoredBlock: blockMap updated: 10.52.209.77:1004 is added to blk_1080833995_7093259{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-16de62e5-f6e2-4ea7-aad9-f8567bded7d7:NORMAL|FINALIZED]]} size 0
> 2015-09-01 22:59:59,140 INFO  [IPC Server handler 0 on 8022] BlockStateChange (BlockManager.java:logAddStoredBlock(2342)) - BLOCK* addStoredBlock: blockMap updated: 10.52.209.77:1004 is added to blk_1080833996_7093260{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-1def2b07-d87f-49dd-b14f-ef230342088d:NORMAL|FINALIZED]]} size 0
> 2015-09-01 22:59:59,141 ERROR [IPC Server handler 0 on 8022] namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(232)) - Encountered exception on operation DeleteSnapshotOp [snapshotRoot=/data/tenants/pdx-svt.baseline84/wddata, snapshotName=s2015022614_maintainer_soft_del, RpcClientId=7942c957-a7cf-44c1-880d-6eea690e1b19, RpcCallId=1]
> java.lang.AssertionError: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
>         at org.apache.hadoop.hdfs.util.Diff.insert(Diff.java:193)
>         at org.apache.hadoop.hdfs.util.Diff.delete(Diff.java:239)
>         at org.apache.hadoop.hdfs.util.Diff.combinePosterior(Diff.java:462)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.initChildren(DirectoryWithSnapshotFeature.java:293)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.iterator(DirectoryWithSnapshotFeature.java:303)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDeletedINode(DirectoryWithSnapshotFeature.java:531)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:823)
>         at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:714)
>         at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:684)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:830)
>         at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:714)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.removeSnapshot(INodeDirectorySnapshottable.java:341)
>         at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.deleteSnapshot(SnapshotManager.java:238)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:667)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:224)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:133)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:802)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:783)


