[
https://issues.apache.org/jira/browse/HDFS-9052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740108#comment-14740108
]
Alex Ivanov commented on HDFS-9052:
-----------------------------------
I've confirmed that the CDH 5.0.0/5.0.5 code actually contains the Hadoop 2.4
change you're referring to:
[Grepcode FSImageFormatPBSnapshot.java|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-hdfs/2.3.0-cdh5.0.5/org/apache/hadoop/hdfs/server/namenode/snapshot/FSImageFormatPBSnapshot.java/]
Let me elaborate a little on my scenario. We have a log file in HDFS
(useraction.log.crypto) that we overwrite, after which we take a snapshot of a
parent folder. The snapshots look as follows in the XML translation of the
FsImage (abridged):
{code}
SNAPSHOT ID=5063; NAME=s2015022606
<diff><inodeid>3432626</inodeid> // useractionlogs folder
<dirdiff><snapshotId>5063</snapshotId><isSnapshotroot>false</isSnapshotroot><childrenSize>3</childrenSize><name>useractionlogs</name>
<created><name>useraction.log.crypto</name></created>
<deleted><inode>3564976</inode></deleted>
</dirdiff>
...
</diff>
<diff><inodeid>3564976</inodeid>
<filediff><snapshotId>5063</snapshotId><size>1388</size><name>useraction.log.crypto</name></filediff>
</diff>
...
SNAPSHOT ID=5065; NAME=s2015022607
<diff><inodeid>3432626</inodeid> // useractionlogs folder
<dirdiff><snapshotId>5065</snapshotId><isSnapshotroot>false</isSnapshotroot><childrenSize>3</childrenSize><name>useractionlogs</name>
<created><name>useraction.log.crypto</name></created>
<deleted><inode>3565860</inode></deleted>
</dirdiff>
...
</diff>
<diff><inodeid>3565860</inodeid>
<filediff><snapshotId>5065</snapshotId><size>1388</size><name>useraction.log.crypto</name></filediff>
</diff>
...
{code}
As you can see, each snapshot deletes and re-creates the _useraction.log.crypto_
file. It seems there are two ways for me to run into the AssertionError, and I
don't think the fix for
[HDFS-6908|https://issues.apache.org/jira/browse/HDFS-6908] addresses Case 2.
*Case 1 - Do a snapshot diff using the hdfs client*
{code}
[root@node1075]# hdfs snapshotDiff /data/tenants/pdx-svt.baseline84/wddata s2015022606 s2015022607
snapshotDiff: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
{code}
Here's the _tail_ from the Namenode logs:
{code}
2015-09-11 01:44:59,558 WARN [IPC Server handler 89 on 8020] ipc.Server (Server.java:run(2002)) - IPC Server handler 89 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getSnapshotDiffReport from 10.52.209.75:60609 Call#0 Retry#1: error: java.lang.AssertionError: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
java.lang.AssertionError: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
    at org.apache.hadoop.hdfs.util.Diff.insert(Diff.java:193)
    at org.apache.hadoop.hdfs.util.Diff.delete(Diff.java:239)
    at org.apache.hadoop.hdfs.util.Diff.combinePosterior(Diff.java:462)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.initChildren(DirectoryWithSnapshotFeature.java:293)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.iterator(DirectoryWithSnapshotFeature.java:303)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.computeDiffRecursively(INodeDirectorySnapshottable.java:441)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.computeDiffRecursively(INodeDirectorySnapshottable.java:446)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.computeDiffRecursively(INodeDirectorySnapshottable.java:446)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.computeDiff(INodeDirectorySnapshottable.java:390)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.diff(SnapshotManager.java:378)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getSnapshotDiffReport(FSNamesystem.java:7123)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getSnapshotDiffReport(NameNodeRpcServer.java:1261)
...
{code}
It seems likely that the fix in the other Jira addresses this case.
However, the next case doesn't follow the same exception stack.
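Both stacks go through _Diff.combinePosterior_ into _Diff.delete_, so for illustration here is a toy model of the per-directory-diff created/deleted lists (this is NOT the real {{org.apache.hadoop.hdfs.util.Diff}} class; class and method names here are simplified, my own invention) showing how replaying a second deletion of the same name trips the assertion:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of a directory diff's created/deleted lists. Simplified for
 * illustration; not the real org.apache.hadoop.hdfs.util.Diff.
 */
public class DiffToy {
    final List<String> created = new ArrayList<>();
    final List<String> deleted = new ArrayList<>();

    /** Simplified Diff.delete(): record a deletion unless it undoes a create. */
    String delete(String name) {
        if (created.remove(name)) {
            return null; // deleting a newly created element just undoes the create
        }
        // Simplified Diff.insert(): the element must not already be deleted.
        if (deleted.contains(name)) {
            return "Element already exists: element=" + name + ", DELETED=" + deleted;
        }
        deleted.add(name);
        return null;
    }

    /** Replays the overwrite pattern from the fsimage above. */
    static String reproduce() {
        DiffToy combined = new DiffToy();
        // The earlier snapshot's diff already holds a deletion of the old
        // inode under this name...
        combined.delete("useraction.log.crypto");
        // ...and combining the posterior diff replays a second deletion of
        // the next overwritten inode under the same name.
        return combined.delete("useraction.log.crypto");
    }

    public static void main(String[] args) {
        System.out.println(reproduce());
        // prints: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
    }
}
```

In the real class the second deletion record should never exist for a healthy snapshot chain, which is why the condition is an assertion rather than a handled case.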
*Case 2 - Deleting a snapshot*
{code}
[root@node1075]# hdfs dfs -deleteSnapshot /data/tenants/pdx-svt.baseline84/wddata s2015022607_maintainer_soft_del
deleteSnapshot: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
{code}
Here's the exception from the Namenode logs this time:
{code}
2015-09-11 03:25:16,403 WARN [IPC Server handler 10 on 8020] ipc.Server (Server.java:run(2002)) - IPC Server handler 10 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.deleteSnapshot from 10.52.209.75:61662 Call#1 Retry#0: error: java.lang.AssertionError: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
java.lang.AssertionError: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
    at org.apache.hadoop.hdfs.util.Diff.insert(Diff.java:193)
    at org.apache.hadoop.hdfs.util.Diff.delete(Diff.java:239)
    at org.apache.hadoop.hdfs.util.Diff.combinePosterior(Diff.java:462)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.initChildren(DirectoryWithSnapshotFeature.java:293)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.iterator(DirectoryWithSnapshotFeature.java:303)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDeletedINode(DirectoryWithSnapshotFeature.java:531)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:823)
    at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:714)
    at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:684)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:830)
    at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:714)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.removeSnapshot(INodeDirectorySnapshottable.java:341)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.deleteSnapshot(SnapshotManager.java:238)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteSnapshot(FSNamesystem.java:7173)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.deleteSnapshot(NameNodeRpcServer.java:1222)
{code}
As you can see, _computeDiffBetweenSnapshots_, the method containing the fix,
is never called on this path. I'd assume some kind of existence check is needed
in the else branch of the following method in _Diff.java_ before _insert_ is
called:
{code}
public UndoInfo<E> delete(final E element) {
  final int c = search(created, element.getKey());
  E previous = null;
  Integer d = null;
  if (c >= 0) {
    // remove a newly created element
    previous = created.remove(c);
  } else {
    // not in c-list, it must be in previous
    d = search(deleted, element.getKey());
    insert(ListType.DELETED, element, d);
  }
  return new UndoInfo<E>(c, previous, d);
}
{code}
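Concretely, the kind of check I have in mind is sketched below against a simplified stand-in (class and method names are mine, and this is a hypothetical, untested change, not a patch to the real Diff.java). It relies on search() following the binarySearch convention the real class uses: a non-negative index means the element is already in the list, a negative value encodes the insertion point.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Sketch of the suggested guard on the deleted list. Simplified stand-in
 * for Diff; hypothetical fix, not tested against a NameNode.
 */
public class GuardedDelete {
    final List<String> deleted = new ArrayList<>();

    /** binarySearch-style: >= 0 means found, < 0 encodes insertion point. */
    int search(String key) {
        return Collections.binarySearch(deleted, key);
    }

    /** The else-branch of delete(), with the proposed existence check. */
    void recordDeletion(String name) {
        final int d = search(name);
        if (d >= 0) {
            // Already recorded as deleted (e.g. while combining a posterior
            // diff); skip the insert instead of hitting the AssertionError.
            return;
        }
        deleted.add(-d - 1, name);
    }

    public static void main(String[] args) {
        GuardedDelete g = new GuardedDelete();
        g.recordDeletion("useraction.log.crypto");
        g.recordDeletion("useraction.log.crypto"); // previously would assert
        System.out.println(g.deleted); // prints: [useraction.log.crypto]
    }
}
```

Whether silently skipping the duplicate is the right semantics, as opposed to repairing whatever is double-recording the deletion upstream, is of course a judgment call for the HDFS developers.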
> deleteSnapshot runs into AssertionError
> ---------------------------------------
>
> Key: HDFS-9052
> URL: https://issues.apache.org/jira/browse/HDFS-9052
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Alex Ivanov
>
> CDH 5.0.5 upgraded from CDH 5.0.0 (Hadoop 2.3)
> Upon deleting a snapshot, we run into the following assertion error. The
> scenario is as follows:
> 1. We have a program that deletes snapshots in reverse chronological order.
> 2. The program deletes a couple of hundred snapshots successfully but runs
> into the following exception:
> java.lang.AssertionError: Element already exists:
> element=useraction.log.crypto, DELETED=[useraction.log.crypto]
> 3. There seems to be an issue with that snapshot, which causes a file that
> normally gets overwritten in every snapshot to be added to the SnapshotDiff
> delete queue twice.
> 4. Once the deleteSnapshot is run on the problematic snapshot, if the
> Namenode is restarted, it cannot be started again until the transaction is
> removed from the EditLog.
> 5. Sometimes the bad snapshot can be deleted but the prior snapshot seems to
> "inherit" the same issue.
> 6. The error below is from Namenode starting when the DELETE_SNAPSHOT
> transaction is replayed from the EditLog.
> 2015-09-01 22:59:59,140 INFO [IPC Server handler 0 on 8022] BlockStateChange (BlockManager.java:logAddStoredBlock(2342)) - BLOCK* addStoredBlock: blockMap updated: 10.52.209.77:1004 is added to blk_1080833995_7093259{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-16de62e5-f6e2-4ea7-aad9-f8567bded7d7:NORMAL|FINALIZED]]} size 0
> 2015-09-01 22:59:59,140 INFO [IPC Server handler 0 on 8022] BlockStateChange (BlockManager.java:logAddStoredBlock(2342)) - BLOCK* addStoredBlock: blockMap updated: 10.52.209.77:1004 is added to blk_1080833996_7093260{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-1def2b07-d87f-49dd-b14f-ef230342088d:NORMAL|FINALIZED]]} size 0
> 2015-09-01 22:59:59,141 ERROR [IPC Server handler 0 on 8022] namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(232)) - Encountered exception on operation DeleteSnapshotOp [snapshotRoot=/data/tenants/pdx-svt.baseline84/wddata, snapshotName=s2015022614_maintainer_soft_del, RpcClientId=7942c957-a7cf-44c1-880d-6eea690e1b19, RpcCallId=1]
> java.lang.AssertionError: Element already exists: element=useraction.log.crypto, DELETED=[useraction.log.crypto]
>     at org.apache.hadoop.hdfs.util.Diff.insert(Diff.java:193)
>     at org.apache.hadoop.hdfs.util.Diff.delete(Diff.java:239)
>     at org.apache.hadoop.hdfs.util.Diff.combinePosterior(Diff.java:462)
>     at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.initChildren(DirectoryWithSnapshotFeature.java:293)
>     at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff$2.iterator(DirectoryWithSnapshotFeature.java:303)
>     at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDeletedINode(DirectoryWithSnapshotFeature.java:531)
>     at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:823)
>     at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:714)
>     at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:684)
>     at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:830)
>     at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:714)
>     at org.apache.hadoop.hdfs.server.namenode.snapshot.INodeDirectorySnapshottable.removeSnapshot(INodeDirectorySnapshottable.java:341)
>     at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.deleteSnapshot(SnapshotManager.java:238)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:667)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:224)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:133)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:802)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:783)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)