[ https://issues.apache.org/jira/browse/HDFS-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326067#comment-14326067 ]
Kihwal Lee commented on HDFS-7809:
----------------------------------
Stack trace:
{panel}
2015-02-13 01:07:45,628 \[org.apache.hadoop.hdfs.server.datanode.DataNode$2@278a83a0\] WARN datanode.DataNode: recoverBlocks FAILED: RecoveringBlock\{BP-xxxxx:blk_12345_10000; getBlockSize()=4150; corrupt=false; offset=-1; locs=\[1.2.3.4:1004, 1.2.3.5:1004, 1.2.3.6:1004\]\}
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.NSQuotaExceededException): Failed to record modification for snapshot: The NameSpace quota (directories and files) is exceeded: quota=50000 file count=50001
at org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyNamespaceQuota(DirectoryWithQuotaFeature.java:138)
at org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyQuota(DirectoryWithQuotaFeature.java:153)
at org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.addSpaceConsumed(DirectoryWithQuotaFeature.java:96)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addSpaceConsumed(INodeDirectory.java:136)
at org.apache.hadoop.hdfs.server.namenode.INode.addSpaceConsumed2Parent(INode.java:484)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addSpaceConsumed(INodeDirectory.java:138)
at org.apache.hadoop.hdfs.server.namenode.INode.addSpaceConsumed2Parent(INode.java:484)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addSpaceConsumed(INodeDirectory.java:138)
at org.apache.hadoop.hdfs.server.namenode.INode.addSpaceConsumed2Parent(INode.java:484)
at org.apache.hadoop.hdfs.server.namenode.INode.addSpaceConsumed(INode.java:474)
at org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.addDiff(AbstractINodeDiffList.java:125)
at org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.checkAndAddLatestSnapshotDiff(AbstractINodeDiffList.java:284)
at org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.saveSelf2Snapshot(AbstractINodeDiffList.java:296)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.recordModification(INodeFile.java:305)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.finalizeINodeFileUnderConstruction(FSNamesystem.java:4202)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.closeFileCommitBlocks(FSNamesystem.java:4419)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4383)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:699)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolServerSideTranslatorPB.java:270)
at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:28073)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
{panel}
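
To make the failing path concrete, below is a minimal, self-contained sketch of the verified namespace-quota update that the snapshot-diff frames above run into. This is plain illustrative Java, not the HDFS source: the class and method names merely mirror the frames in the trace, and the fields, signatures, and bodies are simplified assumptions.

{code:java}
// Simplified model of the failure above. NOT the HDFS implementation;
// names mirror the stack trace, everything else is illustrative only.
public class QuotaPathSketch {

    /** Thrown when the namespace (file + directory count) quota is exceeded. */
    static class NSQuotaExceededException extends RuntimeException {
        NSQuotaExceededException(String msg) { super(msg); }
    }

    /** Simplified stand-in for DirectoryWithQuotaFeature. */
    static class DirectoryWithQuotaFeature {
        final long nsQuota;   // configured namespace quota, e.g. 50000
        long nsCount;         // current namespace usage

        DirectoryWithQuotaFeature(long nsQuota, long nsCount) {
            this.nsQuota = nsQuota;
            this.nsCount = nsCount;
        }

        // Mirrors verifyNamespaceQuota(): reject any delta that pushes usage past the quota.
        void verifyNamespaceQuota(long nsDelta) {
            if (nsCount + nsDelta > nsQuota) {
                throw new NSQuotaExceededException(
                    "The NameSpace quota (directories and files) is exceeded: quota="
                        + nsQuota + " file count=" + (nsCount + nsDelta));
            }
        }

        // Mirrors addSpaceConsumed(..., verify): verify first, then apply the delta.
        void addSpaceConsumed(long nsDelta, boolean verify) {
            if (verify) {
                verifyNamespaceQuota(nsDelta);
            }
            nsCount += nsDelta;
        }
    }

    public static void main(String[] args) {
        // Directory already at its quota, as in the report (quota=50000).
        DirectoryWithQuotaFeature dir = new DirectoryWithQuotaFeature(50000, 50000);

        // Recording a snapshot diff charges one extra namespace item against the
        // ancestor directories, so the verified update fails with the same kind
        // of exception seen during commitBlockSynchronization().
        try {
            dir.addSpaceConsumed(1, true);   // snapshot-diff path -> verify=true
        } catch (NSQuotaExceededException e) {
            System.out.println("Recovery aborted: " + e.getMessage());
        }
    }
}
{code}

In this model the directory is already at quota=50000, so charging the one extra namespace item for the snapshot diff raises the exception, which would then propagate out of the recovery call exactly as in the trace.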
> Block and lease recovery failure caused by snapshot issue
> ---------------------------------------------------------
>
> Key: HDFS-7809
> URL: https://issues.apache.org/jira/browse/HDFS-7809
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.5.0
> Reporter: Kihwal Lee
> Priority: Critical
>
> On a cluster running 2.5, we observed a decommissioning failure caused by a
> file that had been under construction for three days. It turned out that the
> file had been abandoned and a lease recovery had been carried out by the name
> node three days earlier.
> The block recovery failed because the name node threw a quota exception while
> serving {{commitBlockSynchronization()}}. After this failure, no further
> recovery attempt was made, leaving the file in the under-construction state
> forever.
> Furthermore, the nature of the recovery failure is very strange. Even though
> *snapshots were never used* on the cluster, the name node tried to record a
> snapshot diff, and that required charging one more item against the
> {{nsquota}}. The user happened to have run out of his {{nsquota}} at that
> time, so the update failed and caused {{commitBlockSynchronization()}} to
> fail. We do see quota discrepancies occasionally; perhaps those were caused
> by something like this all along.
> A few observations:
> - Lease recovery did not complete, yet it was never retried.
> - No snapshot was in use, but the operation somehow went through a
> snapshot-related code path.
> - The quota update during {{commitBlockSynchronization()}} should be done
> unconditionally (see the sketch below).
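
To illustrate that last observation, here is an editor's sketch in plain Java of the suggested direction. {{QuotaDirectory}} and both method names are hypothetical, not HDFS APIs: the idea is that the recovery path records the namespace delta without the verification step, so {{commitBlockSynchronization()}} cannot be blocked by a quota violation and the over-quota condition is surfaced to the user rather than failing the recovery.

{code:java}
// Sketch of an "unconditional" quota update on the recovery path.
// Illustration of the idea only, not a patch against the real HDFS code.
public class UnconditionalQuotaUpdateSketch {

    static class QuotaDirectory {               // hypothetical helper class
        final long nsQuota;
        long nsCount;

        QuotaDirectory(long nsQuota, long nsCount) {
            this.nsQuota = nsQuota;
            this.nsCount = nsCount;
        }

        /** Normal client-driven path: enforce the quota before charging it. */
        void addSpaceConsumedVerified(long nsDelta) {
            if (nsCount + nsDelta > nsQuota) {
                throw new IllegalStateException("NSQuota exceeded: quota=" + nsQuota
                    + " file count=" + (nsCount + nsDelta));
            }
            nsCount += nsDelta;
        }

        /** Recovery path: record the usage unconditionally; never fail the recovery. */
        void addSpaceConsumedUnverified(long nsDelta) {
            nsCount += nsDelta;   // may temporarily leave the directory over quota
        }
    }

    public static void main(String[] args) {
        QuotaDirectory dir = new QuotaDirectory(50000, 50000);

        // Lease/block recovery completes even though the owner is already at quota;
        // the over-quota state is then reported to the user, not to the recovery.
        dir.addSpaceConsumedUnverified(1);
        System.out.println("count=" + dir.nsCount + " (over quota by "
            + (dir.nsCount - dir.nsQuota) + ", but recovery completed)");
    }
}
{code}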