[ https://issues.apache.org/jira/browse/HDFS-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326487#comment-14326487 ]
Kihwal Lee edited comment on HDFS-7809 at 2/18/15 8:23 PM:
-----------------------------------------------------------
Thanks, [~jingzhao]. It would have been nicer if the bug had been dealt with in a
separate jira. I will dupe this to one of the jiras.
> Block and lease recovery failure caused by snapshot issue
> ---------------------------------------------------------
>
> Key: HDFS-7809
> URL: https://issues.apache.org/jira/browse/HDFS-7809
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.5.0
> Reporter: Kihwal Lee
> Priority: Critical
>
> On a cluster running 2.5, we have observed a decommissioning failure caused by a
> file that had been under construction for 3 days. It turned out that the
> file had been abandoned and the name node had carried out a lease recovery 3
> days earlier.
> The block recovery failed because the name node threw a quota exception while
> serving {{commitBlockSynchronization()}}. After this failure, no further
> recovery attempt was made, leaving the file in the under-construction state
> forever.
> Furthermore, the nature of the recovery failure is very strange. Even though
> *snapshots were never used* in the cluster, the name node was trying to record
> a snapshot diff, which required incrementing {{nsquota}} usage by 1. The user
> happened to have run out of his {{nsquota}} at that time, so the update failed
> and caused {{commitBlockSynchronization()}} to fail. We do see quota
> discrepancies occasionally. Could those have been caused by something like
> this all along?
> A few observations:
> - Lease recovery did not complete, yet it was not retried.
> - No snapshot was in use, but the code somehow went through a snapshot-related
> code path.
> - The quota update during {{commitBlockSynchronization()}} should be done
> unconditionally.
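A minimal Java sketch of the last observation, assuming a simplified per-directory namespace counter. All names here (`NsQuotaCounter`, `addChecked`, `addUnchecked`, `commitWithCheckedUpdate`, `commitWithUncheckedUpdate`) are hypothetical and do not correspond to real HDFS classes; the local `QuotaExceededException` only mirrors the idea of the HDFS exception. It contrasts a checked update, which aborts the commit when the directory is already at its quota, with an unconditional update that lets the commit succeed while still recording the usage.

```java
// Hypothetical sketch: these classes are NOT the HDFS implementation.
public class QuotaSketch {
    // Local stand-in, not org.apache.hadoop.hdfs.protocol.QuotaExceededException.
    static class QuotaExceededException extends Exception {
        QuotaExceededException(String msg) { super(msg); }
    }

    /** Hypothetical per-directory namespace quota counter. */
    static class NsQuotaCounter {
        private final long nsQuota;
        private long nsUsed;

        NsQuotaCounter(long nsQuota, long nsUsed) {
            this.nsQuota = nsQuota;
            this.nsUsed = nsUsed;
        }

        /** Checked update: rejects a positive delta that would exceed the quota. */
        void addChecked(long delta) throws QuotaExceededException {
            if (delta > 0 && nsUsed + delta > nsQuota) {
                throw new QuotaExceededException("nsquota=" + nsQuota
                    + " would be exceeded: used=" + nsUsed + " delta=" + delta);
            }
            nsUsed += delta;
        }

        /** Unconditional update: always records the usage, even past the quota. */
        void addUnchecked(long delta) {
            nsUsed += delta;
        }

        long used() { return nsUsed; }
    }

    /** Hypothetical commit step needing one extra namespace item (e.g. a diff record). */
    static boolean commitWithCheckedUpdate(NsQuotaCounter c) {
        try {
            c.addChecked(1);
            return true;
        } catch (QuotaExceededException e) {
            return false; // commit aborted: the file would stay under construction
        }
    }

    /** Same step, but the quota update cannot make the commit fail. */
    static boolean commitWithUncheckedUpdate(NsQuotaCounter c) {
        c.addUnchecked(1);
        return true;
    }

    public static void main(String[] args) {
        NsQuotaCounter a = new NsQuotaCounter(10, 10); // user already at quota
        System.out.println(commitWithCheckedUpdate(a) + " " + a.used());   // false 10
        NsQuotaCounter b = new NsQuotaCounter(10, 10);
        System.out.println(commitWithUncheckedUpdate(b) + " " + b.used()); // true 11
    }
}
```

The unconditional variant can leave usage above the quota, which seems preferable to leaving a file stuck under construction; this is an illustration of that trade-off, not a description of the actual HDFS change.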