Quota bug for partial blocks allows quotas to be violated 
----------------------------------------------------------

                 Key: HDFS-1377
                 URL: https://issues.apache.org/jira/browse/HDFS-1377
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: name-node
    Affects Versions: 0.21.0, 0.20.2, 0.20.1
            Reporter: Eli Collins
            Assignee: Eli Collins
            Priority: Blocker
             Fix For: 0.20.3, 0.21.1, 0.22.0


There's a bug in the quota code that causes space quotas not to be respected 
when a file is not an exact multiple of the block size. Here's an example:

{code}
$ hadoop fs -mkdir /test
$ hadoop dfsadmin -setSpaceQuota 384M /test
$ ls dir/ | wc -l   # dir contains 101 files
101
$ du -ms dir        # each is 3mb
304     dir
$ hadoop fs -put dir /test
$ hadoop fs -count -q /test
        none             inf       402653184      -550502400            2          101          317718528 hdfs://haus01.sf.cloudera.com:10020/test
$ hadoop fs -stat "%o %r" /test/dir/f30
134217728 3    # 128mb block size, replication 3
{code}
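
(The {{-count -q}} columns are QUOTA, REM_QUOTA, SPACE_QUOTA, REM_SPACE_QUOTA, 
DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME.) Note the remaining space quota 
is negative because it's derived from the actual block sizes:

{code}
402653184 - 317718528 * 3 replicas = -550502400
{code}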

INodeDirectoryWithQuota caches the number of bytes consumed by its children in 
{{diskspace}}. The quota adjustment code has a bug that causes {{diskspace}} to 
get updated incorrectly when a file is not an exact multiple of the block size 
(the value ends up going negative). 

This causes the quota checking code to think that the files in the directory 
consume less space than they actually do, so verifyQuota does not throw a 
QuotaExceededException even when the directory is over quota. However the bug 
isn't directly visible to users because {{fs -count -q}} reports the numbers 
generated by INode#getContentSummary, which adds up the sizes of the blocks 
rather than using the cached INodeDirectoryWithQuota#diskspace value.
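
Concretely, with the numbers from the example (the per-file arithmetic is 
worked through below):

{code}
getContentSummary (fs -count -q):  101 * 3mb * 3 replicas = 909mb   over quota
cached diskspace (verifyQuota):    negative                         under quota
{code}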

In FSDirectory#addBlock the disk space consumed is set conservatively to the 
full block size * the number of replicas:

{code}
updateCount(inodes, inodes.length-1, 0,
    fileNode.getPreferredBlockSize()*fileNode.getReplication(), true);
{code}
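
For each 3mb file in the example this initially charges:

{code}
preferredBlockSize * replication = 128mb * 3 = 384mb
{code}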

In FSNamesystem#addStoredBlock we adjust for this conservative estimate by 
subtracting out the difference between the conservative estimate and the 
number of bytes actually stored:

{code}
//Updated space consumed if required.
INodeFile file = (storedBlock != null) ? storedBlock.getINode() : null;
long diff = (file == null) ? 0 :
    (file.getPreferredBlockSize() - storedBlock.getNumBytes());

if (diff > 0 && file.isUnderConstruction() &&
    cursize < storedBlock.getNumBytes()) {
...
    dir.updateSpaceConsumed(path, 0, -diff*file.getReplication());
{code}
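
Here diff is per replica, so for the example the adjustment works out to:

{code}
diff       = preferredBlockSize - numBytes  = 128mb - 3mb = 125mb
adjustment = -diff * replication            = -125mb * 3  = -375mb
{code}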

We do the same in FSDirectory#replaceNode when completing the file, but at a 
file granularity (I believe the intent here is to correct for cases where 
there's a failure replicating blocks and recovery). Since oldnode is under 
construction, INodeFile#diskspaceConsumed will use the preferred block size 
(versus Block#getNumBytes, used by newnode), so we will again subtract out the 
difference between the full block size and the number of bytes actually 
stored:

{code}
long dsOld = oldnode.diskspaceConsumed();
...
//check if disk space needs to be updated.
long dsNew = 0;
if (updateDiskspace && (dsNew = newnode.diskspaceConsumed()) != dsOld) {
  try {
    updateSpaceConsumed(path, 0, dsNew-dsOld);
...
{code}
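
The asymmetry between the two calculations is the core of the bug. A minimal 
sketch (paraphrased, not the actual Hadoop source, which takes no such 
parameters):

{code}
// Paraphrased sketch of the two diskspaceConsumed calculations, not the
// actual Hadoop source. oldnode is under construction, so its last
// (partial) block is charged at the preferred block size; newnode is
// finalized, so every block is charged at Block#getNumBytes.
long underConstructionDiskspace(Block[] blocks, long preferredBlockSize,
                                short replication) {
  long size = 0;
  for (int i = 0; i < blocks.length - 1; i++) {
    size += blocks[i].getNumBytes();
  }
  if (blocks.length > 0) {
    size += preferredBlockSize;   // partial last block, charged in full
  }
  return size * replication;
}

long finalizedDiskspace(Block[] blocks, short replication) {
  long size = 0;
  for (Block b : blocks) {
    size += b.getNumBytes();      // actual bytes written
  }
  return size * replication;
}
{code}

So for each 3mb file dsOld is 384mb while dsNew is 9mb, and we subtract the 
same 375mb that addStoredBlock already subtracted.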

So in the above example we start with diskspace at 384mb (3 * 128mb) and then 
subtract 375mb (to reflect that only 9mb raw was actually used) twice, so for 
each file the diskspace for the directory comes out to -366mb (384mb minus 
2 * 375mb). This is why the cached value goes negative and yet we can still 
write more files.
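
As a per-file ledger:

{code}
addBlock:        +384mb   (conservative: 3 replicas * 128mb)
addStoredBlock:  -375mb   (3 replicas * (128mb - 3mb))
replaceNode:     -375mb   (the same adjustment, applied again)
                 ------
net:             -366mb   per 3mb file that actually uses 9mb raw
{code}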

So a directory with lots of single-block files ends up with a cached diskspace 
value that's way off (for files with multiple blocks only the final, partial 
block is subtracted twice, so the error accumulates fastest with small files).

I think the fix is for FSDirectory#replaceNode not to have the 
diskspaceConsumed calculations differ when the old and new INode have the same 
blocks. I'll work on a patch that also adds a quota test for files that are 
not exact multiples of the block size and warns in 
INodeDirectory#computeContentSummary if the computed size does not match the 
cached value.
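
A hypothetical sketch of the first part of that fix (not the actual patch, 
which may take a different approach):

{code}
// Hypothetical sketch, not the actual patch: skip the adjustment in
// FSDirectory#replaceNode when the old and new inodes carry the same
// blocks, since addStoredBlock has already corrected the conservative
// estimate for the partial last block.
boolean sameBlocks = java.util.Arrays.equals(oldnode.getBlocks(),
                                             newnode.getBlocks());
if (updateDiskspace && !sameBlocks) {
  long dsOld = oldnode.diskspaceConsumed();
  long dsNew = newnode.diskspaceConsumed();
  if (dsNew != dsOld) {
    updateSpaceConsumed(path, 0, dsNew - dsOld);
  }
}
{code}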
