[
https://issues.apache.org/jira/browse/HDFS-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049206#comment-13049206
]
Michael Noll commented on HDFS-2053:
------------------------------------
Hi Eli, many thanks for your quick reply and feedback!
FYI: I have integrated your suggestions into the patch. Currently I am waiting
for "ant test" to finish to see whether {{TestQuota}} does indeed trigger this
assert, and thus whether I actually need to add any special test to it.
@unit testing:
From what I have seen, {{TestQuota#testSpaceCommands()}} would be the place to
add a test for this issue, and
{{dfs.getContentSummary(Path).getSpaceConsumed()}} would be the way to
"indirectly" check the diskspace consumed by a directory and its children. It
seems to be semantically equivalent to the actual
{{INodeDirectory#computeContentSummary(long[])}} method I want to test but it
appears to be several layers up the call stack [1]. Is this correct?
If so, my test case using {{dfs.getContentSummary()}} would basically be the
following (sketched in code below):
1. Create a parent dir plus 3 subdirs {{A,B,C}}.
2. Use {{DFSTestUtil.createFile()}} to create a file of size {a,b,c} *
{{fileLength}} in {{A,B,C}}, respectively.
3. Test whether {{getSpaceConsumed()}} of {{A,B,C}} equals the expected value,
i.e. {a,b,c} * {{fileLength}} * {{replication}}.
4. Test whether {{getSpaceConsumed()}} of the parent dir equals {a+b+c} *
{{fileLength}} * {{replication}}.
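Concretely, the test could look roughly like this. Just a sketch: the path,
sizes and the {{factors}} array are made-up placeholders, and it assumes the
{{dfs}} handle from {{TestQuota}}'s MiniDFSCluster setup plus {{DFSTestUtil}}
and the usual JUnit asserts:
{code}
// Sketch only; relies on org.apache.hadoop.fs.Path, org.apache.hadoop.hdfs.DFSTestUtil
// and JUnit's assertTrue/assertEquals from the surrounding test class.
final short replication = 3;
final int fileLength = 1024;
final int[] factors = {2, 3, 5};                   // the a, b, c from above
final Path parent = new Path("/test/hdfs-2053");   // hypothetical test path
final String[] names = {"A", "B", "C"};

long expectedTotal = 0;
for (int i = 0; i < names.length; i++) {
  Path subdir = new Path(parent, names[i]);
  assertTrue(dfs.mkdirs(subdir));
  // one file of size factor * fileLength per subdirectory
  DFSTestUtil.createFile(dfs, new Path(subdir, "file"),
                         factors[i] * fileLength, replication, 0L);
  long expected = (long) factors[i] * fileLength * replication;
  assertEquals(expected, dfs.getContentSummary(subdir).getSpaceConsumed());
  expectedTotal += expected;
}
// the parent dir should report the sum of its subdirs
assertEquals(expectedTotal, dfs.getContentSummary(parent).getSpaceConsumed());
{code}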
I'm just asking since I'm not fully sure whether {{dfs.getContentSummary()}}
might mask any hidden issues in the "lower level" method
{{INodeDirectory.computeContentSummary()}}.
Best,
Michael
[1] {{FSDirectory#getContentSummary(String)}} seems to be the method that
actually calls {{INodeDirectory#computeContentSummary(long[])}} at some point.
> NameNode detects "Inconsistent diskspace" for directories with quota-enabled
> subdirectories (introduced by HDFS-1377)
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-2053
> URL: https://issues.apache.org/jira/browse/HDFS-2053
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.20.3, 0.20.204.0, 0.20.205.0
> Environment: Hadoop release 0.20.203.0 with the HDFS-1377 patch
> applied.
> My impression is that the same issue also exists in the other branches to
> which the HDFS-1377 patch has been applied (see description).
> Reporter: Michael Noll
> Assignee: Michael Noll
> Priority: Minor
> Fix For: 0.20.3, 0.20.204.0, 0.20.205.0
>
> Attachments: HDFS-2053_v1.txt, HDFS-2053_v2.txt
>
>
> *How to reproduce*
> {code}
> # create test directories
> $ hadoop fs -mkdir /hdfs-1377/A
> $ hadoop fs -mkdir /hdfs-1377/B
> $ hadoop fs -mkdir /hdfs-1377/C
> # ...add some test data (few kB or MB) to all three dirs...
> # set space quota for subdir C only
> $ hadoop dfsadmin -setSpaceQuota 1g /hdfs-1377/C
> # the following two commands _on the parent dir_ trigger the warning
> $ hadoop fs -dus /hdfs-1377
> $ hadoop fs -count -q /hdfs-1377
> {code}
> Warning message in the namenode logs:
> {code}
> 2011-06-09 09:42:39,817 WARN org.apache.hadoop.hdfs.server.namenode.NameNode:
> Inconsistent diskspace for directory C. Cached: 433872320 Computed: 438465355
> {code}
> Note that the commands are run on the _parent directory_ but the warning is
> shown for the _subdirectory_ with space quota.
> *Background*
> The bug was introduced by the HDFS-1377 patch, which is currently committed
> to at least branch-0.20, branch-0.20-security, branch-0.20-security-204,
> branch-0.20-security-205 and release-0.20.3-rc2. In the patch,
> {{src/hdfs/org/apache/hadoop/hdfs/server/namenode/INodeDirectory.java}} was
> updated to trigger the warning above if the cached and computed diskspace
> values are not the same for a directory with quota.
> The warning is written by {{computeContentSummary(long[] summary)}} in
> {{INodeDirectory}}. In that method an inode's children are recursively walked
> while the {{summary}} parameter is passed along and updated on the way.
> {code}
> /** {@inheritDoc} */
> long[] computeContentSummary(long[] summary) {
>   if (children != null) {
>     for (INode child : children) {
>       child.computeContentSummary(summary);
>     }
>   }
> {code}
> The condition that triggers the warning message compares the current node's
> cached diskspace (via {{node.diskspaceConsumed()}}) with the corresponding
> field in {{summary}}.
> {code}
> if (-1 != node.getDsQuota() && space != summary[3]) {
>   NameNode.LOG.warn("Inconsistent diskspace for directory "
>       + getLocalName() + ". Cached: " + space + " Computed: " + summary[3]);
> {code}
> However {{summary}} may already include diskspace information from other
> inodes at this point (i.e. from different subtrees than the subtree of the
> node for which the warning message is shown; in our example for the tree at
> {{/hdfs-1377}}, {{summary}} can already contain information from
> {{/hdfs-1377/A}} and {{/hdfs-1377/B}} when it is passed to inode
> {{/hdfs-1377/C}}). Hence the cached value for {{C}} can incorrectly be
> different from the computed value.
> *How to fix*
> The supplied patch creates a fresh summary array for the subtree of the
> current node. The walk through the children passes and updates this
> {{subtreeSummary}} array, and the condition is checked against
> {{subtreeSummary}} instead of the original {{summary}}. The original
> {{summary}} is updated with the values of {{subtreeSummary}} before the
> method returns.
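> In code, the idea is roughly the following (a simplified sketch of the
> approach rather than a verbatim excerpt of the attached patch):
> {code}
> /** {@inheritDoc} */
> long[] computeContentSummary(long[] summary) {
>   // fresh array for this subtree only, so counts already accumulated from
>   // sibling subtrees in the caller's summary cannot skew the quota check
>   long[] subtreeSummary = new long[]{0, 0, 0, 0};
>   if (children != null) {
>     for (INode child : children) {
>       child.computeContentSummary(subtreeSummary);
>     }
>   }
>   // ... the HDFS-1377 consistency check, now comparing the cached diskspace
>   //     against subtreeSummary[3] instead of summary[3] ...
>   // fold this subtree's counts back into the caller's summary
>   for (int i = 0; i < summary.length; i++) {
>     summary[i] += subtreeSummary[i];
>   }
>   return summary;
> }
> {code}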
> *Unit Tests*
> I have run "ant test" on my patched build without any errors*. However the
> existing unit tests did not catch this issue for the original HDFS-1377
> patch, so this might not mean anything. ;-)
> That said I am unsure what the most appropriate way to unit test this issue
> would be. A straight-forward approach would be to automate the steps in the
> _How to reproduce section_ above and check whether the NN logs an incorrect
> warning message. But I'm not sure how this check could be implemented. Feel
> free to provide some pointers if you have some ideas.
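> {code}
> // Untested idea: capture the NameNode's log output while computing the
> // content summary of the parent dir, then assert that no spurious
> // "Inconsistent diskspace" warning was emitted. Paths are placeholders;
> // needs org.apache.log4j.{Logger,SimpleLayout,WriterAppender} and
> // java.io.StringWriter.
> StringWriter logOutput = new StringWriter();
> WriterAppender appender = new WriterAppender(new SimpleLayout(), logOutput);
> Logger nnLogger = Logger.getLogger(NameNode.class.getName());
> nnLogger.addAppender(appender);
> try {
>   // ... create /hdfs-1377/{A,B,C} with some data and set a space quota on C
>   //     (e.g. via dfs.setQuota()), mirroring the repro steps above ...
>   dfs.getContentSummary(new Path("/hdfs-1377"));  // what "fs -dus" ends up calling
> } finally {
>   nnLogger.removeAppender(appender);
> }
> assertFalse("NameNode logged a spurious quota warning",
>             logOutput.toString().contains("Inconsistent diskspace for directory"));
> {code}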
> *Note about Fix Version/s*
> The patch _should_ apply to all branches that the HDFS-1377 patch has been
> committed to. In my environment, the build was the Hadoop 0.20.203.0 release
> with a (trivial) backport of HDFS-1377 (the 0.20.203.0 release does not ship
> with the HDFS-1377 fix). I could apply the patch successfully to
> {{branch-0.20-security}}, {{branch-0.20-security-204}} and
> {{release-0.20.3-rc2}}, for instance. Since I'm a bit confused regarding the
> upcoming 0.20.x release versions (0.20.x vs. 0.20.20x.y), I have been so bold
> as to add 0.20.203.0 to the list of affected versions even though it is
> actually only affected when HDFS-1377 is applied to it...
> Best,
> Michael
> *Well, I get one error for {{TestRumenJobTraces}}, but first, it seems to be
> completely unrelated, and second, I get the same test error when running the
> tests on the stock 0.20.203.0 release build.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira