[ https://issues.apache.org/jira/browse/HDFS-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918505#comment-13918505 ]

Chris Nauroth commented on HDFS-5995:
-------------------------------------

bq. That's why we came up with {{dfs.namenode.max.op.size}} – see HDFS-3440 for 
details. Chris Nauroth, it sounds like the ACL code is not honoring this 
parameter. We should fix it.

[~cmccabe], no, that's not quite the problem here.  The ACL ops go through the 
same old stream classes, so the max op size will be honored just like any other 
op.  It's not the op itself that is too large.  The problem is that some of the 
ops can now encode a variable-length array of ACL entries.  The first field is 
the array length, and we're using that to pre-allocate an array of matching 
size (avoiding realloc costs).  The corruption tests are causing the input 
stream to get positioned such that a really large integer gets misinterpreted 
as this array size, so the pre-allocation uses too much memory.
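
To make the failure mode concrete, here's a rough sketch of the pattern.  This is 
not the actual {{FSEditLogOp}} reader; the class and method names are just 
illustrative:

{code:java}
import java.io.DataInput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class AclEntrySketch {
  // Reads a length-prefixed list of entries, conceptually like the ACL ops do:
  // the first field is the entry count, and it is trusted to size the pre-allocation.
  static List<String> readAclEntries(DataInput in) throws IOException {
    int count = in.readInt();                       // corrupt edits can leave a huge value here
    List<String> entries = new ArrayList<>(count);  // huge count means a huge backing array, hence the OutOfMemoryError
    for (int i = 0; i < count; i++) {
      entries.add(in.readUTF());                    // placeholder for the per-entry decoding
    }
    return entries;
  }
}
{code}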

I believe Haohui's argument is that there is very little done in the way of 
validating the actual data contained within an op before we try to apply it.  
This goes beyond just checking the number of bytes in the op.

My first attempt at a patch was a test-only fix, but I wasn't successful.  It's 
very hard to corrupt the edits in exactly the right way: you have to predict how 
the stream gets positioned afterwards and then make sure the wrong integer doesn't 
get misinterpreted as the array size.  If we do find a way to do that, 
then the test is going to be very tightly coupled to the exact implementation 
details of the current edit log, so it would be a brittle test.  Maybe I'm just 
not seeing a clever trick here though.

FWIW, I don't think the approach in the current patch is such a bad thing.  
It's just constraining the pre-allocation size.  It's still a dynamically sized 
array, so for example, if the maximum number of entries changes from 32 to some 
other number in the future (or even becomes configurable), then we're never going 
to truncate the 
array and drop entries.  Perhaps it's good to try to prevent this one possible 
corruption case like Colin suggested, even if we can't cover every possible 
kind of corruption.
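
Something along these lines; the names and the cap constant are illustrative, not 
the exact code in the patch:

{code:java}
import java.io.DataInput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ConstrainedPreallocSketch {
  // Assumed cap for the pre-allocation only; 32 matches the current max entry
  // count mentioned above, but the clamp doesn't have to track it exactly.
  private static final int PREALLOC_CAP = 32;

  static List<String> readAclEntries(DataInput in) throws IOException {
    int count = in.readInt();
    // Clamp only the initial capacity, never the number of entries we read,
    // so a legitimate count above the cap is never truncated; the list grows.
    // Math.max also guards against a negative count from corrupt input.
    List<String> entries = new ArrayList<>(Math.max(0, Math.min(count, PREALLOC_CAP)));
    for (int i = 0; i < count; i++) {
      entries.add(in.readUTF());  // a bogus huge count now fails later on EOF instead of OOM
    }
    return entries;
  }
}
{code}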

> TestFSEditLogLoader#testValidateEditLogWithCorruptBody gets OutOfMemoryError 
> and dumps heap.
> --------------------------------------------------------------------------------------------
>
>                 Key: HDFS-5995
>                 URL: https://issues.apache.org/jira/browse/HDFS-5995
>             Project: Hadoop HDFS
>          Issue Type: Test
>          Components: namenode, test
>    Affects Versions: 3.0.0
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>            Priority: Minor
>         Attachments: HDFS-5995.1.patch
>
>
> {{TestFSEditLogLoader#testValidateEditLogWithCorruptBody}} is experiencing 
> {{OutOfMemoryError}} and dumping heap since the merge of HDFS-4685.  This 
> doesn't actually cause the test to fail, because it's a failure test that 
> corrupts an edit log intentionally.  Still, this might cause confusion if 
> someone reviews the build logs and thinks this is a more serious problem.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
