[ https://issues.apache.org/jira/browse/HDFS-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918505#comment-13918505 ]
Chris Nauroth commented on HDFS-5995:
-------------------------------------

bq. That's why we came up with {{dfs.namenode.max.op.size}} – see HDFS-3440 for details. Chris Nauroth, it sounds like the ACL code is not honoring this parameter. We should fix it.

[~cmccabe], no, that's not quite the problem here. The ACL ops go through the same old stream classes, so the max op size will be honored just like any other op. It's not the op itself that is too large. The problem is that some of the ops can now encode a variable-length array of ACL entries. The first field is the array length, and we use it to pre-allocate an array of matching size (avoiding realloc costs). The corruption tests position the input stream such that a very large integer gets misinterpreted as this array length, so the pre-allocation consumes too much memory.

I believe Haohui's argument is that very little is done to validate the actual data contained within an op before we try to apply it. This goes beyond just checking the number of bytes in the op.

My first attempt at a patch was a test-only fix, but I wasn't successful. It's very hard to corrupt the edits in exactly the right way: you have to predict how the stream gets positioned afterwards and then make sure the wrong integer doesn't get misinterpreted as the array length. Even if we find a way to do that, the test would be tightly coupled to the exact implementation details of the current edit log format, so it would be brittle. Maybe I'm just not seeing a clever trick here, though.

FWIW, I don't think the approach in the current patch is such a bad thing. It only constrains the pre-allocation size; the array is still dynamically sized, so if the maximum number of entries changes from 32 to some other number in the future (or even becomes configurable), we will never truncate the array and drop entries. (A rough sketch of this pattern follows after the issue summary below.) Perhaps it's good to try to prevent this one corruption case, as Colin suggested, even if we can't cover every possible kind of corruption.

> TestFSEditLogLoader#testValidateEditLogWithCorruptBody gets OutOfMemoryError and dumps heap.
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-5995
>                 URL: https://issues.apache.org/jira/browse/HDFS-5995
>             Project: Hadoop HDFS
>          Issue Type: Test
>          Components: namenode, test
>    Affects Versions: 3.0.0
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>            Priority: Minor
>         Attachments: HDFS-5995.1.patch
>
>
> {{TestFSEditLogLoader#testValidateEditLogWithCorruptBody}} is experiencing {{OutOfMemoryError}} and dumping heap since the merge of HDFS-4685. This doesn't actually cause the test to fail, because it's a failure test that intentionally corrupts an edit log. Still, it might cause confusion if someone reviews the build logs and thinks it's a more serious problem.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
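Editor's note: the pattern described above (cap only the pre-allocation, keep the list dynamically sized) can be illustrated with a minimal sketch. This is not the HDFS-5995 patch itself; the class name, method name, the {{DataInput}}-based signature, and the use of {{String}} as a stand-in for the real ACL entry type are assumptions for illustration. The 32-entry figure comes from the comment, but the real constant name is hypothetical.

{code:java}
import java.io.DataInput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not actual HDFS code.
public class AclEntryReadSketch {

  // Assumed cap; the real limit on ACL entries per op may use a different name/value.
  private static final int MAX_PREALLOCATED_ENTRIES = 32;

  static List<String> readEntries(DataInput in) throws IOException {
    // The first field of the op encodes the number of ACL entries that follow.
    int declaredCount = in.readInt();
    if (declaredCount < 0) {
      throw new IOException("Corrupt op: negative ACL entry count " + declaredCount);
    }

    // Vulnerable pattern: new ArrayList<>(declaredCount) would let a corrupted
    // value (e.g. a huge positive integer) trigger an enormous allocation and an
    // OutOfMemoryError before any corruption check runs. Capping the initial
    // capacity avoids that, while a legitimate list longer than the cap is still
    // read in full because ArrayList grows as needed (nothing is truncated).
    List<String> entries =
        new ArrayList<>(Math.min(declaredCount, MAX_PREALLOCATED_ENTRIES));

    for (int i = 0; i < declaredCount; i++) {
      // Stand-in for decoding a real ACL entry; a bogus declaredCount now fails
      // here with an ordinary read error instead of exhausting the heap.
      entries.add(in.readUTF());
    }
    return entries;
  }
}
{code}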