[
https://issues.apache.org/jira/browse/MAPREDUCE-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ConfX updated MAPREDUCE-7444:
-----------------------------
Affects Version/s: 3.3.3
> LineReader improperly sets AdditionalRecord, causing extra line read
> --------------------------------------------------------------------
>
> Key: MAPREDUCE-7444
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7444
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 3.3.3
> Reporter: ConfX
> Priority: Critical
> Attachments: reproduce.sh
>
>
> h2. What happened:
> After setting {{io.file.buffer.size}} to certain precise values, the
> LineReader sometimes mistakenly reads an extra line past the end of its split.
> Note that this bug is highly critical: a malicious party with knowledge of
> the buffer size setting and control over the file being read can craft a file
> that makes the read loop run forever.
> h2. Buggy code:
> Consider reading a record file with multiple lines. In MapReduce, this is
> done using
> [org.apache.hadoop.mapred.LineRecordReader|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java#L249],
> which then calls
> [org.apache.hadoop.util.LineReader.readCustomLine|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L268]
> when you specify a custom delimiter (anything other than the default '\n' or
> "\r\n").
> Internally, if the file is compressed, reading the file is implemented by
> [org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CompressedSplitLineReader.java].
> Notice that in its
> [fillBuffer|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CompressedSplitLineReader.java#L128]
> function, which refills the buffer, if {{inDelimiter}} is {{true}} and EOF has
> not yet been reached, the additional-record flag is set to true (assuming the
> delimiter is not CRLF).
> Now, we sometimes want to read the file in separate splits (or up to a
> specific position). Consider the case where the end of the split is a
> delimiter; this is where things get interesting. A simple example illustrates
> the problem. Suppose the delimiter is 10 bytes long, and the buffer size is
> chosen precisely so that the second-to-last buffer ends exactly in the middle
> of the delimiter. In the final iteration of the loop in
> {{{}readCustomLine{}}}, the buffer is refilled first:
> {noformat}
> ...
> if (bufferPosn >= bufferLength) {
>   startPosn = bufferPosn = 0;
>   bufferLength = fillBuffer(in, buffer, ambiguousByteCount > 0);
>   ...{noformat}
> Note that {{needAdditionalRecord}} is then set to {{true}} in
> {{{}CompressedSplitLineReader{}}}.
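The flag-setting behavior described above can be sketched as follows. This is a simplified, hypothetical paraphrase of the linked {{fillBuffer}} code, not a verbatim copy; the class name {{FillBufferSketch}} is ours.

```java
// FillBufferSketch.java -- simplified, hypothetical paraphrase of the
// flag-setting logic in CompressedSplitLineReader.fillBuffer; not the
// actual Hadoop code.
public class FillBufferSketch {
    // Returns whether the additional-record flag should be raised after a
    // refill: the previous buffer ended inside a delimiter (inDelimiter)
    // and the stream still has data (bytesRead > 0).
    static boolean needAdditionalRecord(boolean inDelimiter, int bytesRead,
                                        boolean usingCRLF, byte firstByte) {
        if (inDelimiter && bytesRead > 0) {
            // CRLF case: flagged only if the new buffer does not start with
            // the trailing '\n'; custom (non-CRLF) delimiter: always flagged.
            return usingCRLF ? firstByte != '\n' : true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Custom 10-byte delimiter split across buffers, not yet at EOF:
        System.out.println(needAdditionalRecord(true, 5, false, (byte) 'x'));
        // prints: true
    }
}
```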
> Next, after reading and matching the second half of the final delimiter, the
> function calculates the bytes read:
> {noformat}
> int readLength = bufferPosn - startPosn;
> bytesConsumed += readLength;
> int appendLength = readLength - delPosn;
> ...
> if (appendLength >= 0 && ambiguousByteCount > 0) {
>   ...
>   unsetNeedAdditionalRecordAfterSplit();
> }{noformat}
> Note that here {{readLength}} is 5 (half the delimiter length) and
> {{appendLength}} is {{{}5-10=-5{}}}. Thus, the condition of the {{if}} clause
> is false and {{needAdditionalRecord}} is not unset.
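The arithmetic above can be checked with a small standalone snippet. This is illustrative only; the class name {{AppendLengthDemo}} is ours, while the variable names follow {{readCustomLine}}.

```java
// AppendLengthDemo.java -- worked example of the readLength/appendLength
// arithmetic from readCustomLine in the scenario above (illustrative,
// not Hadoop code).
public class AppendLengthDemo {
    // appendLength as computed in readCustomLine: bytes read in this
    // buffer minus the delimiter position.
    static int appendLength(int bufferPosn, int startPosn, int delPosn) {
        return (bufferPosn - startPosn) - delPosn;
    }

    public static void main(String[] args) {
        int startPosn = 0;    // refilled buffer starts at position 0
        int bufferPosn = 5;   // advanced past the 5 remaining delimiter bytes
        int delPosn = 10;     // the full 10-byte delimiter has been matched
        int ambiguousByteCount = 5; // delimiter bytes from the last buffer

        int a = appendLength(bufferPosn, startPosn, delPosn); // 5 - 10 = -5
        // The guard that would clear the flag is skipped, since a < 0:
        boolean unsetsFlag = (a >= 0 && ambiguousByteCount > 0);
        System.out.println(a + " " + unsetsFlag); // prints: -5 false
    }
}
```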
> In fact, the {{needAdditionalRecord}} flag is never unset, all the way until
> {{readCustomLine}} ends.
> However, it should be set to {{{}false{}}}; otherwise, the next time the user
> calls {{{}next{}}}, the reader reads at least one more line, since the
> {{while}} condition in {{{}next{}}}:
> {noformat}
> while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit())
> {noformat}
> evaluates to true because EOF has not been reached and
> {{needAdditionalRecord}} is still {{{}true{}}}.
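The effect of the stale flag on this loop can be sketched like so. This is a hypothetical simplification; {{ExtraRecordDemo}} and its counters are ours, not the Hadoop implementation.

```java
// ExtraRecordDemo.java -- hypothetical simplification of the while
// condition in next(): once the position is past the split end, only the
// needAdditionalRecord flag decides whether another record is read.
public class ExtraRecordDemo {
    static int recordsReadPastSplit(boolean needAdditionalRecord) {
        long pos = 101, end = 100; // position already past the split end
        int records = 0;
        while (pos <= end || needAdditionalRecord) {
            records++;                    // read one more line
            needAdditionalRecord = false; // a correct reader clears the flag
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(recordsReadPastSplit(false)); // prints: 0
        System.out.println(recordsReadPastSplit(true));  // prints: 1
    }
}
```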
> This can lead to severe problems whenever the boundary of the final buffer of
> a split falls inside the delimiter.
> h2. How to reproduce:
> (1) Set {{io.file.buffer.size}} to {{188}}
> (2) Run test:
> {{org.apache.hadoop.mapred.TestLineRecordReader#testBzipWithMultibyteDelimiter}}
> For an easy reproduction please run the {{reproduce.sh}} included in the
> attachment.
> h2. StackTrace:
> {noformat}
> java.lang.AssertionError: Unexpected number of records in split expected:<60>
> but was:<61>
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at
> org.apache.hadoop.mapred.TestLineRecordReader.testSplitRecordsForFile(TestLineRecordReader.java:110)
> at
> org.apache.hadoop.mapred.TestLineRecordReader.testBzipWithMultibyteDelimiter(TestLineRecordReader.java:684){noformat}
> We are happy to provide a patch if this issue is confirmed.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)