[
https://issues.apache.org/jira/browse/MAPREDUCE-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ConfX updated MAPREDUCE-7444:
-----------------------------
Affects Version/s: 3.3.3
> LineReader improperly sets AdditionalRecord, causing extra line read
> --------------------------------------------------------------------
>
> Key: MAPREDUCE-7444
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7444
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 3.3.3
> Reporter: ConfX
> Priority: Critical
> Attachments: reproduce.sh
>
>
> h2. What happened:
> After setting {{io.file.buffer.size}} to certain precise values, the
> LineReader sometimes mistakenly reads an extra line past the end of its split.
> Note that this bug is highly critical: a malicious party with knowledge of
> the buffer size setting and control over the file being read can craft a file
> that makes the read loop run forever.
> h2. Buggy code:
> Consider reading a record file with multiple lines. In MapReduce, this is
> done using
> [org.apache.hadoop.mapred.LineRecordReader|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java#L249],
> which then calls
> [org.apache.hadoop.util.LineReader.readCustomLine|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L268]
> when you specify a custom delimiter (anything other than the default '\n' or
> "\r\n").
> Internally, if the file is compressed, reading the file is implemented by
> [org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CompressedSplitLineReader.java].
> Notice that in its
> [fillBuffer|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CompressedSplitLineReader.java#L128]
> function, which refills the buffer, if {{inDelimiter}} is {{true}} and EOF has
> not yet been reached, the additional-record flag is set to true (assuming the
> delimiter is not CRLF).
> Now, we sometimes want to read the file in separate splits (or up to a
> specific position). Consider the case where the end of the split is a
> delimiter; this is where things get interesting. A simple example illustrates
> the problem. Suppose the delimiter is 10 bytes long, and the buffer size is
> chosen precisely so that the second-to-last buffer ends exactly in the middle
> of the delimiter. In the final iteration of the loop in
> {{{}readCustomLine{}}}, the buffer is refilled first:
> {noformat}
> ...
> if (bufferPosn >= bufferLength) {
>   startPosn = bufferPosn = 0;
>   bufferLength = fillBuffer(in, buffer, ambiguousByteCount > 0);
>   ...{noformat}
> Note that {{needAdditionalRecord}} is then set to {{true}} in
> {{{}CompressedSplitLineReader{}}}.
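The flag-setting behavior described above can be sketched as follows. This is a simplified, hypothetical paraphrase of the linked {{fillBuffer}} code, not a verbatim copy; the class name {{FillBufferSketch}} is ours.

```java
// FillBufferSketch.java -- simplified, hypothetical paraphrase of the
// flag-setting logic in CompressedSplitLineReader.fillBuffer; not the
// actual Hadoop code.
public class FillBufferSketch {
    // Returns whether the additional-record flag should be raised after a
    // refill: the previous buffer ended inside a delimiter (inDelimiter)
    // and the stream still has data (bytesRead > 0).
    static boolean needAdditionalRecord(boolean inDelimiter, int bytesRead,
                                        boolean usingCRLF, byte firstByte) {
        if (inDelimiter && bytesRead > 0) {
            // CRLF case: flagged only if the new buffer does not start with
            // the trailing '\n'; custom (non-CRLF) delimiter: always flagged.
            return usingCRLF ? firstByte != '\n' : true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Custom 10-byte delimiter split across buffers, not yet at EOF:
        System.out.println(needAdditionalRecord(true, 5, false, (byte) 'x'));
        // prints: true
    }
}
```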
> Next, after reading and matching the second half of the final delimiter, the
> function calculates the bytes read:
> {noformat}
> int readLength = bufferPosn - startPosn;
> bytesConsumed += readLength;
> int appendLength = readLength - delPosn;
> ...
> if (appendLength >= 0 && ambiguousByteCount > 0) {
>   ...
>   unsetNeedAdditionalRecordAfterSplit();
> }{noformat}
> Note that here {{readLength}} is 5 (half the delimiter length) and
> {{appendLength}} is {{{}5-10=-5{}}}. Thus, the condition of the {{if}} clause
> is false and {{needAdditionalRecord}} is not unset.
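The arithmetic above can be checked with a small standalone snippet. This is illustrative only; the class name {{AppendLengthDemo}} is ours, while the variable names follow {{readCustomLine}}.

```java
// AppendLengthDemo.java -- worked example of the readLength/appendLength
// arithmetic from readCustomLine in the scenario above (illustrative,
// not Hadoop code).
public class AppendLengthDemo {
    // appendLength as computed in readCustomLine: bytes read in this
    // buffer minus the delimiter position.
    static int appendLength(int bufferPosn, int startPosn, int delPosn) {
        return (bufferPosn - startPosn) - delPosn;
    }

    public static void main(String[] args) {
        int startPosn = 0;    // refilled buffer starts at position 0
        int bufferPosn = 5;   // advanced past the 5 remaining delimiter bytes
        int delPosn = 10;     // the full 10-byte delimiter has been matched
        int ambiguousByteCount = 5; // delimiter bytes from the last buffer

        int a = appendLength(bufferPosn, startPosn, delPosn); // 5 - 10 = -5
        // The guard that would clear the flag is skipped, since a < 0:
        boolean unsetsFlag = (a >= 0 && ambiguousByteCount > 0);
        System.out.println(a + " " + unsetsFlag); // prints: -5 false
    }
}
```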
> In fact, the {{needAdditionalRecord}} flag is never unset, all the way until
> {{readCustomLine}} ends.
> However, it should be set to {{{}false{}}}; otherwise, the next time the user
> calls {{{}next{}}}, the reader reads at least one more line, since the
> {{while}} condition in {{{}next{}}}:
> {noformat}
> while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit())
> {noformat}
> evaluates to true because EOF has not been reached and
> {{needAdditionalRecord}} is still {{{}true{}}}.
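The effect of the stale flag on this loop can be sketched like so. This is a hypothetical simplification; {{ExtraRecordDemo}} and its counters are ours, not the Hadoop implementation.

```java
// ExtraRecordDemo.java -- hypothetical simplification of the while
// condition in next(): once the position is past the split end, only the
// needAdditionalRecord flag decides whether another record is read.
public class ExtraRecordDemo {
    static int recordsReadPastSplit(boolean needAdditionalRecord) {
        long pos = 101, end = 100; // position already past the split end
        int records = 0;
        while (pos <= end || needAdditionalRecord) {
            records++;                    // read one more line
            needAdditionalRecord = false; // a correct reader clears the flag
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(recordsReadPastSplit(false)); // prints: 0
        System.out.println(recordsReadPastSplit(true));  // prints: 1
    }
}
```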
> This can lead to severe problems whenever the boundary of the final buffer of
> a split falls inside the delimiter.
> h2. How to reproduce:
> (1) Set {{io.file.buffer.size}} to {{188}}
> (2) Run test:
> {{org.apache.hadoop.mapred.TestLineRecordReader#testBzipWithMultibyteDelimiter}}
> For an easy reproduction please run the {{reproduce.sh}} included in the
> attachment.
> h2. StackTrace:
> {noformat}
> java.lang.AssertionError: Unexpected number of records in split expected:<60>
> but was:<61>
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at
> org.apache.hadoop.mapred.TestLineRecordReader.testSplitRecordsForFile(TestLineRecordReader.java:110)
> at
> org.apache.hadoop.mapred.TestLineRecordReader.testBzipWithMultibyteDelimiter(TestLineRecordReader.java:684){noformat}
> We are happy to provide a patch if this issue is confirmed.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)