[ https://issues.apache.org/jira/browse/MAPREDUCE-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944021#comment-17944021 ]
terrytlu commented on MAPREDUCE-7444:
-------------------------------------
hello [~FuzzingTeam], did you find any solution? We also encountered a similar problem.

> LineReader improperly sets AdditionalRecord, causing extra line read
> --------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7444
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7444
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 3.3.3
>            Reporter: ConfX
>            Priority: Critical
>         Attachments: reproduce.sh
>
> h2. What happened:
> After setting {{io.file.buffer.size}} to certain precise values, the LineReader sometimes mistakenly reads an extra line after the split.
> Note that this bug is highly critical: a malicious party with knowledge of the buffer size setting and control over the file being read can craft a file that makes the while loop run forever.
> h2. Buggy code:
> Consider reading a record file with multiple lines. In MapReduce, this is done using [org.apache.hadoop.mapred.LineRecordReader|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java#L249], which calls [org.apache.hadoop.util.LineReader.readCustomLine|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L268] if you specify your own delimiter (rather than the default '\n' or "\r\n").
> If the file is compressed, the reading is implemented by [org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CompressedSplitLineReader.java].
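The trigger condition the report goes on to describe is a buffer refill boundary that lands in the middle of a multi-byte custom delimiter. A standalone arithmetic sketch of that straddling case (plain Java, not Hadoop code; the delimiter offset is an assumed number chosen to illustrate the 5/5 split, and only the buffer size 188 and delimiter length 10 come from the report):

```java
public class DelimiterStraddleSketch {
    public static void main(String[] args) {
        int delimiterLen = 10;     // 10-byte delimiter, as in the report's example
        int bufferSize = 188;      // io.file.buffer.size from the reproduction steps
        int delimiterStart = 183;  // hypothetical absolute offset of the delimiter

        // How many delimiter bytes fit in the current buffer vs. the next one.
        int inCurrentBuffer = bufferSize - delimiterStart;  // 5
        int inNextBuffer = delimiterLen - inCurrentBuffer;  // 5

        // When both halves are non-empty, fillBuffer() runs while the reader is
        // "inside" the delimiter -- the state that, per the report, flips
        // needAdditionalRecord in CompressedSplitLineReader.
        boolean refillMidDelimiter = inCurrentBuffer > 0 && inNextBuffer > 0;
        System.out.println(inCurrentBuffer + "/" + inNextBuffer
                + " midDelimiter=" + refillMidDelimiter);
        // prints "5/5 midDelimiter=true"
    }
}
```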
> Notice that in CompressedSplitLineReader's [fillBuffer|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CompressedSplitLineReader.java#L128] function, which refills the buffer, when {{inDelimiter}} is {{true}} and EOF has not been reached, the switch for an additional record is set to {{true}} (assuming the delimiter is not CRLF).
> Now, we sometimes want to read the file in different splits (or up until a specific position). Consider the case where the end of the split is a delimiter; this is where things get interesting. We use a simple example to illustrate. Suppose the delimiter is 10 bytes long, and the buffer size is set precisely so that the second-to-last buffer ends in the exact middle of the delimiter. In the final iteration of the loop in {{readCustomLine}}, the buffer is refilled first:
> {noformat}
> ...
> if (bufferPosn >= bufferLength) {
>   startPosn = bufferPosn = 0;
>   bufferLength = fillBuffer(in, buffer, ambiguousByteCount > 0);
> ...{noformat}
> Note that {{needAdditionalRecord}} is set to {{true}} in {{CompressedSplitLineReader}}.
> Next, after reading and comparing the second half of the final delimiter, the function calculates the bytes read:
> {noformat}
> int readLength = bufferPosn - startPosn;
> bytesConsumed += readLength;
> int appendLength = readLength - delPosn;
> ...
> if (appendLength >= 0 && ambiguousByteCount > 0) {
>   ...
>   unsetNeedAdditionalRecordAfterSplit();
> }{noformat}
> Here {{readLength}} is 5 (half of the delimiter length) and {{appendLength}} is {{5-10=-5}}. The condition of the {{if}} clause is therefore false, and {{needAdditionalRecord}} is not unset.
> In fact, the {{needAdditionalRecord}} switch is never unset before {{readCustomLine}} returns.
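The failing guard can be reproduced with plain arithmetic. This is a standalone sketch (not the actual Hadoop code) using the hypothetical numbers from the 10-byte-delimiter example above:

```java
public class AppendLengthSketch {
    public static void main(String[] args) {
        // State after the refill in the example: the last 5 bytes of the
        // 10-byte delimiter sit at the start of the fresh buffer.
        int startPosn = 0;
        int bufferPosn = 5;          // consumed the remaining 5 delimiter bytes
        int delPosn = 10;            // the full delimiter has now been matched
        int ambiguousByteCount = 5;  // delimiter bytes seen before the refill

        int readLength = bufferPosn - startPosn;  // 5
        int appendLength = readLength - delPosn;  // 5 - 10 = -5

        // The guard that would call unsetNeedAdditionalRecordAfterSplit():
        boolean unsetsFlag = (appendLength >= 0 && ambiguousByteCount > 0);
        System.out.println("readLength=" + readLength
                + " appendLength=" + appendLength
                + " unsetsFlag=" + unsetsFlag);
        // prints "readLength=5 appendLength=-5 unsetsFlag=false",
        // i.e. needAdditionalRecord is left set to true
    }
}
```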
> However, it should be set to {{false}}; otherwise, the next time the user calls {{next}}, the reader reads at least one more line, since the {{while}} condition in {{next}}:
> {noformat}
> while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit())
> {noformat}
> is true because EOF has not been reached and {{needAdditionalRecord}} is set to {{true}}.
> This can lead to severe problems if, after every split, the start of the final buffer falls inside the delimiter.
> h2. How to reproduce:
> (1) Set {{io.file.buffer.size}} to {{188}}
> (2) Run test: {{org.apache.hadoop.mapred.TestLineRecordReader#testBzipWithMultibyteDelimiter}}
> For an easy reproduction, please run the {{reproduce.sh}} included in the attachment.
> h2. StackTrace:
> {noformat}
> java.lang.AssertionError: Unexpected number of records in split expected:<60> but was:<61>
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:647)
>   at org.apache.hadoop.mapred.TestLineRecordReader.testSplitRecordsForFile(TestLineRecordReader.java:110)
>   at org.apache.hadoop.mapred.TestLineRecordReader.testBzipWithMultibyteDelimiter(TestLineRecordReader.java:684){noformat}
> We are happy to provide a patch if this issue is confirmed.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)