ConfX created MAPREDUCE-7444:
--------------------------------

             Summary: LineReader improperly sets AdditionalRecord, causing extra line read
                 Key: MAPREDUCE-7444
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7444
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: ConfX
         Attachments: reproduce.sh
h2. What happened:

After setting {{io.file.buffer.size}} to certain exact values, the LineReader sometimes mistakenly reads an extra line after the split. This bug is potentially critical: a malicious party with knowledge of the buffer size setting and control over the file being read can craft a file that makes the {{while}} loop run forever.

h2. Buggy code:

Consider reading a record file with multiple lines. In MapReduce this is done by [org.apache.hadoop.mapred.LineRecordReader|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java#L249], which calls [org.apache.hadoop.util.LineReader.readCustomLine|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L268] when you specify your own delimiter (rather than the default '\n' or "\r\n"). If the file is compressed, the underlying reading is implemented by [org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CompressedSplitLineReader.java]. Notice that in its [fillBuffer|https://github.com/confuzz-org/fuzz-hadoop/blob/confuzz/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CompressedSplitLineReader.java#L128] function, which refills the buffer, when {{inDelimiter}} is {{true}} and EOF has not yet been reached, the additional-record switch is set to {{true}} (assuming the delimiter is not CRLF).

Now, we sometimes want to read the file in separate splits (or up to a specific position). Consider the case where the end of the split falls on a delimiter. This is where things get interesting.
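To make the flag mechanism concrete, here is a minimal model of the behavior described above. This is a simplified sketch, not the actual Hadoop class: the class name {{SplitFlagModel}} and its method shapes are invented for illustration; only the rule itself (refilling mid-delimiter before EOF turns the switch on) comes from the report.

```java
// Simplified model (NOT the Hadoop implementation) of the rule described
// above: when the buffer is refilled while the reader is in the middle of
// a delimiter (inDelimiter == true) and EOF has not been reached, the
// needAdditionalRecord switch is turned on.
public class SplitFlagModel {
    private boolean needAdditionalRecord = false;

    // Hypothetical stand-in for CompressedSplitLineReader.fillBuffer():
    // returns the number of bytes refilled and applies the rule above.
    int fillBuffer(int bytesAvailable, boolean inDelimiter) {
        boolean atEof = bytesAvailable <= 0;
        if (inDelimiter && !atEof) {
            needAdditionalRecord = true; // an extra record is now expected
        }
        return Math.max(bytesAvailable, 0);
    }

    boolean needAdditionalRecordAfterSplit() {
        return needAdditionalRecord;
    }

    public static void main(String[] args) {
        SplitFlagModel m = new SplitFlagModel();
        m.fillBuffer(128, false); // normal refill: flag stays off
        System.out.println(m.needAdditionalRecordAfterSplit()); // false
        m.fillBuffer(128, true);  // refill mid-delimiter, not at EOF: flag set
        System.out.println(m.needAdditionalRecordAfterSplit()); // true
    }
}
```

Once set, nothing in this model turns the flag back off, which is exactly the situation the rest of this report walks through.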
We use a simple example to illustrate this. Suppose the delimiter is 10 bytes long, and the buffer size is chosen precisely so that the second-to-last buffer ends exactly in the middle of the delimiter. In the final iteration of the loop in {{readCustomLine}}, the buffer is refilled first:
{noformat}
...
if (bufferPosn >= bufferLength) {
  startPosn = bufferPosn = 0;
  bufferLength = fillBuffer(in, buffer, ambiguousByteCount > 0);
...
{noformat}
Note that {{needAdditionalRecord}} is set to {{true}} in {{CompressedSplitLineReader}}. Next, after reading and matching the second half of the final delimiter, the function calculates the bytes read:
{noformat}
int readLength = bufferPosn - startPosn;
bytesConsumed += readLength;
int appendLength = readLength - delPosn;
...
if (appendLength >= 0 && ambiguousByteCount > 0) {
  ...
  unsetNeedAdditionalRecordAfterSplit();
}
{noformat}
Here {{readLength}} is 5 (half the delimiter length) and {{appendLength}} is {{5 - 10 = -5}}. The {{if}} condition is therefore false, and {{needAdditionalRecord}} is not unset. In fact, {{needAdditionalRecord}} is never unset anywhere before {{readCustomLine}} returns. It should be set to {{false}}, however: otherwise, the next time the user calls {{next}}, the reader reads at least one more line, because the {{while}} condition in {{next}}:
{noformat}
while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
{noformat}
is true, since EOF has not been reached and {{needAdditionalRecord}} is {{true}}. This can lead to severe problems if, after every split, the start of the final buffer falls inside the delimiter.

h2. How to reproduce:

(1) Set {{io.file.buffer.size}} to {{188}}
(2) Run test: {{org.apache.hadoop.mapred.TestLineRecordReader#testBzipWithMultibyteDelimiter}}

For an easy reproduction, run the {{reproduce.sh}} included in the attachment.

h2. StackTrace:

{noformat}
java.lang.AssertionError: Unexpected number of records in split expected:<60> but was:<61>
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.apache.hadoop.mapred.TestLineRecordReader.testSplitRecordsForFile(TestLineRecordReader.java:110)
	at org.apache.hadoop.mapred.TestLineRecordReader.testBzipWithMultibyteDelimiter(TestLineRecordReader.java:684)
{noformat}
We are happy to provide a patch if this issue is confirmed.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
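The {{appendLength}} arithmetic walked through above can be checked with a short standalone sketch. This is a simplified model of the snippet quoted from {{readCustomLine}}, not Hadoop code; the class name {{AppendLengthSketch}} and the concrete values (10-byte delimiter split 5/5 across buffers) are assumptions taken from the example in this report.

```java
// Standalone sketch of the readCustomLine arithmetic described above,
// using the report's example: a 10-byte delimiter whose first 5 bytes
// were consumed at the end of the previous buffer.
public class AppendLengthSketch {
    public static void main(String[] args) {
        int delimiterLength = 10;      // assumed multi-byte delimiter
        int ambiguousByteCount = 5;    // first half seen in the previous buffer
        int startPosn = 0;
        int bufferPosn = 5;            // second half fills the final buffer
        int delPosn = delimiterLength; // full delimiter has now been matched

        int readLength = bufferPosn - startPosn; // 5
        int appendLength = readLength - delPosn; // 5 - 10 = -5

        // The guard quoted above requires appendLength >= 0, so with a
        // negative appendLength the unsetNeedAdditionalRecordAfterSplit()
        // call is skipped and the flag stays true.
        boolean unsetCalled = (appendLength >= 0 && ambiguousByteCount > 0);
        System.out.println("appendLength = " + appendLength); // -5
        System.out.println("unset called = " + unsetCalled);  // false
    }
}
```

This reproduces the report's numbers: {{readLength}} is 5, {{appendLength}} is -5, and the unset branch never runs.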