[ https://issues.apache.org/jira/browse/HADOOP-18321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prabhu Joseph resolved HADOOP-18321.
------------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

> Fix when to read an additional record from a BZip2 text file split
> ------------------------------------------------------------------
>
>                 Key: HADOOP-18321
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18321
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.3.3
>            Reporter: Ashutosh Gupta
>            Assignee: Ashutosh Gupta
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>         Time Spent: 1.5h
> Remaining Estimate: 0h
>
> This change fixes a data correctness issue in TextInputFormat that can occur when reading BZip2-compressed text files. When triggered, the bug causes a split to return the first record of the succeeding split that reads the next BZip2 block, thereby duplicating that record.
>
> *When is the bug triggered?*
> The bug occurs when the flag "needAdditionalRecord" in CompressedSplitLineReader is set to true by #fillBuffer at an inappropriate time: when we have not yet read all of the remaining bytes of the split. This can happen when the inDelimiter parameter is true while #fillBuffer is invoked to read the next line. The inDelimiter parameter is true when either 1) the last byte of the buffer is a CR character ('\r') when using the default delimiters, or 2) the last bytes of the buffer match a prefix of the delimiter when using a custom delimiter.
> This can occur in various edge cases, illustrated by the unit tests added in this change -- specifically, the five that would fail without the fix are listed below:
> # BaseTestLineRecordReaderBZip2.customDelimiter_lastRecordDelimiterStartsAtNextBlockStart
> # BaseTestLineRecordReaderBZip2.firstBlockEndsWithLF_secondBlockStartsWithCR
> # BaseTestLineRecordReaderBZip2.delimitedByCRSpanningThreeBlocks
> # BaseTestLineRecordReaderBZip2.usingCRDelimiterWithSmallestBufferSize
> # BaseTestLineRecordReaderBZip2.customDelimiter_lastThreeBytesInBlockAreDelimiter
>
> For background, the purpose of the "needAdditionalRecord" field in CompressedSplitLineReader is to indicate to LineRecordReader, via the #needAdditionalRecordAfterSplit method, that an extra record lying beyond the split range should be included in the split. This complication arises from a general problem when splitting text files: when a split starts at a position greater than zero, we do not know whether its first line belongs to the last record of the prior split or starts a new record. The solution used in Hadoop is to have every split that starts at a position greater than zero discard its first line, and to let the prior split decide whether to include the first line of the next split (either as part of its last record or as a new record). This works well even when a single line spans multiple splits.
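As a rough illustration of that convention (and only of that convention), here is a minimal sketch that reads "splits" of an in-memory, uncompressed byte array: a split that starts past position 0 discards its first line, and a split owns every record that starts at or before its end, even when that record's bytes spill into the next split. The class and method names are invented for this example; it handles only '\n' delimiters and ignores the compression-block and CR/LF subtleties this issue is actually about.

{code:java}
// Illustrative sketch only -- not the Hadoop implementation.
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitConventionSketch {

  /** Returns the records assigned to the split [start, start + length). */
  static List<String> readSplit(byte[] data, int start, int length) {
    List<String> records = new ArrayList<>();
    int pos = start;
    if (start > 0) {
      // Discard the first (possibly partial) line; the prior split owns it.
      pos = skipPastNewline(data, start);
    }
    // Keep emitting records while the record starts at or before the split
    // end; the record that begins exactly at the boundary still belongs to
    // this split, because the next split will discard its first line.
    while (pos < data.length && pos <= start + length) {
      int end = skipPastNewline(data, pos);
      records.add(new String(data, pos, lineLength(data, pos, end),
          StandardCharsets.UTF_8));
      pos = end;
    }
    return records;
  }

  private static int skipPastNewline(byte[] data, int from) {
    int i = from;
    while (i < data.length && data[i] != '\n') {
      i++;
    }
    return Math.min(i + 1, data.length);   // position just after the '\n'
  }

  private static int lineLength(byte[] data, int from, int endExclusive) {
    int len = endExclusive - from;
    if (len > 0 && data[from + len - 1] == '\n') {
      len--;                               // strip the trailing '\n'
    }
    return len;
  }

  public static void main(String[] args) {
    byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
    // Two 10-byte splits: the first split also claims "bravo", which starts
    // inside it but whose bytes run past the boundary at offset 10; the
    // second split discards that partial first line.
    System.out.println(readSplit(data, 0, 10));   // [alpha, bravo]
    System.out.println(readSplit(data, 10, 10));  // [charlie]
  }
}
{code}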
> *What is the fix?*
> The fix is to never set "needAdditionalRecord" unless the bytes filled into the buffer are the bytes immediately outside the range of the split. When reading compressed data, CompressedSplitLineReader requires/assumes that the stream's #read method never returns bytes from more than one compression block at a time. This ensures that #fillBuffer gets invoked to read the first byte of the next block. That next block may or may not be part of the split we are reading.
> If we detect that the last bytes of the prior block may be part of a delimiter, then we may decide to read an additional record, but we should only do so when the next block is not part of our split *and* we are not filling the buffer yet again beyond our split range. This is because the only question that matters is whether we need to read the very first line of the next split as a separate record. If that line is going to be part of our last record, then we do not need to read an extra record; and in the special case of CR + LF (i.e. "\r\n"), if the LF is the first byte of the next split, it will be treated as an empty line, so again we do not need to include an extra record.
> To emphasize: what matters is the moment we read the first bytes outside our split range. The current logic in CompressedSplitLineReader does not take that into account, in contrast to UncompressedSplitLineReader, which does.
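To make that corrected condition concrete, here is a simplified sketch of the guard described above. It is not the actual HADOOP-18321 patch; the class, method, and parameter names (SplitCrossingTracker, onBufferFilled, positionBeforeRead, positionAfterRead) are invented, and the position bookkeeping is only meant to capture the idea of "the fill that returns the first bytes beyond the split end".

{code:java}
// Illustrative only -- not the actual HADOOP-18321 patch; all names invented.
// The point is the guard: the "need an additional record" flag may only be
// set by the buffer fill that returns the first bytes beyond the end of the
// split, and even then not when a CR at the split end is followed by an LF
// (the next split treats that lone LF as an empty line and discards it).
public class SplitCrossingTracker {

  private final long splitEnd;      // first byte position past our split range
  private final boolean usingCrLf;  // true when using the default delimiters
  private boolean needAdditionalRecord = false;

  public SplitCrossingTracker(long splitEnd, boolean usingCrLf) {
    this.splitEnd = splitEnd;
    this.usingCrLf = usingCrLf;
  }

  /**
   * Called after each buffer fill. inDelimiter means the previously buffered
   * bytes ended with what may be the start of a delimiter ('\r', or a prefix
   * of a custom delimiter).
   */
  public void onBufferFilled(byte[] buffer, int bytesRead, boolean inDelimiter,
                             long positionBeforeRead, long positionAfterRead) {
    // Only the fill that steps over the end of the split may request an
    // extra record. Fills that stay inside the split, or that happen after
    // we have already gone past it, must leave the flag alone.
    boolean firstBytesBeyondSplit =
        positionBeforeRead <= splitEnd && positionAfterRead > splitEnd;

    if (inDelimiter && bytesRead > 0 && firstBytesBeyondSplit) {
      if (usingCrLf) {
        // A '\r' ended our split; if the next byte is '\n', the next split
        // sees a bare LF, treats it as an empty line and drops it, so no
        // extra record is needed.
        needAdditionalRecord = buffer[0] != '\n';
      } else {
        needAdditionalRecord = true;
      }
    }
  }

  public boolean needAdditionalRecordAfterSplit() {
    return needAdditionalRecord;
  }
}
{code}

In spirit this mirrors UncompressedSplitLineReader, which, as noted above, already takes the crossing of the split end into account before deciding that an extra record is needed.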