[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=788672&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-788672 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Jul/22 16:53
            Start Date: 07/Jul/22 16:53
    Worklog Time Spent: 10m 
      Work Description: steveloughran commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r916090530


##########
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/bzip2/TestBZip2TextFileWriter.java:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.io.compress.bzip2;
+
+import static org.apache.hadoop.io.compress.bzip2.BZip2TextFileWriter.BLOCK_SIZE;
+import static org.junit.Assert.assertEquals;
+
+import java.io.ByteArrayInputStream;

Review Comment:
   Bit late, but the imports are completely out of sync with the normal Hadoop 
rules. Check your IDE settings.
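
   For reference, the usual layout in the codebase keeps static imports in a 
block of their own at the end, rather than first. A rough sketch only (the 
repo's style guide and checkstyle rules are authoritative; 
org.apache.hadoop.conf.Configuration here is just a placeholder import):

   ```java
   import java.io.ByteArrayInputStream;

   import org.apache.hadoop.conf.Configuration;
   import org.junit.Test;

   import static org.apache.hadoop.io.compress.bzip2.BZip2TextFileWriter.BLOCK_SIZE;
   import static org.junit.Assert.assertEquals;
   ```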





Issue Time Tracking
-------------------

    Worklog Id:     (was: 788672)
    Time Spent: 2h  (was: 1h 50m)

> Fix when to read an additional record from a BZip2 text file split
> ------------------------------------------------------------------
>
>                 Key: HADOOP-18321
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18321
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.3.3
>            Reporter: Ashutosh Gupta
>            Assignee: Ashutosh Gupta
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> Fix a data correctness issue in TextInputFormat that can occur when reading 
> BZip2-compressed text files. When triggered, this bug causes a split to also 
> return the first record of the succeeding split (the split that reads the 
> next BZip2 block), thereby duplicating that record.
> *When is the bug triggered?*
> The bug occurs when the flag "needAdditionalRecord" in 
> CompressedSplitLineReader is set to true by #fillBuffer at an inappropriate 
> time: before we have read the remaining bytes of the split. This can happen 
> when the inDelimiter parameter is true while #fillBuffer is invoked to read 
> the next line. The inDelimiter parameter is true when either 1) the last 
> byte of the buffer is a CR character ('\r') when using the default 
> delimiters, or 2) the last bytes of the buffer are a proper prefix of the 
> delimiter when using a custom delimiter.
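> As an illustration, here is a small self-contained sketch of the 
> custom-delimiter case of that condition. The class and method names are 
> hypothetical; this is not the Hadoop implementation, only a model of the 
> check described above:
> {code:java}
> // Hypothetical illustration, not Hadoop code: models when "inDelimiter"
> // would be true for a custom delimiter, i.e. when a non-empty suffix of
> // the buffer is a proper prefix of the delimiter bytes.
> public class InDelimiterSketch {
>
>   static boolean tailMayBeDelimiterStart(byte[] buf, int len, byte[] delim) {
>     // Try the longest possible partial match first.
>     for (int k = Math.min(len, delim.length - 1); k >= 1; k--) {
>       boolean match = true;
>       for (int i = 0; i < k; i++) {
>         if (buf[len - k + i] != delim[i]) {
>           match = false;
>           break;
>         }
>       }
>       if (match) {
>         return true;
>       }
>     }
>     return false;
>   }
>
>   public static void main(String[] args) {
>     byte[] delim = "##".getBytes();
>     byte[] buf = "record one#".getBytes();
>     // '#' is a proper prefix of "##", so inDelimiter would be true here;
>     // with the default delimiters the analogous check is buf[len-1] == '\r'.
>     System.out.println(tailMayBeDelimiterStart(buf, buf.length, delim));
>   }
> }
> {code}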
> This can occur in various edge cases, illustrated by unit tests added in 
> this change; specifically, the five tests that would fail without the fix 
> are listed below:
>  # BaseTestLineRecordReaderBZip2.customDelimiter_lastRecordDelimiterStartsAtNextBlockStart
>  # BaseTestLineRecordReaderBZip2.firstBlockEndsWithLF_secondBlockStartsWithCR
>  # BaseTestLineRecordReaderBZip2.delimitedByCRSpanningThreeBlocks
>  # BaseTestLineRecordReaderBZip2.usingCRDelimiterWithSmallestBufferSize
>  # BaseTestLineRecordReaderBZip2.customDelimiter_lastThreeBytesInBlockAreDelimiter
> For background, the purpose of the "needAdditionalRecord" field in 
> CompressedSplitLineReader is to indicate to LineRecordReader, via the 
> #needAdditionalRecordAfterSplit method, that an extra record lying beyond 
> the split range should be included in the split. This complication arises 
> from a fundamental problem with splitting text files: when a split starts at 
> a position greater than zero, we do not know whether its first line belongs 
> to the last record of the prior split or is a new record. The solution in 
> Hadoop is to have every split that starts at a position greater than zero 
> discard its first line, and to have the prior split decide whether to 
> include the first line of the next split (as part of its last record or as 
> a new record). This works well even when a single line spans multiple 
> splits, as the sketch below illustrates.
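> To make that convention concrete, here is a small self-contained example. It 
> is plain Java, not Hadoop code, and the class name is made up:
> {code:java}
> import java.nio.charset.StandardCharsets;
>
> // Illustration only, not Hadoop code: a split starting past offset 0 skips
> // through the first newline, while the prior split reads past its own end
> // to finish its last record, so no record is lost or duplicated.
> public class SplitConventionSketch {
>
>   public static void main(String[] args) {
>     byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
>     int splitPoint = 8;  // falls in the middle of "bravo"
>
>     // Split 1 covers [0, 8) but keeps reading past byte 8 until it finishes
>     // its last record, "bravo". Split 2 covers [8, end) and discards
>     // everything up to and including the first '\n' it sees, so its first
>     // record is "charlie" and "bravo" is not duplicated.
>     int firstNewline = splitPoint;
>     while (data[firstNewline] != '\n') {
>       firstNewline++;
>     }
>     String split2Records = new String(data, firstNewline + 1,
>         data.length - firstNewline - 1, StandardCharsets.UTF_8);
>     System.out.print(split2Records);  // prints "charlie\n"
>   }
> }
> {code}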
> *What is the fix?*
> The fix is to never set "needAdditionalRecord" unless the bytes filled into 
> the buffer are the bytes immediately outside the range of the split.
> When reading compressed data, CompressedSplitLineReader requires/assumes 
> that the stream's #read method never returns bytes from more than one 
> compression block at a time. This ensures that #fillBuffer gets invoked to 
> read the first byte of the next block. This next block may or may not be 
> part of the split we are reading. If we detect that the last bytes of the 
> prior block may be part of a delimiter, then we may decide that we should 
> read an additional record, but we should only do that when this next block 
> is not part of our split *and* we are not filling the buffer again beyond 
> our split range. This is because we are only concerned with whether we need 
> to read the very first line of the next split as a separate record. If it 
> is going to be part of the last record, then we don't need to read an extra 
> record; and in the special case of CR + LF (i.e. "\r\n"), if the LF is the 
> first byte of the next split, it will be treated as an empty line, so we 
> don't need to include an extra record either.
> Thus, to emphasize: it is the read of the first bytes outside our split 
> range that matters. The current logic in CompressedSplitLineReader does not 
> take that into account, in contrast to UncompressedSplitLineReader, which 
> does. A sketch of the intended guard follows.
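> Here is a compilable sketch of that guard. All names are hypothetical (the 
> real change is in PR #4521); in this model, positions are plain byte offsets 
> and the stream is assumed to return at most one compression block per read, 
> as described above:
> {code:java}
> import java.io.IOException;
> import java.io.InputStream;
>
> // Hypothetical model, not the actual patch: an extra record may only be
> // requested by the refill that delivers the first bytes beyond the split.
> class SplitAwareFillerSketch {
>   private final long splitEnd;
>   private long pos = 0;
>   private boolean readPastSplitEnd = false;
>   private boolean needAdditionalRecord = false;
>
>   SplitAwareFillerSketch(long splitEnd) {
>     this.splitEnd = splitEnd;
>   }
>
>   int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
>       throws IOException {
>     // True only for the refill that reads the bytes immediately outside the
>     // split range; later refills can never request an extra record.
>     boolean firstFillPastSplitEnd = !readPastSplitEnd && pos >= splitEnd;
>     int bytesRead = in.read(buffer);
>     if (bytesRead > 0) {
>       pos += bytesRead;
>       // Default-delimiter case: a leading '\n' is just the LF of a "\r\n"
>       // pair and becomes an empty line, so no extra record is needed then.
>       if (inDelimiter && firstFillPastSplitEnd && buffer[0] != '\n') {
>         needAdditionalRecord = true;
>       }
>       readPastSplitEnd |= firstFillPastSplitEnd;
>     }
>     return bytesRead;
>   }
>
>   boolean needAdditionalRecordAfterSplit() {
>     return needAdditionalRecord;
>   }
> }
> {code}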


