[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=789985&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-789985 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------

Author: ASF GitHub Bot
Created on: 12/Jul/22 11:03
Start Date: 12/Jul/22 11:03
Worklog Time Spent: 10m

Work Description: steveloughran commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r918837493

## hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/bzip2/TestBZip2TextFileWriter.java:
## @@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.io.compress.bzip2;
+
+import static org.apache.hadoop.io.compress.bzip2.BZip2TextFileWriter.BLOCK_SIZE;
+import static org.junit.Assert.assertEquals;
+
+import java.io.ByteArrayInputStream;

Review Comment: thx.
Issue Time Tracking
-------------------
Worklog Id: (was: 789985)
Time Spent: 2h 40m (was: 2.5h)

> Fix when to read an additional record from a BZip2 text file split
> ------------------------------------------------------------------
>
> Key: HADOOP-18321
> URL: https://issues.apache.org/jira/browse/HADOOP-18321
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 3.3.3
> Reporter: Ashutosh Gupta
> Assignee: Ashutosh Gupta
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> Fix a data correctness issue with TextInputFormat that can occur when reading BZip2-compressed text files. When triggered, this bug causes a split to return the first record of the succeeding split (the split that reads the next BZip2 block), thereby duplicating that record.
>
> *When is the bug triggered?*
> The bug occurs when the flag "needAdditionalRecord" in CompressedSplitLineReader is set to true by #fillBuffer at an inappropriate time: before we have read the remaining bytes of the split. This can happen when the inDelimiter parameter is true when #fillBuffer is invoked while reading the next line. The inDelimiter parameter is true when either 1) the last byte of the buffer is a CR character ('\r') if using the default delimiters, or 2) the last bytes of the buffer are a common prefix of the delimiter if using a custom delimiter.
> This can occur in various edge cases, illustrated by the unit tests added in this change. Specifically, the five tests that would fail without the fix are:
> # BaseTestLineRecordReaderBZip2.customDelimiter_lastRecordDelimiterStartsAtNextBlockStart
> # BaseTestLineRecordReaderBZip2.firstBlockEndsWithLF_secondBlockStartsWithCR
> # BaseTestLineRecordReaderBZip2.delimitedByCRSpanningThreeBlocks
> # BaseTestLineRecordReaderBZip2.usingCRDelimiterWithSmallestBufferSize
> # BaseTestLineRecordReaderBZip2.customDelimiter_lastThreeBytesInBlockAreDelimiter
>
> For background, the purpose of the "needAdditionalRecord" field in CompressedSplitLineReader is to indicate to LineRecordReader, via the #needAdditionalRecordAfterSplit method, that an extra record lying beyond the split range should be included in the split. This complication arises from a problem inherent to splitting text files: when a split starts at a position greater than zero, we do not know whether its first line belongs to the last record of the prior split or is a new record. Hadoop's solution is to have every split that starts at a position greater than zero discard its first line, and to have the prior split decide whether to include the first line of the next split, either as part of its last record or as a new record. This works well even when a single line spans multiple splits.
>
> *What is the fix?*
> The fix is to never set "needAdditionalRecord" unless the bytes filled into the buffer are the bytes immediately outside the range of the split. When reading compressed data, CompressedSplitLineReader requires (and assumes) that the stream's #read method never returns bytes from more than one compression block at a time. This ensures that #fillBuffer gets invoked to read the first byte of the next block, which may or may not be part of the split we are reading. If we detect that the last bytes of the prior block may be part of a delimiter, we may decide to read an additional record, but we should only do so when this next block is not part of our split *and* we are not filling the buffer again beyond our split range. We are only concerned with whether the very first line of the next split needs to be read as a separate record: if it is going to be part of the last record, we do not need an extra record; and in the special case of CR + LF (i.e. "\r\n"), if the LF is the first byte of the next split, it is treated as an empty line, so we do not need an extra record either. In short, it is the read of the first bytes outside our split range that matters.
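The split-ownership convention described in the quoted issue (a split starting past position zero discards its first line; the prior split reads one extra line past its end) can be illustrated with a minimal, self-contained model. This is hypothetical sketch code, not the actual Hadoop LineRecordReader; class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the text-split convention: a split starting at a
// position > 0 discards its first (possibly partial) line, and a split keeps
// reading lines that start at or before its end offset, so the record that
// spans a split boundary is emitted exactly once, by the prior split.
public class SplitConventionSketch {

  // Return the index just past the line that starts at pos.
  static int afterLine(String data, int pos) {
    int nl = data.indexOf('\n', pos);
    return (nl < 0) ? data.length() : nl + 1;
  }

  // Read the records owned by the split [start, end) of data.
  static List<String> readSplit(String data, int start, int end) {
    List<String> records = new ArrayList<>();
    int pos = start;
    if (start > 0) {
      // Discard the first line; the prior split is responsible for it.
      pos = afterLine(data, pos);
    }
    // Keep reading while a line starts at or before end. The line starting
    // at (or spanning) the end offset is the "additional record" that the
    // succeeding split will discard.
    while (pos < data.length() && pos <= end) {
      int next = afterLine(data, pos);
      int stop = (data.charAt(next - 1) == '\n') ? next - 1 : next;
      records.add(data.substring(pos, stop));
      pos = next;
    }
    return records;
  }

  public static void main(String[] args) {
    String data = "alpha\nbravo\ncharlie\ndelta\n";
    // The boundary at byte 13 falls inside "charlie": the first split emits
    // it as its extra record and the second split skips its partial line.
    System.out.println(readSplit(data, 0, 13));   // prints [alpha, bravo, charlie]
    System.out.println(readSplit(data, 13, 26));  // prints [delta]
  }
}
```

The bug under discussion is precisely a violation of this ownership rule on the compressed path: the extra record must be taken only when the reader has actually reached the first bytes beyond the split range, otherwise a record is emitted by both splits.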
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=789669&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-789669 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------

Author: ASF GitHub Bot
Created on: 11/Jul/22 17:03
Start Date: 11/Jul/22 17:03
Worklog Time Spent: 10m

Work Description: ashutoshcipher commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r918164884

## hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/bzip2/TestBZip2TextFileWriter.java:
+import static org.apache.hadoop.io.compress.bzip2.BZip2TextFileWriter.BLOCK_SIZE;
+import static org.junit.Assert.assertEquals;
+
+import java.io.ByteArrayInputStream;

Review Comment: @steveloughran - I will file a JIRA and fix the imports.
Issue Time Tracking
-------------------
Worklog Id: (was: 789669)
Time Spent: 2.5h (was: 2h 20m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=789469&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-789469 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------

Author: ASF GitHub Bot
Created on: 11/Jul/22 11:08
Start Date: 11/Jul/22 11:08
Worklog Time Spent: 10m

Work Description: steveloughran commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r917810202

## hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/bzip2/TestBZip2TextFileWriter.java:
+import static org.apache.hadoop.io.compress.bzip2.BZip2TextFileWriter.BLOCK_SIZE;
+import static org.junit.Assert.assertEquals;
+
+import java.io.ByteArrayInputStream;

Review Comment: that file puts statics at the bottom. at least it should.
if it doesn't that's a bug

Issue Time Tracking
-------------------
Worklog Id: (was: 789469)
Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=788675&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-788675 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------

Author: ASF GitHub Bot
Created on: 07/Jul/22 16:58
Start Date: 07/Jul/22 16:58
Worklog Time Spent: 10m

Work Description: ashutoshcipher commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r916095242

## hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/bzip2/TestBZip2TextFileWriter.java:
+import static org.apache.hadoop.io.compress.bzip2.BZip2TextFileWriter.BLOCK_SIZE;
+import static org.junit.Assert.assertEquals;
+
+import java.io.ByteArrayInputStream;

Review Comment: Thanks @steveloughran for pointing it out.
I am using this for code formatting - https://github.com/apache/hadoop/blob/trunk/dev-support/code-formatter/hadoop_idea_formatter.xml

Issue Time Tracking
-------------------
Worklog Id: (was: 788675)
Time Spent: 2h 10m (was: 2h)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=788672&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-788672 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------

Author: ASF GitHub Bot
Created on: 07/Jul/22 16:53
Start Date: 07/Jul/22 16:53
Worklog Time Spent: 10m

Work Description: steveloughran commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r916090530

## hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/bzip2/TestBZip2TextFileWriter.java:
+import static org.apache.hadoop.io.compress.bzip2.BZip2TextFileWriter.BLOCK_SIZE;
+import static org.junit.Assert.assertEquals;
+
+import java.io.ByteArrayInputStream;

Review Comment: bit late, but the imports are completely out of sync with the normal hadoop rules. check your ide settings.
Issue Time Tracking
-------------------
Worklog Id: (was: 788672)
Time Spent: 2h (was: 1h 50m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=788175&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-788175 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------

Author: ASF GitHub Bot
Created on: 06/Jul/22 10:25
Start Date: 06/Jul/22 10:25
Worklog Time Spent: 10m

Work Description: ashutoshcipher commented on PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1176053540

Thanks @aajisaka @PrabhuJoseph and @saswata-dutta :)

Issue Time Tracking
-------------------
Worklog Id: (was: 788175)
Time Spent: 1h 50m (was: 1h 40m)
If we detect that the last bytes of the prior block maybe > part of a delimiter, then we may decide that we should read an additional > record, but we should only do that when this next block is not part of our > split *and* we aren't filling the buffer again beyond our split range. This > is because we are only concerned whether the we need to read the very first > line of the next split as a separate record. If it going to be part of the > last record, then we don't need to read an extra record, or in the special > case of CR + LF (i.e. "\r\n"), if the LF is the first byte of the next split, > it will be treated as an empty line, thus we don't need to include an extra > record into the mix. > Thus, to emphasize. It is when we read the first bytes outside our split > range that matters. But the current logic
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=788144=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-788144 ] ASF GitHub Bot logged work on HADOOP-18321: --- Author: ASF GitHub Bot Created on: 06/Jul/22 06:33 Start Date: 06/Jul/22 06:33 Worklog Time Spent: 10m Work Description: aajisaka commented on PR #4521: URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1175834804 My late +1. Thank you @PrabhuJoseph and @ashutoshcipher Issue Time Tracking --- Worklog Id: (was: 788144) Time Spent: 1h 40m (was: 1.5h) > Fix when to read an additional record from a BZip2 text file split > -- > > Key: HADOOP-18321 > URL: https://issues.apache.org/jira/browse/HADOOP-18321 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.3.3 >Reporter: Ashutosh Gupta >Assignee: Ashutosh Gupta >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Fix data correctness issue with TextInputFormat that can occur when reading > BZip2 compressed text files. When triggered this bug would cause a split to > return the first record of the succeeding split that reads the next BZip2 > block, thereby duplicating that record. > *When the bug is triggered?* > The condition for the bug to occur requires that the flag > "needAdditionalRecord" in CompressedSplitLineReader to be set to true by > #fillBuffer at an inappropriate time: when we haven't read the remaining > bytes of split. This can happen when the inDelimiter parameter is true while > #fillBuffer is invoked while reading the next line. The inDelimiter parameter > is true when either 1) the last byte of the buffer is a CR character ('\r') > if using the default delimiters, or 2) the last bytes of the buffer are a > common prefix of the delimiter if using a custom delimiter. 
> This can occur in various edge cases, illustrated by five unit tests added in > this change -- specifically the five that would fail without the fix are as > listed below: > # > BaseTestLineRecordReaderBZip2.customDelimiter_lastRecordDelimiterStartsAtNextBlockStart > # BaseTestLineRecordReaderBZip2.firstBlockEndsWithLF_secondBlockStartsWithCR > # BaseTestLineRecordReaderBZip2.delimitedByCRSpanningThreeBlocks > # BaseTestLineRecordReaderBZip2.usingCRDelimiterWithSmallestBufferSize > # > BaseTestLineRecordReaderBZip2.customDelimiter_lastThreeBytesInBlockAreDelimiter > For background, the purpose of "needAdditionalRecord" field in > CompressedSplitLineReader is to indicate to LineRecordReader via the > #needAdditionalRecordAfterSplit method that an extra record lying beyond the > split range should be included in the split. This complication arises due to > a problem when splitting text files. When a split starts at a position > greater than zero, we do not know whether the first line belongs to the last > record in the prior split or is a new record. The solution done in Hadoop is > to make splits that start at position greater than zero to always discard the > first line and then have the prior split decide whether it should include the > first line of the next split or not (as part of the last record or as a new > record). This works well even in the case of a single line spanning multiple > splits. > *What is the fix?* > The fix is to prevent ever setting "needAdditionalRecord" if the bytes filled > to the buffer are not the bytes immediately outside the range of the split. > When reading compressed data, CompressedSplitLineReader requires/assumes that > the stream's #read method never returns bytes from more than one compression > block at a time. This ensures that #fillBuffer gets invoked to read the first > byte of the next block. This next block may or may not be part of the split > we are reading. 
> If we detect that the last bytes of the prior block may be part of a delimiter, then we may decide to read an additional record, but we should only do so when this next block is not part of our split *and* we are not filling the buffer again beyond our split range. This is because we are only concerned with whether we need to read the very first line of the next split as a separate record. If it is going to be part of the last record, then we don't need to read an extra record; and in the special case of CR + LF (i.e. "\r\n"), if the LF is the first byte of the next split, it will be treated as an empty line, so we don't need to include an extra record either.
>
> Thus, to emphasize: it is the moment when we read the first bytes outside our split range that matters. But the current logic doesn't take that into account in CompressedSplitLineReader. This is in contrast to UncompressedSplitLineReader, which
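The split convention described in the background above (a split starting past position zero discards its first line; each split keeps reading one record past its end to cover it) can be modeled in a few lines. This is a simplified, hypothetical model of the uncompressed case, not Hadoop's LineRecordReader or CompressedSplitLineReader; the class and method names are invented, and it only illustrates why the convention yields each record exactly once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model (not Hadoop's actual reader) of splitting a
// newline-delimited file: a split starting past 0 skips to just after
// the next '\n', and a split keeps reading while a line *starts* at or
// before its end offset, thereby covering the line the next split discards.
public class SplitReadModel {
    static List<String> recordsForSplit(String data, int start, int end) {
        int pos = start;
        if (start > 0) {
            int i = data.indexOf('\n', start);
            if (i < 0) {
                return new ArrayList<>();   // no line starts in this split
            }
            pos = i + 1;                    // discard the first (possibly partial) line
        }
        List<String> recs = new ArrayList<>();
        while (pos <= end && pos < data.length()) {
            int i = data.indexOf('\n', pos);
            if (i < 0) {
                recs.add(data.substring(pos));
                pos = data.length();
            } else {
                recs.add(data.substring(pos, i));
                pos = i + 1;
            }
        }
        return recs;
    }

    public static void main(String[] args) {
        String data = "aaa\nbb\ncccc\nd\n";
        List<String> expected = Arrays.asList("aaa", "bb", "cccc", "d");
        // No matter where the file is cut into two splits, concatenating
        // the two splits' records yields every line exactly once.
        for (int cut = 1; cut < data.length(); cut++) {
            List<String> recs = new ArrayList<>(recordsForSplit(data, 0, cut));
            recs.addAll(recordsForSplit(data, cut, data.length()));
            if (!recs.equals(expected)) {
                throw new AssertionError("cut=" + cut + " -> " + recs);
            }
        }
        System.out.println("all cut points produce each record exactly once");
    }
}
```

The bug described in this issue is precisely a failure of this invariant in the compressed case: the extra record was sometimes claimed by a split that had not yet reached its end, so the same record was emitted by two splits.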
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=788135&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-788135 ]

ASF GitHub Bot logged work on HADOOP-18321:
---
Author: ASF GitHub Bot
Created on: 06/Jul/22 04:30
Start Date: 06/Jul/22 04:30
Worklog Time Spent: 10m
Work Description: PrabhuJoseph merged PR #4521:
URL: https://github.com/apache/hadoop/pull/4521

Issue Time Tracking
---
Worklog Id: (was: 788135)
Time Spent: 1.5h (was: 1h 20m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=788134&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-788134 ]

ASF GitHub Bot logged work on HADOOP-18321:
---
Author: ASF GitHub Bot
Created on: 06/Jul/22 04:29
Start Date: 06/Jul/22 04:29
Worklog Time Spent: 10m
Work Description: PrabhuJoseph commented on PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1175768212

Thanks @ashutoshcipher for the patch and @aajisaka for the review.

Issue Time Tracking
---
Worklog Id: (was: 788134)
Time Spent: 1h 20m (was: 1h 10m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=787856&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-787856 ]

ASF GitHub Bot logged work on HADOOP-18321:
---
Author: ASF GitHub Bot
Created on: 05/Jul/22 13:34
Start Date: 05/Jul/22 13:34
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1175068343

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 13m 6s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 9 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 14m 36s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 24m 49s | | trunk passed |
| +1 :green_heart: | compile | 23m 1s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 20m 34s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 4m 25s | | trunk passed |
| +1 :green_heart: | mvnsite | 3m 46s | | trunk passed |
| +1 :green_heart: | javadoc | 3m 1s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 2m 36s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 5m 23s | | trunk passed |
| +1 :green_heart: | shadedclient | 22m 25s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 28s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 47s | | the patch passed |
| +1 :green_heart: | compile | 22m 18s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 22m 18s | | the patch passed |
| +1 :green_heart: | compile | 20m 33s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 20m 33s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 4m 12s | | root: The patch generated 0 new + 167 unchanged - 1 fixed = 167 total (was 168) |
| +1 :green_heart: | mvnsite | 3m 44s | | the patch passed |
| +1 :green_heart: | javadoc | 2m 55s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 2m 36s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 5m 29s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 47s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 18m 52s | | hadoop-common in the patch passed. |
| +1 :green_heart: | unit | 7m 22s | | hadoop-mapreduce-client-core in the patch passed. |
| +1 :green_heart: | asflicense | 1m 37s | | The patch does not generate ASF License warnings. |
| | | 256m 57s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4521/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4521 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux cd47bfeba05f 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / a0f755ee92048b52d8bb2bb74e076758fcdc8fee |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results |
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=787770&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-787770 ]

ASF GitHub Bot logged work on HADOOP-18321:
---
Author: ASF GitHub Bot
Created on: 05/Jul/22 09:17
Start Date: 05/Jul/22 09:17
Worklog Time Spent: 10m
Work Description: ashutoshcipher commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r913576390

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReaderBZip2.java:

@@ -22,8 +22,13 @@ public final class TestLineRecordReaderBZip2 extends BaseTestLineRecordReaderBZip2 {

+  @Override
+  public void setUp() throws Exception {
+    super.setUp();
+  }

Review Comment: Thanks @aajisaka - I have addressed the above comment.

Issue Time Tracking
---
Worklog Id: (was: 787770)
Time Spent: 1h (was: 50m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=787768&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-787768 ]

ASF GitHub Bot logged work on HADOOP-18321:
---
Author: ASF GitHub Bot
Created on: 05/Jul/22 09:09
Start Date: 05/Jul/22 09:09
Worklog Time Spent: 10m
Work Description: aajisaka commented on PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1174815875

Other than the above comment, I'm +1 for this change. Background: This fix has been merged internally and working without any failure related to this fix in several months.

Issue Time Tracking
---
Worklog Id: (was: 787768)
Time Spent: 50m (was: 40m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=787766&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-787766 ]

ASF GitHub Bot logged work on HADOOP-18321:
---
Author: ASF GitHub Bot
Created on: 05/Jul/22 09:06
Start Date: 05/Jul/22 09:06
Worklog Time Spent: 10m
Work Description: aajisaka commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r913565934

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReaderBZip2.java:

@@ -22,8 +22,13 @@ public final class TestLineRecordReaderBZip2 extends BaseTestLineRecordReaderBZip2 {

+  @Override
+  public void setUp() throws Exception {
+    super.setUp();
+  }

Review Comment: I think the lines are not required for the checkstyle fix.

Issue Time Tracking
---
Worklog Id: (was: 787766)
Time Spent: 40m (was: 0.5h)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=786901=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-786901 ] ASF GitHub Bot logged work on HADOOP-18321: --- Author: ASF GitHub Bot Created on: 01/Jul/22 01:08 Start Date: 01/Jul/22 01:08 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on PR #4521: URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1171818052 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 46s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 9 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 35s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 26m 24s | | trunk passed | | +1 :green_heart: | compile | 24m 12s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | compile | 21m 21s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | checkstyle | 4m 8s | | trunk passed | | +1 :green_heart: | mvnsite | 2m 53s | | trunk passed | | +1 :green_heart: | javadoc | 1m 56s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javadoc | 1m 37s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 4m 28s | | trunk passed | | +1 :green_heart: | shadedclient | 21m 35s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 23s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 43s | | the patch passed | | +1 :green_heart: | compile | 22m 0s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javac | 22m 0s | | the patch passed | | +1 :green_heart: | compile | 19m 56s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | javac | 19m 56s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 5m 59s | | root: The patch generated 0 new + 167 unchanged - 1 fixed = 167 total (was 168) | | +1 :green_heart: | mvnsite | 2m 41s | | the patch passed | | +1 :green_heart: | javadoc | 1m 50s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 4m 24s | | the patch passed | | +1 :green_heart: | shadedclient | 21m 12s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 17m 59s | | hadoop-common in the patch passed. | | +1 :green_heart: | unit | 6m 48s | | hadoop-mapreduce-client-core in the patch passed. | | +1 :green_heart: | asflicense | 1m 6s | | The patch does not generate ASF License warnings. 
| | | | 234m 42s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4521/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4521 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 58bc6c9ef44a 4.15.0-169-generic #177-Ubuntu SMP Thu Feb 3 10:50:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 7a6efbb0b23d81ae74f7f9a9ca9c87ea12e962d5 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results |
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=786815=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-786815 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 30/Jun/22 20:12
Start Date: 30/Jun/22 20:12
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1171632416

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 41s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 9 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 14m 40s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 25m 28s | | trunk passed |
| +1 :green_heart: | compile | 24m 33s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 21m 6s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 3m 50s | | trunk passed |
| +1 :green_heart: | mvnsite | 2m 36s | | trunk passed |
| +1 :green_heart: | javadoc | 2m 0s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 28s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 4m 21s | | trunk passed |
| +1 :green_heart: | shadedclient | 21m 11s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 29s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 30s | | the patch passed |
| +1 :green_heart: | compile | 22m 28s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 22m 28s | | the patch passed |
| +1 :green_heart: | compile | 21m 10s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 21m 10s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 3m 35s | [/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4521/1/artifact/out/results-checkstyle-root.txt) | root: The patch generated 13 new + 167 unchanged - 1 fixed = 180 total (was 168) |
| +1 :green_heart: | mvnsite | 2m 35s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 49s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 4m 26s | | the patch passed |
| +1 :green_heart: | shadedclient | 21m 23s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 18m 18s | | hadoop-common in the patch passed. |
| +1 :green_heart: | unit | 7m 10s | | hadoop-mapreduce-client-core in the patch passed. |
| +1 :green_heart: | asflicense | 0m 59s | | The patch does not generate ASF License warnings. |
| | | | 232m 29s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4521/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4521 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 411a69ccdf05 4.15.0-169-generic #177-Ubuntu SMP Thu Feb 3 10:50:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 9261ada7d157fdabc4f6c66149196404d79f7d54 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=786695=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-786695 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 30/Jun/22 16:18
Start Date: 30/Jun/22 16:18
Worklog Time Spent: 10m

Work Description: ashutoshcipher opened a new pull request, #4521:
URL: https://github.com/apache/hadoop/pull/4521

### Description of PR

Fix when to read an additional record from a BZip2 text file split.

JIRA - HADOOP-18321

### How was this patch tested?

Added unit tests.

### For code changes:

- [X] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?

Issue Time Tracking
-------------------
Worklog Id: (was: 786695)
Remaining Estimate: 0h
Time Spent: 10m

> Fix when to read an additional record from a BZip2 text file split
> ------------------------------------------------------------------
>
>                 Key: HADOOP-18321
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18321
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.3.3
>            Reporter: Ashutosh Gupta
>            Assignee: Ashutosh Gupta
>            Priority: Critical
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fix data correctness issue with TextInputFormat that can occur when reading
> BZip2 compressed text files. When triggered, this bug would cause a split to
> return the first record of the succeeding split that reads the next BZip2
> block, thereby duplicating that record.
> *When is the bug triggered?*
> The bug occurs when the flag "needAdditionalRecord" in
> CompressedSplitLineReader is set to true by #fillBuffer at an inappropriate
> time: before we have read the remaining bytes of the split. This can happen
> when #fillBuffer is invoked with the inDelimiter parameter set to true while
> reading the next line. The inDelimiter parameter is true when either 1) the
> last byte of the buffer is a CR character ('\r') when using the default
> delimiters, or 2) the last bytes of the buffer are a common prefix of the
> delimiter when using a custom delimiter.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
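The two inDelimiter conditions described in the issue can be sketched as follows. This is a minimal, hypothetical illustration of the buffer-tail checks, not the actual CompressedSplitLineReader code; the class and method names are invented for this example.

```java
// Hypothetical sketch (NOT the actual Hadoop code): shows when a buffer's
// tail leaves a line reader "inside" a delimiter at a buffer boundary.
public class InDelimiterSketch {

    // Default delimiters (LF, CR, CRLF): the reader is "in" a delimiter when
    // the buffer ends with CR, because it cannot yet tell whether an LF
    // follows in the next buffer fill.
    static boolean endsInsideDefaultDelimiter(byte[] buf, int len) {
        return len > 0 && buf[len - 1] == '\r';
    }

    // Custom delimiter: the reader is "in" a delimiter when the buffer's tail
    // is a non-empty proper prefix of the delimiter, since the rest of the
    // delimiter may arrive with the next fill.
    static boolean endsInsideCustomDelimiter(byte[] buf, int len, byte[] delim) {
        // Try the longest possible proper prefix first, then shorter ones.
        for (int p = Math.min(len, delim.length - 1); p > 0; p--) {
            boolean match = true;
            for (int i = 0; i < p; i++) {
                if (buf[len - p + i] != delim[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] crTail = "record1\r".getBytes();
        // Ends with '\r': ambiguous until the next fill shows whether '\n' follows.
        System.out.println(endsInsideDefaultDelimiter(crTail, crTail.length));   // true

        byte[] prefixTail = "record1@@".getBytes();
        byte[] delim = "@@@".getBytes();
        // Ends with "@@", a proper prefix of the custom delimiter "@@@".
        System.out.println(endsInsideCustomDelimiter(prefixTail, prefixTail.length, delim)); // true

        byte[] clean = "record1".getBytes();
        System.out.println(endsInsideDefaultDelimiter(clean, clean.length));     // false
    }
}
```

The bug the patch addresses is that these conditions were consulted even when the split's remaining bytes had not yet been consumed, causing the reader to fetch one extra record already belonging to the next split.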