[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=789985&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-789985 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------

Author: ASF GitHub Bot
Created on: 12/Jul/22 11:03
Start Date: 12/Jul/22 11:03
Worklog Time Spent: 10m

Work Description: steveloughran commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r918837493

## hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/bzip2/TestBZip2TextFileWriter.java:
## @@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.io.compress.bzip2;
+
+import static org.apache.hadoop.io.compress.bzip2.BZip2TextFileWriter.BLOCK_SIZE;
+import static org.junit.Assert.assertEquals;
+
+import java.io.ByteArrayInputStream;

Review Comment: thx.
Issue Time Tracking
-------------------
Worklog Id: (was: 789985)
Time Spent: 2h 40m (was: 2.5h)

> Fix when to read an additional record from a BZip2 text file split
> ------------------------------------------------------------------
>
> Key: HADOOP-18321
> URL: https://issues.apache.org/jira/browse/HADOOP-18321
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 3.3.3
> Reporter: Ashutosh Gupta
> Assignee: Ashutosh Gupta
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> Fix a data correctness issue with TextInputFormat that can occur when reading BZip2-compressed text files. When triggered, this bug causes a split to return the first record of the succeeding split (the split that reads the next BZip2 block), thereby duplicating that record.
>
> *When is the bug triggered?*
> The bug occurs when the flag "needAdditionalRecord" in CompressedSplitLineReader is set to true by #fillBuffer at an inappropriate time: before we have read the remaining bytes of the split. This can happen when the inDelimiter parameter is true when #fillBuffer is invoked while reading the next line. The inDelimiter parameter is true when either 1) the last byte of the buffer is a CR character ('\r') if using the default delimiters, or 2) the last bytes of the buffer are a common prefix of the delimiter if using a custom delimiter.
> This can occur in various edge cases, illustrated by the unit tests added in this change. Specifically, the five tests that would fail without the fix are:
> # BaseTestLineRecordReaderBZip2.customDelimiter_lastRecordDelimiterStartsAtNextBlockStart
> # BaseTestLineRecordReaderBZip2.firstBlockEndsWithLF_secondBlockStartsWithCR
> # BaseTestLineRecordReaderBZip2.delimitedByCRSpanningThreeBlocks
> # BaseTestLineRecordReaderBZip2.usingCRDelimiterWithSmallestBufferSize
> # BaseTestLineRecordReaderBZip2.customDelimiter_lastThreeBytesInBlockAreDelimiter
>
> For background, the purpose of the "needAdditionalRecord" field in CompressedSplitLineReader is to indicate to LineRecordReader, via the #needAdditionalRecordAfterSplit method, that an extra record lying beyond the split range should be included in the split. This complication arises from a problem inherent to splitting text files: when a split starts at a position greater than zero, we do not know whether its first line belongs to the last record of the prior split or is a new record. Hadoop's solution is to have every split that starts at a position greater than zero discard its first line, and to have the prior split decide whether to include the first line of the next split, either as part of its last record or as a new record. This works well even when a single line spans multiple splits.
>
> *What is the fix?*
> The fix is to never set "needAdditionalRecord" unless the bytes filled into the buffer are the bytes immediately outside the range of the split. When reading compressed data, CompressedSplitLineReader requires (and assumes) that the stream's #read method never returns bytes from more than one compression block at a time. This ensures that #fillBuffer gets invoked to read the first byte of the next block, which may or may not be part of the split we are reading. If we detect that the last bytes of the prior block may be part of a delimiter, we may decide to read an additional record, but we should only do so when this next block is not part of our split *and* we are not filling the buffer again beyond our split range. We are only concerned with whether the very first line of the next split needs to be read as a separate record: if it is going to be part of the last record, we do not need an extra record; and in the special case of CR + LF (i.e. "\r\n"), if the LF is the first byte of the next split, it is treated as an empty line, so we do not need an extra record either. In short, it is the read of the first bytes outside our split range that matters.
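The split-ownership convention described in the quoted issue (a split starting past position zero discards its first line; the prior split reads one extra line past its end) can be illustrated with a minimal, self-contained model. This is hypothetical sketch code, not the actual Hadoop LineRecordReader; class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the text-split convention: a split starting at a
// position > 0 discards its first (possibly partial) line, and a split keeps
// reading lines that start at or before its end offset, so the record that
// spans a split boundary is emitted exactly once, by the prior split.
public class SplitConventionSketch {

  // Return the index just past the line that starts at pos.
  static int afterLine(String data, int pos) {
    int nl = data.indexOf('\n', pos);
    return (nl < 0) ? data.length() : nl + 1;
  }

  // Read the records owned by the split [start, end) of data.
  static List<String> readSplit(String data, int start, int end) {
    List<String> records = new ArrayList<>();
    int pos = start;
    if (start > 0) {
      // Discard the first line; the prior split is responsible for it.
      pos = afterLine(data, pos);
    }
    // Keep reading while a line starts at or before end. The line starting
    // at (or spanning) the end offset is the "additional record" that the
    // succeeding split will discard.
    while (pos < data.length() && pos <= end) {
      int next = afterLine(data, pos);
      int stop = (data.charAt(next - 1) == '\n') ? next - 1 : next;
      records.add(data.substring(pos, stop));
      pos = next;
    }
    return records;
  }

  public static void main(String[] args) {
    String data = "alpha\nbravo\ncharlie\ndelta\n";
    // The boundary at byte 13 falls inside "charlie": the first split emits
    // it as its extra record and the second split skips its partial line.
    System.out.println(readSplit(data, 0, 13));   // prints [alpha, bravo, charlie]
    System.out.println(readSplit(data, 13, 26));  // prints [delta]
  }
}
```

The bug under discussion is precisely a violation of this ownership rule on the compressed path: the extra record must be taken only when the reader has actually reached the first bytes beyond the split range, otherwise a record is emitted by both splits.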
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=789669&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-789669 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------

Author: ASF GitHub Bot
Created on: 11/Jul/22 17:03
Start Date: 11/Jul/22 17:03
Worklog Time Spent: 10m

Work Description: ashutoshcipher commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r918164884

## hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/bzip2/TestBZip2TextFileWriter.java:
+import static org.apache.hadoop.io.compress.bzip2.BZip2TextFileWriter.BLOCK_SIZE;
+import static org.junit.Assert.assertEquals;
+
+import java.io.ByteArrayInputStream;

Review Comment: @steveloughran - I will file a JIRA and fix the imports.
Issue Time Tracking
-------------------
Worklog Id: (was: 789669)
Time Spent: 2.5h (was: 2h 20m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=789469&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-789469 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------

Author: ASF GitHub Bot
Created on: 11/Jul/22 11:08
Start Date: 11/Jul/22 11:08
Worklog Time Spent: 10m

Work Description: steveloughran commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r917810202

## hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/bzip2/TestBZip2TextFileWriter.java:
+import static org.apache.hadoop.io.compress.bzip2.BZip2TextFileWriter.BLOCK_SIZE;
+import static org.junit.Assert.assertEquals;
+
+import java.io.ByteArrayInputStream;

Review Comment: that file puts statics at the bottom. at least it should.
if it doesn't that's a bug

Issue Time Tracking
-------------------
Worklog Id: (was: 789469)
Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=788675&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-788675 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------

Author: ASF GitHub Bot
Created on: 07/Jul/22 16:58
Start Date: 07/Jul/22 16:58
Worklog Time Spent: 10m

Work Description: ashutoshcipher commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r916095242

## hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/bzip2/TestBZip2TextFileWriter.java:
+import static org.apache.hadoop.io.compress.bzip2.BZip2TextFileWriter.BLOCK_SIZE;
+import static org.junit.Assert.assertEquals;
+
+import java.io.ByteArrayInputStream;

Review Comment: Thanks @steveloughran for pointing it out.
I am using this for code formatting - https://github.com/apache/hadoop/blob/trunk/dev-support/code-formatter/hadoop_idea_formatter.xml

Issue Time Tracking
-------------------
Worklog Id: (was: 788675)
Time Spent: 2h 10m (was: 2h)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=788672&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-788672 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------

Author: ASF GitHub Bot
Created on: 07/Jul/22 16:53
Start Date: 07/Jul/22 16:53
Worklog Time Spent: 10m

Work Description: steveloughran commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r916090530

## hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/compress/bzip2/TestBZip2TextFileWriter.java:
+import static org.apache.hadoop.io.compress.bzip2.BZip2TextFileWriter.BLOCK_SIZE;
+import static org.junit.Assert.assertEquals;
+
+import java.io.ByteArrayInputStream;

Review Comment: bit late, but the imports are completely out of sync with the normal hadoop rules. check your ide settings.
Issue Time Tracking
-------------------
Worklog Id: (was: 788672)
Time Spent: 2h (was: 1h 50m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=788175&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-788175 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------

Author: ASF GitHub Bot
Created on: 06/Jul/22 10:25
Start Date: 06/Jul/22 10:25
Worklog Time Spent: 10m

Work Description: ashutoshcipher commented on PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1176053540

Thanks @aajisaka @PrabhuJoseph and @saswata-dutta :)

Issue Time Tracking
-------------------
Worklog Id: (was: 788175)
Time Spent: 1h 50m (was: 1h 40m)
If we detect that the last bytes of the prior block maybe > part of a delimiter, then we may decide that we should read an additional > record, but we should only do that when this next block is not part of our > split *and* we aren't filling the buffer again beyond our split range. This > is because we are only concerned whether the we need to read the very first > line of the next split as a separate record. If it going to be part of the > last record, then we don't need to read an extra record, or in the special > case of CR + LF (i.e. "\r\n"), if the LF is the first byte of the next split, > it will be treated as an empty line, thus we don't need to include an extra > record into the mix. > Thus, to emphasize. It is when we read the first bytes outside our split > range that matters. But the current logic
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=788144=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-788144 ] ASF GitHub Bot logged work on HADOOP-18321: --- Author: ASF GitHub Bot Created on: 06/Jul/22 06:33 Start Date: 06/Jul/22 06:33 Worklog Time Spent: 10m Work Description: aajisaka commented on PR #4521: URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1175834804 My late +1. Thank you @PrabhuJoseph and @ashutoshcipher Issue Time Tracking --- Worklog Id: (was: 788144) Time Spent: 1h 40m (was: 1.5h) > Fix when to read an additional record from a BZip2 text file split > -- > > Key: HADOOP-18321 > URL: https://issues.apache.org/jira/browse/HADOOP-18321 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.3.3 >Reporter: Ashutosh Gupta >Assignee: Ashutosh Gupta >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Fix data correctness issue with TextInputFormat that can occur when reading > BZip2 compressed text files. When triggered this bug would cause a split to > return the first record of the succeeding split that reads the next BZip2 > block, thereby duplicating that record. > *When the bug is triggered?* > The condition for the bug to occur requires that the flag > "needAdditionalRecord" in CompressedSplitLineReader to be set to true by > #fillBuffer at an inappropriate time: when we haven't read the remaining > bytes of split. This can happen when the inDelimiter parameter is true while > #fillBuffer is invoked while reading the next line. The inDelimiter parameter > is true when either 1) the last byte of the buffer is a CR character ('\r') > if using the default delimiters, or 2) the last bytes of the buffer are a > common prefix of the delimiter if using a custom delimiter. 
> This can occur in various edge cases, illustrated by five unit tests added in > this change -- specifically the five that would fail without the fix are as > listed below: > # > BaseTestLineRecordReaderBZip2.customDelimiter_lastRecordDelimiterStartsAtNextBlockStart > # BaseTestLineRecordReaderBZip2.firstBlockEndsWithLF_secondBlockStartsWithCR > # BaseTestLineRecordReaderBZip2.delimitedByCRSpanningThreeBlocks > # BaseTestLineRecordReaderBZip2.usingCRDelimiterWithSmallestBufferSize > # > BaseTestLineRecordReaderBZip2.customDelimiter_lastThreeBytesInBlockAreDelimiter > For background, the purpose of "needAdditionalRecord" field in > CompressedSplitLineReader is to indicate to LineRecordReader via the > #needAdditionalRecordAfterSplit method that an extra record lying beyond the > split range should be included in the split. This complication arises due to > a problem when splitting text files. When a split starts at a position > greater than zero, we do not know whether the first line belongs to the last > record in the prior split or is a new record. The solution done in Hadoop is > to make splits that start at position greater than zero to always discard the > first line and then have the prior split decide whether it should include the > first line of the next split or not (as part of the last record or as a new > record). This works well even in the case of a single line spanning multiple > splits. > *What is the fix?* > The fix is to prevent ever setting "needAdditionalRecord" if the bytes filled > to the buffer are not the bytes immediately outside the range of the split. > When reading compressed data, CompressedSplitLineReader requires/assumes that > the stream's #read method never returns bytes from more than one compression > block at a time. This ensures that #fillBuffer gets invoked to read the first > byte of the next block. This next block may or may not be part of the split > we are reading. 
> If we detect that the last bytes of the prior block may be part of a delimiter, then we may decide to read an additional record, but we should only do so when this next block is not part of our split *and* we are not filling the buffer again beyond our split range. This is because we are only concerned with whether we need to read the very first line of the next split as a separate record. If it is going to be part of the last record, then we don't need to read an extra record; and in the special case of CR + LF (i.e. "\r\n"), if the LF is the first byte of the next split, it will be treated as an empty line, so we don't need to include an extra record either.
>
> Thus, to emphasize: it is the moment when we read the first bytes outside our split range that matters. But the current logic doesn't take that into account in CompressedSplitLineReader. This is in contrast to UncompressedSplitLineReader, which
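The split convention described in the background above (a split starting past position zero discards its first line; each split keeps reading one record past its end to cover it) can be modeled in a few lines. This is a simplified, hypothetical model of the uncompressed case, not Hadoop's LineRecordReader or CompressedSplitLineReader; the class and method names are invented, and it only illustrates why the convention yields each record exactly once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model (not Hadoop's actual reader) of splitting a
// newline-delimited file: a split starting past 0 skips to just after
// the next '\n', and a split keeps reading while a line *starts* at or
// before its end offset, thereby covering the line the next split discards.
public class SplitReadModel {
    static List<String> recordsForSplit(String data, int start, int end) {
        int pos = start;
        if (start > 0) {
            int i = data.indexOf('\n', start);
            if (i < 0) {
                return new ArrayList<>();   // no line starts in this split
            }
            pos = i + 1;                    // discard the first (possibly partial) line
        }
        List<String> recs = new ArrayList<>();
        while (pos <= end && pos < data.length()) {
            int i = data.indexOf('\n', pos);
            if (i < 0) {
                recs.add(data.substring(pos));
                pos = data.length();
            } else {
                recs.add(data.substring(pos, i));
                pos = i + 1;
            }
        }
        return recs;
    }

    public static void main(String[] args) {
        String data = "aaa\nbb\ncccc\nd\n";
        List<String> expected = Arrays.asList("aaa", "bb", "cccc", "d");
        // No matter where the file is cut into two splits, concatenating
        // the two splits' records yields every line exactly once.
        for (int cut = 1; cut < data.length(); cut++) {
            List<String> recs = new ArrayList<>(recordsForSplit(data, 0, cut));
            recs.addAll(recordsForSplit(data, cut, data.length()));
            if (!recs.equals(expected)) {
                throw new AssertionError("cut=" + cut + " -> " + recs);
            }
        }
        System.out.println("all cut points produce each record exactly once");
    }
}
```

The bug described in this issue is precisely a failure of this invariant in the compressed case: the extra record was sometimes claimed by a split that had not yet reached its end, so the same record was emitted by two splits.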
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=788135&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-788135 ]

ASF GitHub Bot logged work on HADOOP-18321:
---
Author: ASF GitHub Bot
Created on: 06/Jul/22 04:30
Start Date: 06/Jul/22 04:30
Worklog Time Spent: 10m
Work Description: PrabhuJoseph merged PR #4521:
URL: https://github.com/apache/hadoop/pull/4521

Issue Time Tracking
---
Worklog Id: (was: 788135)
Time Spent: 1.5h (was: 1h 20m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=788134&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-788134 ]

ASF GitHub Bot logged work on HADOOP-18321:
---
Author: ASF GitHub Bot
Created on: 06/Jul/22 04:29
Start Date: 06/Jul/22 04:29
Worklog Time Spent: 10m
Work Description: PrabhuJoseph commented on PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1175768212

Thanks @ashutoshcipher for the patch and @aajisaka for the review.

Issue Time Tracking
---
Worklog Id: (was: 788134)
Time Spent: 1h 20m (was: 1h 10m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=787856&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-787856 ]

ASF GitHub Bot logged work on HADOOP-18321:
---
Author: ASF GitHub Bot
Created on: 05/Jul/22 13:34
Start Date: 05/Jul/22 13:34
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1175068343

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 13m 6s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 9 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 14m 36s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 24m 49s | | trunk passed |
| +1 :green_heart: | compile | 23m 1s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 20m 34s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 4m 25s | | trunk passed |
| +1 :green_heart: | mvnsite | 3m 46s | | trunk passed |
| +1 :green_heart: | javadoc | 3m 1s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 2m 36s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 5m 23s | | trunk passed |
| +1 :green_heart: | shadedclient | 22m 25s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 28s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 47s | | the patch passed |
| +1 :green_heart: | compile | 22m 18s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 22m 18s | | the patch passed |
| +1 :green_heart: | compile | 20m 33s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 20m 33s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 4m 12s | | root: The patch generated 0 new + 167 unchanged - 1 fixed = 167 total (was 168) |
| +1 :green_heart: | mvnsite | 3m 44s | | the patch passed |
| +1 :green_heart: | javadoc | 2m 55s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 2m 36s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 5m 29s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 47s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 18m 52s | | hadoop-common in the patch passed. |
| +1 :green_heart: | unit | 7m 22s | | hadoop-mapreduce-client-core in the patch passed. |
| +1 :green_heart: | asflicense | 1m 37s | | The patch does not generate ASF License warnings. |
| | | 256m 57s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4521/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4521 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux cd47bfeba05f 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / a0f755ee92048b52d8bb2bb74e076758fcdc8fee |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results |
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=787770&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-787770 ]

ASF GitHub Bot logged work on HADOOP-18321:
---
Author: ASF GitHub Bot
Created on: 05/Jul/22 09:17
Start Date: 05/Jul/22 09:17
Worklog Time Spent: 10m
Work Description: ashutoshcipher commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r913576390

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReaderBZip2.java:

@@ -22,8 +22,13 @@ public final class TestLineRecordReaderBZip2 extends BaseTestLineRecordReaderBZip2 {

+  @Override
+  public void setUp() throws Exception {
+    super.setUp();
+  }

Review Comment: Thanks @aajisaka - I have addressed the above comment.

Issue Time Tracking
---
Worklog Id: (was: 787770)
Time Spent: 1h (was: 50m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=787768&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-787768 ]

ASF GitHub Bot logged work on HADOOP-18321:
---
Author: ASF GitHub Bot
Created on: 05/Jul/22 09:09
Start Date: 05/Jul/22 09:09
Worklog Time Spent: 10m
Work Description: aajisaka commented on PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1174815875

Other than the above comment, I'm +1 for this change. Background: This fix has been merged internally and working without any failure related to this fix in several months.

Issue Time Tracking
---
Worklog Id: (was: 787768)
Time Spent: 50m (was: 40m)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=787766&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-787766 ]

ASF GitHub Bot logged work on HADOOP-18321:
---
Author: ASF GitHub Bot
Created on: 05/Jul/22 09:06
Start Date: 05/Jul/22 09:06
Worklog Time Spent: 10m
Work Description: aajisaka commented on code in PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#discussion_r913565934

## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReaderBZip2.java:

@@ -22,8 +22,13 @@ public final class TestLineRecordReaderBZip2 extends BaseTestLineRecordReaderBZip2 {

+  @Override
+  public void setUp() throws Exception {
+    super.setUp();
+  }

Review Comment: I think the lines are not required for the checkstyle fix.

Issue Time Tracking
---
Worklog Id: (was: 787766)
Time Spent: 40m (was: 0.5h)
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=786901=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-786901 ] ASF GitHub Bot logged work on HADOOP-18321: --- Author: ASF GitHub Bot Created on: 01/Jul/22 01:08 Start Date: 01/Jul/22 01:08 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on PR #4521: URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1171818052 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 46s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 9 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 35s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 26m 24s | | trunk passed | | +1 :green_heart: | compile | 24m 12s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | compile | 21m 21s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | checkstyle | 4m 8s | | trunk passed | | +1 :green_heart: | mvnsite | 2m 53s | | trunk passed | | +1 :green_heart: | javadoc | 1m 56s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javadoc | 1m 37s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 4m 28s | | trunk passed | | +1 :green_heart: | shadedclient | 21m 35s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 23s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 43s | | the patch passed | | +1 :green_heart: | compile | 22m 0s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javac | 22m 0s | | the patch passed | | +1 :green_heart: | compile | 19m 56s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | javac | 19m 56s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 5m 59s | | root: The patch generated 0 new + 167 unchanged - 1 fixed = 167 total (was 168) | | +1 :green_heart: | mvnsite | 2m 41s | | the patch passed | | +1 :green_heart: | javadoc | 1m 50s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 4m 24s | | the patch passed | | +1 :green_heart: | shadedclient | 21m 12s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 17m 59s | | hadoop-common in the patch passed. | | +1 :green_heart: | unit | 6m 48s | | hadoop-mapreduce-client-core in the patch passed. | | +1 :green_heart: | asflicense | 1m 6s | | The patch does not generate ASF License warnings. 
| | | | 234m 42s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4521/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4521 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 58bc6c9ef44a 4.15.0-169-generic #177-Ubuntu SMP Thu Feb 3 10:50:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 7a6efbb0b23d81ae74f7f9a9ca9c87ea12e962d5 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results |
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=786815=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-786815 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 30/Jun/22 20:12
Start Date: 30/Jun/22 20:12
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1171632416

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 41s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 9 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 14m 40s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 25m 28s | | trunk passed |
| +1 :green_heart: | compile | 24m 33s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 21m 6s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 3m 50s | | trunk passed |
| +1 :green_heart: | mvnsite | 2m 36s | | trunk passed |
| +1 :green_heart: | javadoc | 2m 0s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 28s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 4m 21s | | trunk passed |
| +1 :green_heart: | shadedclient | 21m 11s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 29s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 30s | | the patch passed |
| +1 :green_heart: | compile | 22m 28s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 22m 28s | | the patch passed |
| +1 :green_heart: | compile | 21m 10s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 21m 10s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 3m 35s | [/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4521/1/artifact/out/results-checkstyle-root.txt) | root: The patch generated 13 new + 167 unchanged - 1 fixed = 180 total (was 168) |
| +1 :green_heart: | mvnsite | 2m 35s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 49s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 4m 26s | | the patch passed |
| +1 :green_heart: | shadedclient | 21m 23s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 18m 18s | | hadoop-common in the patch passed. |
| +1 :green_heart: | unit | 7m 10s | | hadoop-mapreduce-client-core in the patch passed. |
| +1 :green_heart: | asflicense | 0m 59s | | The patch does not generate ASF License warnings. |
| | | | 232m 29s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4521/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4521 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 411a69ccdf05 4.15.0-169-generic #177-Ubuntu SMP Thu Feb 3 10:50:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 9261ada7d157fdabc4f6c66149196404d79f7d54 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private
[jira] [Work logged] (HADOOP-18321) Fix when to read an additional record from a BZip2 text file split
[ https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=786695=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-786695 ]

ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 30/Jun/22 16:18
Start Date: 30/Jun/22 16:18
Worklog Time Spent: 10m

Work Description: ashutoshcipher opened a new pull request, #4521:
URL: https://github.com/apache/hadoop/pull/4521

### Description of PR

Fix when to read an additional record from a BZip2 text file split.

JIRA - HADOOP-18321

### How was this patch tested?

Added unit tests.

### For code changes:

- [X] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?

Issue Time Tracking
-------------------
Worklog Id: (was: 786695)
Remaining Estimate: 0h
Time Spent: 10m

> Fix when to read an additional record from a BZip2 text file split
> ------------------------------------------------------------------
>
>                 Key: HADOOP-18321
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18321
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.3.3
>            Reporter: Ashutosh Gupta
>            Assignee: Ashutosh Gupta
>            Priority: Critical
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fix data correctness issue with TextInputFormat that can occur when reading
> BZip2 compressed text files. When triggered, this bug would cause a split to
> return the first record of the succeeding split that reads the next BZip2
> block, thereby duplicating that record.
> *When is the bug triggered?*
> The bug occurs when the flag "needAdditionalRecord" in
> CompressedSplitLineReader is set to true by #fillBuffer at an inappropriate
> time: before we have read the remaining bytes of the split. This can happen
> when #fillBuffer is invoked with the inDelimiter parameter set to true while
> reading the next line. The inDelimiter parameter is true when either 1) the
> last byte of the buffer is a CR character ('\r') when using the default
> delimiters, or 2) the last bytes of the buffer are a common prefix of the
> delimiter when using a custom delimiter.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
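The two inDelimiter conditions described in the issue can be sketched as follows. This is a minimal, hypothetical illustration of the buffer-tail checks, not the actual CompressedSplitLineReader code; the class and method names are invented for this example.

```java
// Hypothetical sketch (NOT the actual Hadoop code): shows when a buffer's
// tail leaves a line reader "inside" a delimiter at a buffer boundary.
public class InDelimiterSketch {

    // Default delimiters (LF, CR, CRLF): the reader is "in" a delimiter when
    // the buffer ends with CR, because it cannot yet tell whether an LF
    // follows in the next buffer fill.
    static boolean endsInsideDefaultDelimiter(byte[] buf, int len) {
        return len > 0 && buf[len - 1] == '\r';
    }

    // Custom delimiter: the reader is "in" a delimiter when the buffer's tail
    // is a non-empty proper prefix of the delimiter, since the rest of the
    // delimiter may arrive with the next fill.
    static boolean endsInsideCustomDelimiter(byte[] buf, int len, byte[] delim) {
        // Try the longest possible proper prefix first, then shorter ones.
        for (int p = Math.min(len, delim.length - 1); p > 0; p--) {
            boolean match = true;
            for (int i = 0; i < p; i++) {
                if (buf[len - p + i] != delim[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] crTail = "record1\r".getBytes();
        // Ends with '\r': ambiguous until the next fill shows whether '\n' follows.
        System.out.println(endsInsideDefaultDelimiter(crTail, crTail.length));   // true

        byte[] prefixTail = "record1@@".getBytes();
        byte[] delim = "@@@".getBytes();
        // Ends with "@@", a proper prefix of the custom delimiter "@@@".
        System.out.println(endsInsideCustomDelimiter(prefixTail, prefixTail.length, delim)); // true

        byte[] clean = "record1".getBytes();
        System.out.println(endsInsideDefaultDelimiter(clean, clean.length));     // false
    }
}
```

The bug the patch addresses is that these conditions were consulted even when the split's remaining bytes had not yet been consumed, causing the reader to fetch one extra record already belonging to the next split.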