[
https://issues.apache.org/jira/browse/HADOOP-18321?focusedWorklogId=786815&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-786815
]
ASF GitHub Bot logged work on HADOOP-18321:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 30/Jun/22 20:12
Start Date: 30/Jun/22 20:12
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on PR #4521:
URL: https://github.com/apache/hadoop/pull/4521#issuecomment-1171632416
:confetti_ball: **+1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 41s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 9 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 14m 40s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 25m 28s | | trunk passed |
| +1 :green_heart: | compile | 24m 33s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 21m 6s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 3m 50s | | trunk passed |
| +1 :green_heart: | mvnsite | 2m 36s | | trunk passed |
| +1 :green_heart: | javadoc | 2m 0s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 28s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 4m 21s | | trunk passed |
| +1 :green_heart: | shadedclient | 21m 11s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 29s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 30s | | the patch passed |
| +1 :green_heart: | compile | 22m 28s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 22m 28s | | the patch passed |
| +1 :green_heart: | compile | 21m 10s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 21m 10s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 3m 35s | [/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4521/1/artifact/out/results-checkstyle-root.txt) | root: The patch generated 13 new + 167 unchanged - 1 fixed = 180 total (was 168) |
| +1 :green_heart: | mvnsite | 2m 35s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 49s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 4m 26s | | the patch passed |
| +1 :green_heart: | shadedclient | 21m 23s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 18m 18s | | hadoop-common in the patch passed. |
| +1 :green_heart: | unit | 7m 10s | | hadoop-mapreduce-client-core in the patch passed. |
| +1 :green_heart: | asflicense | 0m 59s | | The patch does not generate ASF License warnings. |
| | | 232m 29s | | |
| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4521/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4521 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 411a69ccdf05 4.15.0-169-generic #177-Ubuntu SMP Thu Feb 3 10:50:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 9261ada7d157fdabc4f6c66149196404d79f7d54 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4521/1/testReport/ |
| Max. process+thread count | 3153 (vs. ulimit of 5500) |
| modules | C: hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: . |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4521/1/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
This message was automatically generated.
Issue Time Tracking
-------------------
Worklog Id: (was: 786815)
Time Spent: 20m (was: 10m)
> Fix when to read an additional record from a BZip2 text file split
> ------------------------------------------------------------------
>
> Key: HADOOP-18321
> URL: https://issues.apache.org/jira/browse/HADOOP-18321
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 3.3.3
> Reporter: Ashutosh Gupta
> Assignee: Ashutosh Gupta
> Priority: Critical
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Fix a data correctness issue in TextInputFormat that can occur when reading
> BZip2-compressed text files. When triggered, this bug causes a split to
> return the first record of the succeeding split (the split that reads the
> next BZip2 block), thereby duplicating that record.
> *When is the bug triggered?*
> The bug occurs when the "needAdditionalRecord" flag in
> CompressedSplitLineReader is set to true by #fillBuffer at an
> inappropriate time: before we have read the remaining bytes of the split.
> This can happen when the inDelimiter parameter is true as #fillBuffer is
> invoked while reading the next line. The inDelimiter parameter is true
> when either 1) the last byte of the buffer is a CR character ('\r') when
> using the default delimiters, or 2) the last bytes of the buffer are a
> common prefix of the delimiter when using a custom delimiter.
> This can occur in various edge cases, illustrated by the unit tests added
> in this change; the five that would fail without the fix are listed below
> (a sketch of the trigger condition follows the list):
> # BaseTestLineRecordReaderBZip2.customDelimiter_lastRecordDelimiterStartsAtNextBlockStart
> # BaseTestLineRecordReaderBZip2.firstBlockEndsWithLF_secondBlockStartsWithCR
> # BaseTestLineRecordReaderBZip2.delimitedByCRSpanningThreeBlocks
> # BaseTestLineRecordReaderBZip2.usingCRDelimiterWithSmallestBufferSize
> # BaseTestLineRecordReaderBZip2.customDelimiter_lastThreeBytesInBlockAreDelimiter
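> For illustration, here is a minimal, self-contained sketch of the trigger
> condition with the default delimiters. It is not the Hadoop reader code
> itself; the class name is hypothetical, and only the '\r' check mirrors
> what CompressedSplitLineReader evaluates:
> ```java
> // A fillBuffer call returns a buffer ending in '\r' at a BZip2 block
> // boundary; the reader cannot yet know whether the next block starts
> // with the '\n' that would complete a "\r\n" delimiter.
> public class InDelimiterTrigger {
>   public static void main(String[] args) {
>     byte[] buffer = "last record of block\r".getBytes();
>     boolean inDelimiter = buffer.length > 0
>         && buffer[buffer.length - 1] == '\r';
>     System.out.println("inDelimiter = " + inDelimiter);  // prints: true
>     // The bug: on the next fillBuffer call, needAdditionalRecord was set
>     // whenever inDelimiter was true, even when that call was still
>     // filling bytes inside the split, not the first bytes past its end.
>   }
> }
> ```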
> For background, the purpose of the "needAdditionalRecord" field in
> CompressedSplitLineReader is to indicate to LineRecordReader, via the
> #needAdditionalRecordAfterSplit method, that an extra record lying beyond
> the split range should be included in the split. This complication arises
> from a problem inherent in splitting text files: when a split starts at a
> position greater than zero, we do not know whether its first line belongs
> to the last record of the prior split or is a new record. Hadoop's
> solution is to have splits that start at a position greater than zero
> always discard the first line, and to have the prior split decide whether
> it should include the first line of the next split (as part of its last
> record or as a new record). This works well even in the case of a single
> line spanning multiple splits, as the sketch below illustrates.
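> A minimal sketch of this convention (simplified; the real logic lives in
> LineRecordReader#initialize and #nextKeyValue, and emit() plus the buffer
> limits here are stand-ins, not the actual implementation):
> ```java
> import java.io.IOException;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.SplitLineReader;
>
> public class SplitConventionSketch {
>   static void readSplit(SplitLineReader in, long start, long end)
>       throws IOException {
>     Text line = new Text();
>     long pos = start;
>     if (start != 0) {
>       // A split that starts mid-file discards its first line: it may be
>       // the tail of the previous split's last record.
>       pos += in.readLine(line, Integer.MAX_VALUE, Integer.MAX_VALUE);
>     }
>     // Read while inside the split; the reader may also claim one extra
>     // record beyond the split end so that the line discarded by the
>     // next split is not lost.
>     while (pos <= end || in.needAdditionalRecordAfterSplit()) {
>       int consumed = in.readLine(line, Integer.MAX_VALUE, Integer.MAX_VALUE);
>       if (consumed == 0) {
>         break;  // end of stream
>       }
>       pos += consumed;
>       emit(line);  // hypothetical consumer of the record
>     }
>   }
>   static void emit(Text record) {
>     System.out.println(record);
>   }
> }
> ```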
> *What is the fix?*
> The fix is to prevent ever setting "needAdditionalRecord" unless the bytes
> filled into the buffer are the bytes immediately outside the range of the
> split.
> When reading compressed data, CompressedSplitLineReader requires/assumes
> that the stream's #read method never returns bytes from more than one
> compression block at a time. This ensures that #fillBuffer gets invoked to
> read the first byte of the next block. This next block may or may not be
> part of the split we are reading. If we detect that the last bytes of the
> prior block may be part of a delimiter, then we may decide that we should
> read an additional record, but we should only do that when this next block
> is not part of our split *and* we aren't filling the buffer again beyond
> our split range. This is because we are only concerned with whether we
> need to read the very first line of the next split as a separate record.
> If it is going to be part of the last record, then we don't need to read
> an extra record; and in the special case of CR + LF (i.e. "\r\n"), if the
> LF is the first byte of the next split, it will be treated as an empty
> line, so we don't need to include an extra record either.
> Thus, to emphasize: it is the read that returns the first bytes outside
> our split range that matters. The current logic in
> CompressedSplitLineReader does not take this into account, in contrast to
> UncompressedSplitLineReader, which does. A hedged sketch of the shape of
> the fix follows.
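> A hypothetical sketch of the shape of the fix, not the actual patch in
> PR #4521: the fields splitEnd, position, and usingCRLF stand in for the
> state the real CompressedSplitLineReader tracks, and the exact boundary
> convention may differ.
> ```java
> import java.io.IOException;
> import java.io.InputStream;
>
> public class FillBufferFixSketch {
>   private long splitEnd;               // end offset of this split
>   private long position;               // bytes consumed from the stream
>   private boolean usingCRLF;           // true with default delimiters
>   private boolean needAdditionalRecord;
>
>   protected int fillBuffer(InputStream in, byte[] buffer,
>       boolean inDelimiter) throws IOException {
>     long posBefore = position;
>     int bytesRead = in.read(buffer);
>     if (bytesRead > 0) {
>       position += bytesRead;
>     }
>     // True only for the single fill whose bytes are the first ones lying
>     // immediately outside the split range; earlier fills must never set
>     // needAdditionalRecord, which is exactly what the bug allowed.
>     boolean firstFillPastSplit =
>         posBefore <= splitEnd && position > splitEnd;
>     if (inDelimiter && bytesRead > 0 && firstFillPastSplit) {
>       if (usingCRLF) {
>         // A '\n' immediately after the '\r' that ended our split is an
>         // empty line for the next split, so no extra record is needed.
>         needAdditionalRecord = (buffer[0] != '\n');
>       } else {
>         needAdditionalRecord = true;
>       }
>     }
>     return bytesRead;
>   }
> }
> ```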