[ 
https://issues.apache.org/jira/browse/HADOOP-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578274#comment-17578274
 ] 

ASF GitHub Bot commented on HADOOP-18400:
-----------------------------------------

hadoop-yetus commented on PR #4732:
URL: https://github.com/apache/hadoop/pull/4732#issuecomment-1211581648

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 47s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  15m 29s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  28m 32s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  25m 15s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |  22m  0s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   4m 30s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 15s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 29s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 59s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   4m 54s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  24m 45s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 27s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 43s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  24m 24s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |  24m 24s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  21m 56s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |  21m 56s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   4m 20s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 12s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 19s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   2m  0s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   5m  8s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  25m  1s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  18m 23s |  |  hadoop-common in the patch 
passed.  |
   | +1 :green_heart: |  unit  |   7m 28s |  |  hadoop-mapreduce-client-core in 
the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 17s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 255m 10s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4732/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4732 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 851d47b409b4 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 
17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 41861391e6ccc3f980d23014b655fd9e22e58ed9 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private 
Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4732/1/testReport/ |
   | Max. process+thread count | 1285 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
U: . |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4732/1/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




>  Fix file split duplicating records from a succeeding split when reading 
> BZip2 text files 
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18400
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18400
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.3.3, 3.3.4
>            Reporter: groot
>            Assignee: groot
>            Priority: Critical
>              Labels: pull-request-available
>
> Fix data correctness issue with TextInputFormat that can occur when reading 
> BZip2 compressed text files. When a file split's range does not include the 
> start position of a BZip2 block, then it is expected to contain no records 
> (i.e. the split is empty). However, if it so happens that the end of this 
> split (exclusive) is at the start of a BZip2 block, then LineRecordReader 
> ends up returning all the records for that BZip2 block. This ends up 
> duplicating records read by a job because the next split would also end up 
> returning all the records for the same block (since its range would include 
> the start of that block).
> This bug does not get triggered when the file split's range does include the 
> start of at least one block and ends just before the start of another block. 
> The reason for this has to do with when BZip2CompressionInputStream updates 
> its position when using the BYBLOCK READMODE. Using this read mode, the 
> stream's position while reading only gets updated when reading the first byte 
> past an end of a block marker. The bug is that if the stream, when 
> initialized, was adjusted to be at the end of one block, then we don't update 
> the position after we read the first byte of the next block. Rather, we keep 
> the position to be equal to the next block marker we've initialized to. If 
> the exclusive end position of the split is equal to stream's position, 
> LineRecordReader will continue to read lines until the position is updated 
> (an an additional record in the next block is read if needed).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to