[jira] [Updated] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

Jason Lowe (JIRA) Wed, 14 Feb 2018 15:04:08 -0800

     [ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jason Lowe updated HADOOP-15206:
--------------------------------
    Status: Patch Available  (was: Open)

Thanks for updating the patch! Looks good overall, just a few nits. I think 
we're close, so moving this to Patch Available so the QA bot can comment on 
this as well.

Why are we only skipping one byte at a time instead of trying to skip the rest 
of the way in one call? The code can track the remaining bytes in skipBytes, 
decrement that by the number of bytes skipped in the loop, then loop while 
skipBytes > 0.
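
A minimal sketch of the loop suggested above, assuming a plain java.io.InputStream. The class name SkipSketch and the helper skipFully are hypothetical and are not taken from the patch. Since InputStream.skip() may skip fewer bytes than requested, or none at all, the sketch falls back to a single read() to tell end of stream apart from a lack of progress.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class SkipSketch {
    // Hypothetical helper: skip exactly skipBytes bytes, or fail at end of stream.
    static void skipFully(InputStream in, long skipBytes) throws IOException {
        while (skipBytes > 0) {
            // Ask for everything that is left; skip() reports how much it managed.
            long skipped = in.skip(skipBytes);
            if (skipped <= 0) {
                // No progress from skip(): consume one byte via read() to detect EOF.
                if (in.read() < 0) {
                    throw new EOFException(
                        "stream ended with " + skipBytes + " bytes left to skip");
                }
                skipped = 1;
            }
            skipBytes -= skipped;
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {10, 20, 30, 40, 50});
        skipFully(in, 3);
        System.out.println(in.read()); // prints 40
    }
}
```

In the common case the loop body runs once, and it only iterates again when the underlying stream skips less than it was asked to.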

There is trailing whitespace on a couple of lines that would be nice to 
clean up. I expect the QA bot to flag this in its whitespace check.

I'm not sure it's necessary to call out the JIRA in the comments. That's what 
{{git blame}} is for. ;) Otherwise the code would be littered with JIRA numbers 
in every bugfix change.

"steam is on BZip2 header" should be "a split is before the first BZip2 block"

> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>
>                 Key: HADOOP-15206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15206
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.8.3
>            Reporter: Aki Tanaka
>            Priority: Major
>         Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch, HADOOP-15206.004.patch, 
> HADOOP-15206.005.patch, HADOOP-15206.006.patch
>
>
> BZip2 can drop and duplicate records when the input split size is small. I 
> confirmed that this issue happens when the input split size is between 1 byte 
> and 4 bytes.
> I am seeing the following two problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> I set the split size to 3 and tested loading 100 records (0, 1, 2, ..., 99).
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records instead of 100.
>  
> 2. Duplicate Record:
> Two input splits contain the same BZip2 records when the input split size is small
>  
> I set the split size to 1 and tested loading 100 records (0, 1, 2, ..., 99).
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I executed a Spark (SparkSQL) job under the 
> following conditions:
> * The input files are small (around 1KB)
> * The Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



