Jason Lowe commented on HADOOP-15206:

Thanks for updating the patch!

It seems the basic problem is that split 0, the first split, is _always_ 
responsible for the first record even if that record is technically past the 
byte offset of the end of the split.  That's because all other splits will 
unconditionally throw away the first (potentially partial) record under the 
assumption the previous split is responsible for it.  Therefore we need to do 
two things to avoid drops and duplicates:
* If the first split ends before the start of the first bz2 block then we need 
to avoid advertising the updated byte position until we have started to consume 
the first bz2 block.  This avoids the dropped record.
* If subsequent splits start before the first bz2 block begins then we need to 
make sure any split that starts before the first block is artificially pushed 
past that first block.  This avoids the duplicates.
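
As a rough illustration of those two rules (plain self-contained Java; the class and method names are mine, and `firstBlockStart` is assumed to be known here, whereas the real reader discovers it while scanning):

```java
public class SplitAdjustSketch {
    /**
     * Rule 1: the position advertised to the record reader. Split 0 keeps
     * reporting its original start until the first bz2 block has actually
     * begun, so the reader never believes it is past the end of the split
     * before it has consumed the first record (no drop).
     */
    static long reportedPosition(long splitStart, long rawPos, boolean firstBlockStarted) {
        if (splitStart == 0 && !firstBlockStarted) {
            return 0;  // hold back the position update
        }
        return rawPos;
    }

    /**
     * Rule 2: the effective start of a split. Any split other than the
     * first that would begin at or before the first block's offset is
     * pushed one byte past that offset, so it throws away its leading
     * (would-be duplicate) partial record as usual (no duplicate).
     */
    static long effectiveStart(long splitStart, long firstBlockStart) {
        if (splitStart > 0 && splitStart <= firstBlockStart) {
            return firstBlockStart + 1;
        }
        return splitStart;
    }

    public static void main(String[] args) {
        // With the first block at offset 4: a split starting at 3 is
        // pushed to 5, while split 0 and a split at 10 are unchanged.
        System.out.println(effectiveStart(3, 4));   // 5
        System.out.println(effectiveStart(0, 4));   // 0
        System.out.println(effectiveStart(10, 4));  // 10
    }
}
```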

I'm wondering if it gets cleaner if we move this logic into readStreamHeader() 
and always call it.  That method can check the starting position and do one of 
the following:
* check for and read the full header if it is at starting position 0
* do nothing if start pos is past the full header + 1
* verify the bytes being skipped are the expected header bytes if start pos is 
between 0 and full_header+1.  If they are not the expected bytes then we reset 
the buffered input (just like the starting pos 0 logic does today if the header 
is not found)

In the constructor we should be able to avoid updating the reported position if 
starting position is 0 (so we will always read into the first bz2 block), 
otherwise we advertise after reading any header so subsequent splits always 
start at least one byte after the start of the first bz2 block.
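
A minimal sketch of that three-case shape (self-contained Java; `HEADER`, the method signature, and the return-value convention are illustrative assumptions, not the actual Hadoop code, and the real magic includes a block-size digit after "BZh"):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class HeaderCheckSketch {
    static final byte[] HEADER = {'B', 'Z', 'h'};  // assumed magic bytes

    /**
     * Reads or verifies the stream header depending on the starting
     * position, mirroring the three cases above:
     *   start == 0                 -> read and check the full header
     *   start >  HEADER.length     -> nothing to do
     *   0 < start <= HEADER.length -> the bytes being consumed must match
     *                                 the tail of the expected header
     * Returns false where the real code would reset the buffered input.
     */
    static boolean readStreamHeader(InputStream in, long start) throws IOException {
        if (start > HEADER.length) {
            return true;  // already past the header
        }
        // Verify from offset `start` to the end of the header.
        for (int i = (int) start; i < HEADER.length; i++) {
            int b = in.read();
            if (b != (HEADER[i] & 0xff)) {
                return false;  // caller resets the buffered input
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] stream = {'B', 'Z', 'h', '9', 1, 2, 3};
        // A split starting at 0 consumes and checks the whole header.
        System.out.println(readStreamHeader(new ByteArrayInputStream(stream), 0));
        // A split starting at 2 verifies only the remaining header byte 'h'.
        ByteArrayInputStream mid = new ByteArrayInputStream(stream);
        mid.skip(2);
        System.out.println(readStreamHeader(mid, 2));
        // Non-header bytes at start 0 fail verification.
        System.out.println(readStreamHeader(
            new ByteArrayInputStream(new byte[]{9, 9, 9}), 0));
    }
}
```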

> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>                 Key: HADOOP-15206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15206
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.0
>            Reporter: Aki Tanaka
>            Priority: Major
>         Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch
> BZip2 can drop and duplicate records when the input split size is small. I 
> confirmed that this issue happens when the input split size is between 1 byte 
> and 4 bytes.
> I am seeing the following two problematic behaviors.
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small.
> I set the split size to 3 and tested loading 100 records (0, 1, 2, ..., 99):
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records instead of 100.
> 2. Duplicate record:
> Two input splits contain the same BZip2 record when the input split size is 
> small.
> I set the split size to 1 and tested loading 100 records (0, 1, 2, ..., 99):
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
> I experienced this error when executing a Spark (SparkSQL) job under the 
> following conditions:
> * The input files are small (around 1 KB)
> * The Hadoop cluster has many slave nodes (able to launch many executor tasks)

This message was sent by Atlassian JIRA
