[ https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351044#comment-16351044 ]

Jason Lowe commented on HADOOP-15206:
-------------------------------------

I found a bit of time to look into this, so I'm dumping my notes here.  I'm not 
sure when I'll get more time to work on it, so if someone feels brave enough to 
step in, feel free.

Here's how I believe records get dropped with very small split sizes:
 # There's only one bz2 block in the file
 # The split size is smaller than 4 bytes
 # The first split starts to read the data. It consumes the 'BZh9' magic header 
and then updates the reported byte position of the stream to 4
 # At this point the first split reader is beyond the end of its split before 
it has read a single record, so it returns with no records
 # The second split starts in the middle of the 'BZh9' magic header, scans 
forward to find the start of a bz2 block, and starts processing the split
 # Since this is not the first split, it throws away the first record on the 
assumption that the previous split is responsible for it
 # The second split reader proceeds to consume all remaining data, since the 
byte position is not updated until the next bz2 block and there's only one block
 # The end result is that the first record is lost, since the first split never 
consumed it.
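
To make that sequence concrete, here is a self-contained toy simulation of the 
byte-position bookkeeping (it models only the reported offsets, not real bzip2 
decoding; the class name, numbers, and comparison are illustrative, not the 
actual Hadoop reader code):

{code:java}
// Toy model of the drop scenario: split size 3, a single bz2 block, 4-byte 'BZh9' header.
public class Bzip2SplitDropDemo {
  public static void main(String[] args) {
    final long headerLen = 4;                    // 'BZh9'
    final long splitEnd = 3;                     // first split covers [0, 3)
    final String[] blockRecords = {"rec-0", "rec-1", "rec-2"};

    // Split 0: opening the stream consumes the header, so the reported
    // position is already 4 before a single record has been returned.
    long reportedPos = headerLen;
    int emittedBySplit0 = 0;
    while (reportedPos <= splitEnd) {            // 4 <= 3 is false on the very first check
      emittedBySplit0++;                         // never reached
    }

    // Split 1: scans forward, finds the lone bz2 block and, because it is not
    // the first split, discards the first record on the assumption that the
    // previous split already emitted it.
    int emittedBySplit1 = blockRecords.length - 1;

    System.out.println("split 0 emitted " + emittedBySplit0 + " records"); // 0
    System.out.println("split 1 emitted " + emittedBySplit1 + " records"); // 2, so rec-0 is dropped
  }
}
{code}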

I think we can fix this scenario by not advertising a new byte position after 
reading the 'BZh9' header and only updating the byte position when we read the 
bz2 block header following the current bz2 block.
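
As a rough sketch of that idea (purely illustrative; the class and method names 
below are made up and this is not the actual Hadoop bzip2 stream code), the 
position advertised upstream would only move when a block boundary is crossed:

{code:java}
// Hypothetical position-reporting policy: the offset advertised to the record
// reader only advances when a new bz2 block header is reached, never for the
// 4-byte 'BZh9' stream header.
class BlockBoundaryPositionReporter {
  private long reportedPos;    // offset advertised via getPos()
  private long consumedBytes;  // real bytes consumed from the underlying file

  void onStreamHeaderConsumed(int headerLen) {
    consumedBytes += headerLen;        // do NOT advance reportedPos here
  }

  void onNewBlockHeader(long blockStartOffset) {
    consumedBytes = blockStartOffset;
    reportedPos = blockStartOffset;    // only block boundaries move the split cursor
  }

  long getPos() {
    return reportedPos;
  }
}
{code}

With a policy like this, the first split's reported position stays at 0 after 
the header is consumed, so it gets a chance to read the first record instead of 
returning immediately.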

I didn't get as much time to look into the duplicated record scenario, but I 
suspect multiple splits end up discovering the beginning of the bz2 block and 
think it is their block to consume. Not sure yet how we can easily distinguish 
which split is the one, true split that is responsible for consuming the bz2 
block given we're hiding the true byte offset from the upper layers most of the 
time.

> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>
>                 Key: HADOOP-15206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15206
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.0
>            Reporter: Aki Tanaka
>            Priority: Major
>         Attachments: HADOOP-15206-test.patch
>
>
> BZip2 can drop and duplicate records when the input split size is small. I 
> confirmed that this issue happens when the input split size is between 1 byte 
> and 4 bytes.
> I am seeing the following two problematic behaviors.
>  
> 1. Dropped record:
> BZip2 skips the first record in the input file when the input split size is 
> small.
>  
> I set the split size to 3 and loaded 100 records (0, 1, 2, ..., 99):
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records instead of 100.
>  
> 2. Duplicated record:
> Two input splits contain the same BZip2 records when the input split size is 
> small.
>  
> I set the split size to 1 and loaded 100 records (0, 1, 2, ..., 99):
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I executed a Spark (SparkSQL) job under the 
> following conditions:
> * The input files are small (around 1 KB each)
> * The Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
