[ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354317#comment-16354317
 ] 

Aki Tanaka commented on HADOOP-15206:
-------------------------------------

Thank you for the comment! I've updated the patch.

 
>The code above is making the following assumptions that I believe could not be 
>true in some cases:
>The bzip2 file header is always present at starting pos 0. I think it should 
>be checking isHeaderStripped/isSubHeaderStripped.
>If the split starts after byte 0 but before byte 5 then it must also end on or 
>before byte 5. (Splits are not required to be equally sized.)
 
Thank you for pointing these out. I've modified the patch code and uploaded to 
the Jira.

 
> If the bzip2 file header is four bytes, why is the condition {{<= 4}} instead 
> of {{< 4}}? Should this code leverage HEADER_LEN and SUB_HEADER_LEN here?

When we changed the condition to < 4, the first bz2 block will duplicate. 
Because 4 is a position of the first bz2 block marker, and an input stream will 
start reading the first bz2 block if the start position of the input stream is 
4. 

The concept of my patch is changing the position of the first BZ2 block marker 
from 4 (when the bzip2 file header is present) to 0 forcibly. So, if the input 
stream tries to read from position 1-4, it will drop the first BZ2 block even 
though the block marker position is 4. I am not sure whether this is the best 
solution but I could not come up with another idea.

> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>
>                 Key: HADOOP-15206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15206
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.0
>            Reporter: Aki Tanaka
>            Priority: Major
>         Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to