[
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354606#comment-16354606
]
Jason Lowe commented on HADOOP-15206:
-------------------------------------
Thanks for updating the patch!
{quote}Because 4 is a position of the first bz2 block marker, and an input
stream will start reading the first bz2 block if the start position of the
input stream is 4.
{quote}
Ah, right. Thanks for the explanation.
{quote}So, if the input stream tries to read from position 1-4, it will drop
the first BZ2 block even though the block marker position is 4.
{quote}
This doesn't just drop the first bzip2 block, it drops the entire split. This
goes back to my previous comment about the code assuming splits that start
between bytes 1-4 are always tiny. Splits do not have to be equally sized, so
theoretically there could be just two splits where the first split is a
two-byte split starting at offset 0 and the other split is the rest of the
file. I believe this change would cause all records to be dropped in that
scenario. To fix that I think we only need to report a position that is one
byte beyond the start of the first bzip2 block rather than at the end of the
entire split (i.e., header_len + 1 rather than end + 1).
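To make the arithmetic concrete, here is a minimal toy model of the two-split scenario. It is not Hadoop code: the class and method names are hypothetical, the split boundaries are made up, and it assumes the record reader keeps reading only while the position reported by the stream does not exceed the end of the split.

```java
// Toy model only, not Hadoop code. Every name below is hypothetical.
final class SplitPositionSketch {
    // The thread places the first bzip2 block marker at offset 4.
    static final long HEADER_LEN = 4;

    // Assumption: the record reader continues only while the position
    // reported by the stream does not exceed the end of the split.
    static boolean readsAnything(long reportedPos, long splitEnd) {
        return reportedPos <= splitEnd;
    }

    // Behavior the comment attributes to the patch for a split that starts
    // inside the header: report one byte past the end of the split.
    static long reportPastSplitEnd(long splitEnd) {
        return splitEnd + 1;
    }

    // Suggested alternative: report one byte past the start of the first block.
    static long reportPastFirstBlockStart() {
        return HEADER_LEN + 1;
    }

    public static void main(String[] args) {
        long restOfFileEnd = 1000; // hypothetical end of a split that starts at offset 2
        long tinySplitEnd = 3;     // hypothetical end of a one-byte split that starts at offset 2

        System.out.println("end + 1, split covering the rest of the file reads data: "
            + readsAnything(reportPastSplitEnd(restOfFileEnd), restOfFileEnd));
        System.out.println("header_len + 1, split covering the rest of the file reads data: "
            + readsAnything(reportPastFirstBlockStart(), restOfFileEnd));
        System.out.println("header_len + 1, tiny split reads data: "
            + readsAnything(reportPastFirstBlockStart(), tinySplitEnd));
    }
}
```

Under these assumptions, reporting end + 1 silences the large split as well as the tiny one, while header_len + 1 silences only splits that end before the first block begins.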
The logic regarding the header seems backwards. If the header is stripped, then
a header was present, yet the logic only adds up bytes for a header length when
it was *not* stripped, which is the case when the header is not there. I'm
wondering how it's working, since I think the header is always there in the
unit tests.
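The reasoning about the header can be sketched as follows. This is a toy illustration and not the code under review: the class and method names are hypothetical, and the only outside fact it relies on is that a bzip2 stream header consists of the four bytes 'B', 'Z', 'h' and a block-size digit from '1' to '9'.

```java
import java.nio.charset.StandardCharsets;

// Toy model only, not the code under review. Every name below is hypothetical.
final class HeaderLengthSketch {
    // A bzip2 stream header is 'B', 'Z', 'h', then a block-size digit '1'..'9'.
    static final int HEADER_LEN = 4;

    // Returns true when the given bytes begin with a bzip2 stream header.
    static boolean hasHeader(byte[] prefix) {
        return prefix.length >= HEADER_LEN
            && prefix[0] == 'B' && prefix[1] == 'Z' && prefix[2] == 'h'
            && prefix[3] >= '1' && prefix[3] <= '9';
    }

    // Header bytes to count toward the offset of the first block.
    // A header can only have been stripped if it was present.
    static int headerBytes(boolean headerStripped) {
        return headerStripped ? HEADER_LEN : 0;
    }

    // The inverted form that the comment questions: it counts header bytes
    // only when nothing was stripped, i.e. when no header was there.
    static int headerBytesInverted(boolean headerStripped) {
        return headerStripped ? 0 : HEADER_LEN;
    }

    public static void main(String[] args) {
        byte[] withHeader = "BZh9".getBytes(StandardCharsets.US_ASCII);
        boolean stripped = hasHeader(withHeader); // header present, so it gets stripped

        System.out.println("header bytes counted: " + headerBytes(stripped));
        System.out.println("header bytes counted (inverted): " + headerBytesInverted(stripped));
    }
}
```

If the unit-test inputs always carry a header, the inverted form would count zero header bytes for them, which is why it is surprising that the tests pass.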
> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.8.3, 3.0.0
> Reporter: Aki Tanaka
> Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch,
> HADOOP-15206.002.patch
>
>
> BZip2 can drop and duplicate records when the input split size is small. I
> confirmed that this issue happens when the input split size is between 1 byte
> and 4 bytes.
> I am seeing the following two problem behaviors.
>
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is
> small.
>
> I set the split size to 3 and tested loading 100 records (0, 1, 2, ..., 99):
> {code:java}
> 2018-02-01 10:52:33,502 INFO [Thread-17] mapred.TestTextInputFormat
> (TestTextInputFormat.java:verifyPartitions(317)) -
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
> count=99{code}
> The input format read only 99 records instead of 100.
>
> 2. Duplicate record:
> Two input splits contain the same BZip2 records when the input split size is
> small.
>
> I set the split size to 1 and tested loading 100 records (0, 1, 2, ..., 99):
>
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
> count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4
> at position 8
> {code}
>
> I hit this error when running a Spark (SparkSQL) job under the following
> conditions:
> * The input files are small (around 1 KB)
> * The Hadoop cluster has many slave nodes (able to launch many executor tasks)
>