[ https://issues.apache.org/jira/browse/HADOOP-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Qiming He updated HADOOP-9442:
------------------------------
Description:
$ cat abook.txt | base64 -w 0 > onelinetext.b64
$ hadoop fs -put onelinetext.b64 /input/onelinetext.b64
$ hadoop jar hadoop-streaming.jar \
-input /input/onelinetext.b64 \
-output /output \
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
-mapper wc
Num task: 1, and output has one line:
Line 1: 1 2 202699
which makes sense because one line per mapper is intended.
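The counts reported by wc (1 line, 2 words, 202699 bytes) would be consistent with streaming handing each record to the mapper as key, tab, value when the input format is not TextInputFormat. That reading is an assumption, not something stated in this report. A minimal local sketch, using a made-up short payload instead of abook.txt and a hypothetical helper name:

```shell
# Hypothetical stand-in for the one-line base64 payload (12 characters).
payload="QUJDREVGRw=="

# Assumed record shape seen by the mapper: byte offset, tab, line text, newline.
# Pipes that record through wc and normalizes the spacing of its output.
record_counts() {
  printf '%s\t%s\n' "$1" "$2" | wc | awk '{print $1, $2, $3}'
}

record_counts 0 "$payload"   # prints: 1 2 15
```

The second word in the count comes from the offset key, which is why a single line of base64 text can show up as two words.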
Then, using compression with NLineInputFormat:
$ bzip2 onelinetext.b64
$ hadoop fs -put onelinetext.b64.bz2 /input/onelinetext.b64.bz2
$ hadoop jar hadoop-streaming.jar \
-Dmapred.input.compress=true \
-Dmapred.input.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input /input/onelinetext.b64.gz \
-output /output \
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
-mapper wc
I expected the same result as above, because decompression should occur
before the one-line text is processed (i.e. by wc). Instead, I am getting:
Num task: 397 (or another large number, depending on the environment), and the
output has 397 lines:
Line 1-396: 0 0 0
Line 397: 1 2 202699
Any idea why there are so many map tasks (mapred.map.tasks >> 1)? Is the
splitting incorrect? I purposely chose gzip because I believe it is NOT
splittable. I got similar results when using the bzip2 and lzop codecs.
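One possible explanation, offered as an assumption and not a confirmed diagnosis: if the split computation scans the raw compressed bytes for line terminators instead of the decompressed text, every 0x0A byte that happens to occur in the compressed stream would start a new split. The sketch below compares the two counts locally. The function names are hypothetical, and the file name assumes a gzip-compressed copy of the input exists in the current directory.

```shell
# Count newline (0x0A) bytes in a file's raw, undecoded bytes.
raw_newlines() {
  tr -cd '\n' < "$1" | wc -c | tr -d ' '
}

# Count newline bytes in the gzip-decompressed content of a file.
decompressed_newlines() {
  gzip -dc "$1" | tr -cd '\n' | wc -c | tr -d ' '
}

# Usage, only if a local gzip copy of the input is present.
if [ -f onelinetext.b64.gz ]; then
  echo "raw: $(raw_newlines onelinetext.b64.gz)"
  echo "decompressed: $(decompressed_newlines onelinetext.b64.gz)"
fi
```

A large raw count next to a very small decompressed count would be consistent with the splits being computed on compressed bytes.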
was:
$ cat abook.txt | base64 -w 0 > onelinetext.b64
$ hadoop fs -put onelinetext.b64 /input/onelinetext.b64
$ hadoop jar hadoop-streaming.jar \
-input /input/onelinetext.b64 \
-output /output \
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
-mapper wc
Num task: 1, and output has one line:
Line 1: 1 2 202699
which makes sense because one line per mapper is intended.
Then, using compression with NLineInputFormat:
$ bzip2 onelinetext.b64
$ hadoop fs -put onelinetext.b64.bz2 /input/onelinetext.b64.bz2
$ hadoop jar hadoop-streaming.jar \
-Dmapred.input.compress=true \
-Dmapred.input.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input /input/onelinetext.b64.bz2 \
-output /output \
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
-mapper wc
I expected the same result as above, because decompression should occur
before the one-line text is processed (i.e. by wc). Instead, I am getting:
Num task: 397 (or another large number, depending on the environment), and the
output has 397 lines:
Line 1-396: 0 0 0
Line 397: 1 2 202699
Any idea why there are so many map tasks (mapred.map.tasks >> 1)? Is the
splitting incorrect? I purposely chose gzip because I believe it is NOT
splittable. I got similar results when using the bzip2 and lzop codecs.
> Splitting issue when using NLineInputFormat with compression
> ------------------------------------------------------------
>
> Key: HADOOP-9442
> URL: https://issues.apache.org/jira/browse/HADOOP-9442
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 1.1.2
> Environment: Tried in Apache Hadoop 1.1.1, CDH4, and Amazon EMR. Same
> result.
> Reporter: Qiming He
> Priority: Minor
>
> $ cat abook.txt | base64 -w 0 > onelinetext.b64
> $ hadoop fs -put onelinetext.b64 /input/onelinetext.b64
> $ hadoop jar hadoop-streaming.jar \
> -input /input/onelinetext.b64 \
> -output /output \
> -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
> -mapper wc
> Num task: 1, and output has one line:
> Line 1: 1 2 202699
> which makes sense because one line per mapper is intended.
> Then, using compression with NLineInputFormat:
> $ bzip2 onelinetext.b64
> $ hadoop fs -put onelinetext.b64.bz2 /input/onelinetext.b64.bz2
> $ hadoop jar hadoop-streaming.jar \
> -Dmapred.input.compress=true \
> -Dmapred.input.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
> -input /input/onelinetext.b64.gz \
> -output /output \
> -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
> -mapper wc
> I expected the same result as above, because decompression should occur
> before the one-line text is processed (i.e. by wc). Instead, I am getting:
> Num task: 397 (or another large number, depending on the environment), and
> the output has 397 lines:
> Line 1-396: 0 0 0
> Line 397: 1 2 202699
> Any idea why there are so many map tasks (mapred.map.tasks >> 1)? Is the
> splitting incorrect? I purposely chose gzip because I believe it is NOT
> splittable. I got similar results when using the bzip2 and lzop codecs.