Hi,

What input format are you using for the GZipped file?

I don't believe there is a GZip input format, although some people have
discussed whether one is feasible...
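
In the meantime, one sanity check you could try, as a rough sketch only:
count the lines in the gzipped file outside Hadoop and compare that number
with the record counts of the two jobs. The class name GzipLineCount and the
local file path argument are made up for illustration, and this uses only
java.util.zip from the JDK, not any Hadoop API. Note that readLine() also
counts a final line that has no trailing newline.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of Hadoop: counts lines in a gzipped file.
public class GzipLineCount {

    // Decompresses the stream and counts newline-delimited records.
    static long countLines(InputStream gzipped) throws IOException {
        long lines = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a local copy of the gzipped input (assumed).
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(countLines(in));
        }
    }
}
```

If the local count matches the uncompressed job but not the gzipped one, that
would suggest the job is not reading the whole compressed file.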

Cheers

Tim

On Thu, May 7, 2009 at 9:05 PM, Malcolm Matalka
<mmata...@millennialmedia.com> wrote:
> Problem:
>
> I am comparing two jobs.  They both have the same input content; however,
> in one job the input file has been gzipped, and in the other it has not.
> I get far fewer output rows in the gzipped result than I do in the
> uncompressed version:
>
>
>
> Lines in output:
>
> Gzipped: 86851
>
> Uncompressed: 65693I03
>
>
>
> The gzipped input file is 875MB in size, and the entire job runs in
> about 30 seconds.  The uncompressed file takes around 5 minutes to run.
>
>
>
> Hadoop version:
>
> 0.18.1, r694836
>
>
>
> Here is the output of the map task of the compressed input:
>
> 2009-05-07 14:54:53,492 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=MAP, sessionId=
>
> 2009-05-07 14:54:53,636 INFO org.apache.hadoop.mapred.MapTask:
> numReduceTasks: 12
>
> 2009-05-07 14:54:53,663 INFO org.apache.hadoop.mapred.MapTask:
> io.sort.mb = 100
>
> 2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: data
> buffer = 79691776/99614720
>
> 2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: record
> buffer = 262144/327680
>
> 2009-05-07 14:54:53,994 INFO org.apache.hadoop.util.NativeCodeLoader:
> Loaded the native-hadoop library
>
> 2009-05-07 14:54:54,005 INFO
> org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded &
> initialized native-zlib library
>
> 2009-05-07 14:55:05,026 INFO org.apache.hadoop.mapred.MapTask: Starting
> flush of map output
>
> 2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: bufstart
> = 0; bufend = 45410962; bufvoid = 99614720
>
> 2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: kvstart =
> 0; kvend = 87923; length = 327680
>
> 2009-05-07 14:55:08,624 INFO org.apache.hadoop.mapred.MapTask: Index:
> (0, 3786199, 3786199)
>
> 2009-05-07 14:55:08,969 INFO org.apache.hadoop.mapred.MapTask: Index:
> (3786199, 3789579, 3789579)
>
> 2009-05-07 14:55:09,292 INFO org.apache.hadoop.mapred.MapTask: Index:
> (7575778, 3859183, 3859183)
>
> 2009-05-07 14:55:09,610 INFO org.apache.hadoop.mapred.MapTask: Index:
> (11434961, 3792449, 3792449)
>
> 2009-05-07 14:55:09,929 INFO org.apache.hadoop.mapred.MapTask: Index:
> (15227410, 3818963, 3818963)
>
> 2009-05-07 14:55:10,241 INFO org.apache.hadoop.mapred.MapTask: Index:
> (19046373, 3780875, 3780875)
>
> 2009-05-07 14:55:10,559 INFO org.apache.hadoop.mapred.MapTask: Index:
> (22827248, 3814950, 3814950)
>
> 2009-05-07 14:55:10,882 INFO org.apache.hadoop.mapred.MapTask: Index:
> (26642198, 3871426, 3871426)
>
> 2009-05-07 14:55:11,197 INFO org.apache.hadoop.mapred.MapTask: Index:
> (30513624, 3799971, 3799971)
>
> 2009-05-07 14:55:11,513 INFO org.apache.hadoop.mapred.MapTask: Index:
> (34313595, 3813327, 3813327)
>
> 2009-05-07 14:55:11,834 INFO org.apache.hadoop.mapred.MapTask: Index:
> (38126922, 3835208, 3835208)
>
> 2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Index:
> (41962130, 3747048, 3747048)
>
> 2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Finished
> spill 0
>
> 2009-05-07 14:55:12,160 INFO org.apache.hadoop.mapred.TaskRunner:
> attempt_200905071451_0001_m_000000_0: No outputs to promote from
> hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/kerry.common/
> _temporary/_attempt_200905071451_0001_m_000000_0
>
> 2009-05-07 14:55:12,162 INFO org.apache.hadoop.mapred.TaskRunner: Task
> 'attempt_200905071451_0001_m_000000_0' done.
>
>
>
>
>
> Am I doing something wrong?  Is there anything else I can do to debug
> this?  Is it a known bug?
>
>
>
> Let me know if you need anything else, thanks.
>
>
