Hi,

What input format are you using for the GZipped file?
I don't believe there is a GZip input format, although some people have discussed whether it is feasible...

Cheers

Tim

On Thu, May 7, 2009 at 9:05 PM, Malcolm Matalka <mmata...@millennialmedia.com> wrote:
> Problem:
>
> I am comparing two jobs. They both have the same input content; however,
> in one job the input file has been gzipped, and in the other it has not.
> I get far fewer output rows in the gzipped result than I do in the
> uncompressed version:
>
> Lines in output:
> Gzipped: 86851
> Uncompressed: 65693I03
>
> The gzipped input file is 875MB in size, and the entire job runs in
> about 30 seconds. The uncompressed file takes around 5 minutes to run.
>
> Hadoop version:
> 0.18.1, r694836
>
> Here is the output of the map task for the compressed input:
>
> 2009-05-07 14:54:53,492 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> 2009-05-07 14:54:53,636 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 12
> 2009-05-07 14:54:53,663 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
> 2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
> 2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
> 2009-05-07 14:54:53,994 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> 2009-05-07 14:54:54,005 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2009-05-07 14:55:05,026 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
> 2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 45410962; bufvoid = 99614720
> 2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 87923; length = 327680
> 2009-05-07 14:55:08,624 INFO org.apache.hadoop.mapred.MapTask: Index: (0, 3786199, 3786199)
> 2009-05-07 14:55:08,969 INFO org.apache.hadoop.mapred.MapTask: Index: (3786199, 3789579, 3789579)
> 2009-05-07 14:55:09,292 INFO org.apache.hadoop.mapred.MapTask: Index: (7575778, 3859183, 3859183)
> 2009-05-07 14:55:09,610 INFO org.apache.hadoop.mapred.MapTask: Index: (11434961, 3792449, 3792449)
> 2009-05-07 14:55:09,929 INFO org.apache.hadoop.mapred.MapTask: Index: (15227410, 3818963, 3818963)
> 2009-05-07 14:55:10,241 INFO org.apache.hadoop.mapred.MapTask: Index: (19046373, 3780875, 3780875)
> 2009-05-07 14:55:10,559 INFO org.apache.hadoop.mapred.MapTask: Index: (22827248, 3814950, 3814950)
> 2009-05-07 14:55:10,882 INFO org.apache.hadoop.mapred.MapTask: Index: (26642198, 3871426, 3871426)
> 2009-05-07 14:55:11,197 INFO org.apache.hadoop.mapred.MapTask: Index: (30513624, 3799971, 3799971)
> 2009-05-07 14:55:11,513 INFO org.apache.hadoop.mapred.MapTask: Index: (34313595, 3813327, 3813327)
> 2009-05-07 14:55:11,834 INFO org.apache.hadoop.mapred.MapTask: Index: (38126922, 3835208, 3835208)
> 2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Index: (41962130, 3747048, 3747048)
> 2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
> 2009-05-07 14:55:12,160 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200905071451_0001_m_000000_0: No outputs to promote from hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/kerry.common/_temporary/_attempt_200905071451_0001_m_000000_0
> 2009-05-07 14:55:12,162 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200905071451_0001_m_000000_0' done.
>
> Am I doing something wrong? Is there anything else I can do to debug
> this? Is it a known bug?
>
> Let me know if you need anything else, thanks.
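
On the debugging question above, one check that can be done outside Hadoop is to count the lines in the .gz file locally. The sketch below is in Python and is not from the original thread; the function names are hypothetical. It counts newlines across the whole file, then again in the first gzip member only. If the two numbers differ, the file consists of several concatenated gzip members, and a reader that stops after the first member would see only part of the data. Whether Hadoop 0.18.1 behaves that way is an assumption that has not been verified here.

```python
import gzip
import zlib


def count_newlines_all_members(path, chunk_size=1 << 20):
    """Count newline bytes across every gzip member in the file.

    Python's gzip module keeps reading through concatenated members.
    """
    count = 0
    with gzip.open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            count += chunk.count(b"\n")
    return count


def count_newlines_first_member(path, chunk_size=1 << 20):
    """Count newline bytes in the first gzip member only.

    A raw zlib decompressor in gzip mode stops at the end of the first
    member and sets .eof; any later members are left undecoded.
    """
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)  # 16 selects gzip framing
    count = 0
    with open(path, "rb") as f:
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # ran out of input before the member ended
            count += decomp.decompress(chunk).count(b"\n")
    return count
```

Calling both functions on a local copy of the input and comparing the results is enough. If the first-member count is much smaller than the total, that would be consistent with the short run time and low row count reported for the gzipped job; if the two counts match, the cause lies elsewhere.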