Problem: I am comparing two jobs. They both have the same input content; however, in one job the input file has been gzipped, and in the other it has not. I get far fewer output rows from the gzipped input than I do from the uncompressed version:
Lines in output:
Gzipped: 86851
Uncompressed: 6569303

The gzipped input file is 875MB in size, and the entire job runs in about 30 seconds. The uncompressed file takes around 5 minutes to run.

Hadoop version: 0.18.1, r694836

Here is the output of the map task of the compressed input:

2009-05-07 14:54:53,492 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2009-05-07 14:54:53,636 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 12
2009-05-07 14:54:53,663 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2009-05-07 14:54:53,994 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2009-05-07 14:54:54,005 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2009-05-07 14:55:05,026 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 45410962; bufvoid = 99614720
2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 87923; length = 327680
2009-05-07 14:55:08,624 INFO org.apache.hadoop.mapred.MapTask: Index: (0, 3786199, 3786199)
2009-05-07 14:55:08,969 INFO org.apache.hadoop.mapred.MapTask: Index: (3786199, 3789579, 3789579)
2009-05-07 14:55:09,292 INFO org.apache.hadoop.mapred.MapTask: Index: (7575778, 3859183, 3859183)
2009-05-07 14:55:09,610 INFO org.apache.hadoop.mapred.MapTask: Index: (11434961, 3792449, 3792449)
2009-05-07 14:55:09,929 INFO org.apache.hadoop.mapred.MapTask: Index: (15227410, 3818963, 3818963)
2009-05-07 14:55:10,241 INFO org.apache.hadoop.mapred.MapTask: Index: (19046373, 3780875, 3780875)
2009-05-07 14:55:10,559 INFO org.apache.hadoop.mapred.MapTask: Index: (22827248, 3814950, 3814950)
2009-05-07 14:55:10,882 INFO org.apache.hadoop.mapred.MapTask: Index: (26642198, 3871426, 3871426)
2009-05-07 14:55:11,197 INFO org.apache.hadoop.mapred.MapTask: Index: (30513624, 3799971, 3799971)
2009-05-07 14:55:11,513 INFO org.apache.hadoop.mapred.MapTask: Index: (34313595, 3813327, 3813327)
2009-05-07 14:55:11,834 INFO org.apache.hadoop.mapred.MapTask: Index: (38126922, 3835208, 3835208)
2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Index: (41962130, 3747048, 3747048)
2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2009-05-07 14:55:12,160 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200905071451_0001_m_000000_0: No outputs to promote from hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/kerry.common/_temporary/_attempt_200905071451_0001_m_000000_0
2009-05-07 14:55:12,162 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200905071451_0001_m_000000_0' done.

Am I doing something wrong? Is there anything else I can do to debug this? Is it a known bug?

Let me know if you need anything else, thanks.
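
One check that might help narrow this down is to count the lines in both input files outside of Hadoop and compare them with the record counts each job reports. The following is only a minimal sketch in Python, not part of the jobs above; the function name count_lines and the file names in the usage lines are hypothetical placeholders.

```python
import gzip


def count_lines(path, compressed):
    """Count the lines in a plain or gzipped text file by streaming it."""
    # gzip.open decompresses on the fly; plain open reads the file as-is.
    opener = gzip.open if compressed else open
    count = 0
    with opener(path, "rb") as stream:
        for _ in stream:
            count += 1
    return count


if __name__ == "__main__":
    import os

    # Hypothetical local copies of the two job inputs.
    for name, is_gz in (("input.txt", False), ("input.txt.gz", True)):
        if os.path.exists(name):
            print(name, count_lines(name, is_gz))
```

If both local counts agree with each other but not with the number of input records the gzipped job reports, that would suggest the difference arises when the job reads the compressed file rather than in the data itself.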