Hi all,
I am new to this list, and relatively new to Hadoop itself. So if this
question has been answered before, please point me to the right thread.
We are investigating the use of Hadoop for processing of geo-spatial
data. In its most basic form, our data is laid out in files, where
every row has the format -
{index, x, y, z, ....}
I am writing some basic Hadoop programs for selecting data based on x
and y values, and everything appears to work correctly. I have Hadoop
0.19.1 running in pseudo-distributed mode on a Linux box. However, as
an academic exercise, I began writing some code that simply reads
every single line of my input file and does nothing else - I hoped to
get a sense of how long it takes Hadoop/HDFS to read the entire data
set. My map and reduce functions are as follows:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public void map(LongWritable key, Text value,
                OutputCollector<Text, NullWritable> output,
                Reporter reporter) throws IOException {
    // do nothing - emit no intermediate key/value pairs
    return;
}

public void reduce(Text key, Iterator<NullWritable> values,
                   OutputCollector<Text, NullWritable> output,
                   Reporter reporter) throws IOException {
    // do nothing
    return;
}
My understanding is that the above map function will produce no
intermediate key/value pairs - and hence, the reduce function should
take no time at all. However, when I run this code, Hadoop seems to
spend an inordinate amount of time in the reduce phase. Here is the
Hadoop output -
09/04/01 20:11:12 INFO mapred.JobClient: Running job:
job_200904011958_0005
09/04/01 20:11:13 INFO mapred.JobClient: map 0% reduce 0%
09/04/01 20:11:21 INFO mapred.JobClient: map 3% reduce 0%
09/04/01 20:11:25 INFO mapred.JobClient: map 7% reduce 0%
....
09/04/01 20:13:17 INFO mapred.JobClient: map 96% reduce 0%
09/04/01 20:13:20 INFO mapred.JobClient: map 100% reduce 0%
09/04/01 20:13:30 INFO mapred.JobClient: map 100% reduce 4%
09/04/01 20:13:35 INFO mapred.JobClient: map 100% reduce 7%
...
09/04/01 20:14:05 INFO mapred.JobClient: map 100% reduce 25%
09/04/01 20:14:10 INFO mapred.JobClient: map 100% reduce 29%
09/04/01 20:14:15 INFO mapred.JobClient: Job complete:
job_200904011958_0005
09/04/01 20:14:15 INFO mapred.JobClient: Counters: 15
09/04/01 20:14:15 INFO mapred.JobClient: File Systems
09/04/01 20:14:15 INFO mapred.JobClient: HDFS bytes read=1787707732
09/04/01 20:14:15 INFO mapred.JobClient: Local bytes read=10
09/04/01 20:14:15 INFO mapred.JobClient: Local bytes written=932
09/04/01 20:14:15 INFO mapred.JobClient: Job Counters
09/04/01 20:14:15 INFO mapred.JobClient: Launched reduce tasks=1
09/04/01 20:14:15 INFO mapred.JobClient: Launched map tasks=27
09/04/01 20:14:15 INFO mapred.JobClient: Data-local map tasks=27
09/04/01 20:14:15 INFO mapred.JobClient: Map-Reduce Framework
09/04/01 20:14:15 INFO mapred.JobClient: Reduce input groups=1
09/04/01 20:14:15 INFO mapred.JobClient: Combine output records=0
09/04/01 20:14:15 INFO mapred.JobClient: Map input records=44967808
09/04/01 20:14:15 INFO mapred.JobClient: Reduce output records=0
09/04/01 20:14:15 INFO mapred.JobClient: Map output bytes=2
09/04/01 20:14:15 INFO mapred.JobClient: Map input bytes=1787601210
09/04/01 20:14:15 INFO mapred.JobClient: Combine input records=0
09/04/01 20:14:15 INFO mapred.JobClient: Map output records=1
09/04/01 20:14:15 INFO mapred.JobClient: Reduce input records=0
As you can see, the reduce phase takes a little over a minute - about
a third of the total execution time. However, only 1 reduce task is
spawned, and the reduce input record count is 0. Why does Hadoop spend
so long in the reduce phase if there are 0 input records to be read?
Furthermore, if the number of reduce tasks is 1, how is Hadoop able to
report a percentage completion for the reduce phase? Changing the
number of reduce tasks via JobConf.setNumReduceTasks() has no effect
on the parallelism of the map and reduce tasks.
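For completeness, here is roughly how my driver sets up the job. The
class names (ReadOnlyJob, ReadOnlyMapper, ReadOnlyReducer) and the
use of command-line arguments for the paths are placeholders, not my
exact code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ReadOnlyJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ReadOnlyJob.class);
        conf.setJobName("read-only-scan");
        conf.setMapperClass(ReadOnlyMapper.class);    // wraps the do-nothing map above
        conf.setReducerClass(ReadOnlyReducer.class);  // wraps the do-nothing reduce above
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);
        conf.setNumReduceTasks(1);  // changing this value makes no difference for me
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}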
Another interesting aspect is that my Hadoop code that does a select
on the input files based on x and y values runs faster than the
do-nothing code above - the select code contains a map function that
emits the selected rows as intermediate keys, while the reduce
function is pretty much an identity function. In fact, in this case, I
see parallel execution of the map and reduce tasks. I had expected the
select code to be slower - not only does it read every single line of
input (just like the experiment above), it also performs writes based
on the selection criteria.
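For reference, the map function in my select code looks roughly like
the sketch below. The bounding-box constants and field positions are
simplified stand-ins for my actual selection logic:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SelectMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

    // Illustrative bounds - stand-ins for my real selection criteria.
    private static final double X_MIN = 0.0, X_MAX = 100.0;
    private static final double Y_MIN = 0.0, Y_MAX = 100.0;

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, NullWritable> output,
                    Reporter reporter) throws IOException {
        // Rows look like: index, x, y, z, ...
        String[] fields = value.toString().split(",");
        double x = Double.parseDouble(fields[1].trim());
        double y = Double.parseDouble(fields[2].trim());
        if (x >= X_MIN && x <= X_MAX && y >= Y_MIN && y <= Y_MAX) {
            // Emit the selected row as the intermediate key.
            output.collect(value, NullWritable.get());
        }
    }
}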
Thanks in advance for any pointers!
Sriram
--
Sriram Krishnan, Ph.D.
San Diego Supercomputer Center
http://www.sdsc.edu/~sriram