1) when running in pseudo-distributed mode, only 2 values for the reduce
count are accepted, 0 and 1. All other positive values are mapped to 1.

2) The single reduce task spawned has several steps, and each of these steps
account for about 1/3 of it's overall progress.

The 1st third, is collecting all of the map outputs, each of which in your
example has 0 records.
The 2nd third, is to produce a single sorted set from all of the map
outputs.
The 3rd third, is to reduce the sorted set.

So, you get progress reports,

3) The framework is designed for working on large clusters of machines where
there needs to be a little delay between operations to avoid massive network
loading spikes, and the initial setup of the map task execution environment
on a machine, and the initial setup of the reduce task execution environment
take a bit of time.
In production jobs, these delays and setup times are lost in the overall
task run time.
In the small test job case the delays and setup times will be the bulk of
the time spent executing the test.

On Wed, Apr 1, 2009 at 10:31 PM, Sriram Krishnan <sri...@sdsc.edu> wrote:

> Hi all,
>
> I am new to this list, and relatively new to Hadoop itself. So if this
> question has been answered before, please point me to the right thread.
>
> We are investigating the use of Hadoop for processing of geo-spatial data.
> In its most basic form, out data is laid out in files, where every row has
> the format -
> {index, x, y, z, ....}
>
> I am writing some basic Hadoop programs for selecting data based on x and y
> values, and everything appears to work correctly. I have Hadoop 0.19.1
> running in pseudo-distributed on a Linux box. However, as a academic
> exercise, I began writing some code that simply reads every single line of
> my input file, and does nothing else - I hoped to gain an understanding on
> how long it would take for Hadoop/HDFS to read the entire data set. My Map
> and Reduce functions are as follows:
>
>        public void map(LongWritable key, Text value,
>                        OutputCollector<Text, NullWritable> output,
>                        Reporter reporter) throws IOException {
>
>            // do nothing
>            return;
>        }
>
>        public void reduce(Text key, Iterator<NullWritable> values,
>                           OutputCollector<Text, NullWritable> output,
>                           Reporter reporter) throws IOException {
>            // do nothing
>            return;
>        }
>
> My understanding is that the above map function will produce no
> intermediate key/value pairs - and hence, the reduce function should take no
> time at all. However, when I run this code, Hadoop seems to spend an
> inordinate amount of time in the reduce phase. Here is the Hadoop output -
>
> 09/04/01 20:11:12 INFO mapred.JobClient: Running job: job_200904011958_0005
> 09/04/01 20:11:13 INFO mapred.JobClient:  map 0% reduce 0%
> 09/04/01 20:11:21 INFO mapred.JobClient:  map 3% reduce 0%
> 09/04/01 20:11:25 INFO mapred.JobClient:  map 7% reduce 0%
> ....
> 09/04/01 20:13:17 INFO mapred.JobClient:  map 96% reduce 0%
> 09/04/01 20:13:20 INFO mapred.JobClient:  map 100% reduce 0%
> 09/04/01 20:13:30 INFO mapred.JobClient:  map 100% reduce 4%
> 09/04/01 20:13:35 INFO mapred.JobClient:  map 100% reduce 7%
> ...
> 09/04/01 20:14:05 INFO mapred.JobClient:  map 100% reduce 25%
> 09/04/01 20:14:10 INFO mapred.JobClient:  map 100% reduce 29%
> 09/04/01 20:14:15 INFO mapred.JobClient: Job complete:
> job_200904011958_0005
> 09/04/01 20:14:15 INFO mapred.JobClient: Counters: 15
> 09/04/01 20:14:15 INFO mapred.JobClient:   File Systems
> 09/04/01 20:14:15 INFO mapred.JobClient:     HDFS bytes read=1787707732
> 09/04/01 20:14:15 INFO mapred.JobClient:     Local bytes read=10
> 09/04/01 20:14:15 INFO mapred.JobClient:     Local bytes written=932
> 09/04/01 20:14:15 INFO mapred.JobClient:   Job Counters
> 09/04/01 20:14:15 INFO mapred.JobClient:     Launched reduce tasks=1
> 09/04/01 20:14:15 INFO mapred.JobClient:     Launched map tasks=27
> 09/04/01 20:14:15 INFO mapred.JobClient:     Data-local map tasks=27
> 09/04/01 20:14:15 INFO mapred.JobClient:   Map-Reduce Framework
> 09/04/01 20:14:15 INFO mapred.JobClient:     Reduce input groups=1
> 09/04/01 20:14:15 INFO mapred.JobClient:     Combine output records=0
> 09/04/01 20:14:15 INFO mapred.JobClient:     Map input records=44967808
> 09/04/01 20:14:15 INFO mapred.JobClient:     Reduce output records=0
> 09/04/01 20:14:15 INFO mapred.JobClient:     Map output bytes=2
> 09/04/01 20:14:15 INFO mapred.JobClient:     Map input bytes=1787601210
> 09/04/01 20:14:15 INFO mapred.JobClient:     Combine input records=0
> 09/04/01 20:14:15 INFO mapred.JobClient:     Map output records=1
> 09/04/01 20:14:15 INFO mapred.JobClient:     Reduce input records=0
>
> As you can see, the reduce phase takes a little more than a minute - which
> is about a third of the execution time. However, the number of reduce tasks
> spawned is 1, and reduce input records is 0. Why does it spend so long on
> the reduce phase if there are 0 input records to be read? Furthermore, if
> the number of reduce jobs is 1, how is Hadoop able to report back the
> percentage completion of the reduce phase? Updating the number of reduce
> tasks using the JobConf.setNumReduceTasks() has no effect on the parallelism
> of map and reduce tasks.
>
> Another interesting aspect is that my Hadoop code to do a select on the
> input files based on x and y values runs faster than my above Hadoop code -
> the select code contains a map function that emits the selected rows as
> intermediate keys, while the reduce code is pretty much an identity
> function. In fact, in this case, I see parallel execution of map and reduce
> tasks. I had thought that my Select code should be slower - because not only
> is it reading every single line of input (similar to my above experiment),
> but it is also doing some writes based on the selection criteria.
>
> Thanks in advance for any pointers!
> Sriram
>
> --
> Sriram Krishnan, Ph.D.
> San Diego Supercomputer Center
> http://www.sdsc.edu/~sriram <http://www.sdsc.edu/%7Esriram>
>
>
>
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Reply via email to