Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by AndreyKurdyumov:
http://wiki.apache.org/hadoop/HadoopMapReduce

The comment on the change is:
Fixed links broken due to src structure refactoring

------------------------------------------------------------------------------
  writer per configured reduce task. It will then proceed to read its !FileSplit using the [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/RecordReader.html RecordReader] it gets from the specified [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html InputFormat]. !InputFormat parses the input and generates
- key-value pairs. !InputFormat must also handle records that may be split on the !FileSplit boundary. For example, [http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/TextInputFormat.java?view=markup TextInputFormat] will read the last line of the !FileSplit past the split boundary and, when reading any !FileSplit other than the first, will ignore the content up to the first newline.
+ key-value pairs. !InputFormat must also handle records that may be split on the !FileSplit boundary. For example, [http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/TextInputFormat.java?view=markup TextInputFormat] will read the last line of the !FileSplit past the split boundary and, when reading any !FileSplit other than the first, will ignore the content up to the first newline.

  It is not necessary for the !InputFormat to generate both meaningful keys ''and'' values. For example the

@@ -42, +42 @@
  When Mapper output is collected it is partitioned, which means that it will be written to the output specified by the [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/Partitioner.html Partitioner]. The default [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/HashPartitioner.html HashPartitioner] uses the
- hashcode function on the key's class (which means that this hashcode function must distribute keys well in order to achieve an even workload across the reduce tasks). See [http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/MapTask.java?view=markup MapTask] for details.
+ hashcode function on the key's class (which means that this hashcode function must distribute keys well in order to achieve an even workload across the reduce tasks). See [http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/MapTask.java?view=markup MapTask] for details.

  N input files will generate M map tasks to be run and each map task will generate as many output files as there are reduce
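As a worked illustration of the split-boundary rule in the first hunk above: every reader except the one for the first !FileSplit discards bytes up to and including the first newline, because the previous split's reader already consumed the line that straddles the boundary. This is a hypothetical sketch, not the actual !TextInputFormat source (which lives in the linked file); `skipPartialFirstLine` is an invented helper name.

{{{
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

// Hypothetical sketch of the FileSplit boundary rule, not Hadoop source.
class SplitBoundarySketch {
  // Position 'in' on the first record owned by a split beginning at
  // 'start'. For any split but the first, skip up to and including the
  // first newline: the previous split's reader owns the straddling line.
  static long skipPartialFirstLine(FSDataInputStream in, long start)
      throws IOException {
    in.seek(start);
    long pos = start;
    if (start != 0) {
      int b;
      while ((b = in.read()) != -1) {
        pos++;
        if (b == '\n') {
          break;
        }
      }
    }
    return pos; // offset of the first byte this reader should parse
  }
}
}}}

Symmetrically, the reader keeps consuming past the split's end until it finishes the line it started, so every line is read exactly once across splits.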
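The hashcode partitioning mentioned in the second hunk amounts to a modulo over the number of reduce tasks. Here is a minimal !Partitioner in the spirit of !HashPartitioner, written against the old org.apache.hadoop.mapred API as a sketch rather than a copy of the shipped class:

{{{
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch of hashcode-based partitioning: mask the sign bit so the
// modulo is never negative, then map the key into one of
// numReduceTasks buckets. A skewed hashCode() on the key class skews
// the reduce workloads accordingly.
public class HashPartitionerSketch<K, V> implements Partitioner<K, V> {
  public void configure(JobConf job) {
    // no configuration needed for hash partitioning
  }

  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
}}}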
@@ -80, +80 @@
  == Reduce ==

  When a reduce task starts, its input is scattered in many files across all the nodes where map tasks ran. If run in distributed mode these need to be first copied to the local
- filesystem in a ''copy phase'' (see [http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/ReduceTaskRunner.java?view=markup ReduceTaskRunner]).
+ filesystem in a ''copy phase'' (see [https://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/ReduceTaskRunner.java?view=markup ReduceTaskRunner]).

  Once all the data is available locally it is appended to one file in an ''append phase''. The file is then merge sorted so that the key-value pairs for a given key are contiguous (''sort phase'').
  This makes the actual reduce operation simple: the file is read sequentially and the values are passed to the reduce method with an iterator reading the input file until the next key
- value is encountered. See [http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/ReduceTask.java?view=markup ReduceTask] for details.
+ value is encountered. See [http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/ReduceTask.java?view=markup ReduceTask] for details.
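Because the merged file is sorted, the iterator passed to `reduce()` streams values straight from the file until the next key appears; the reduce implementation never buffers a whole key group itself. A word-count-style reducer against the org.apache.hadoop.mapred API (the Text/IntWritable types are assumed for illustration):

{{{
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Illustrative reducer: reduce() is called once per distinct key, and
// the iterator yields that key's values from the sorted input file
// until the next key is encountered.
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
}}}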

  At the end, the output will consist of one output file per executed reduce task. The format of the files can be specified with