[Hadoop Wiki] Update of "HadoopMapReduce" by DougCutting

Apache Wiki Wed, 13 Feb 2008 09:26:38 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The following page has been changed by DougCutting:
http://wiki.apache.org/hadoop/HadoopMapReduce

The comment on the change is:
update for TLP move

------------------------------------------------------------------------------
  == Map ==
  
  As the Map operation is parallelized the input file set is first
- split to several pieces called [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/FileSplit.html FileSplits]. If an individual file
+ split to several pieces called [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileSplit.html FileSplits]. If an individual file
  is so large that it will affect seek time it will be split to
  several Splits. The splitting does not know anything about the
  input file's internal logical structure, for example
@@ -18, +18 @@

  
  When an individual map task starts it will open a new output
  writer per configured reduce task. It will then proceed to read
- its !FileSplit using the [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/RecordReader.html RecordReader] it gets from the specified
+ its !FileSplit using the [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/RecordReader.html RecordReader] it gets from the specified
- [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/InputFormat.html InputFormat]. !InputFormat parses the input and generates
+ [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html InputFormat]. !InputFormat parses the input and generates
- key-value pairs. !InputFormat must also handle records that may be split on the !FileSplit boundary. For example [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/TextInputFormat.java?view=markup TextInputFormat] will read the last line of the !FileSplit past the split boundary and, when reading other than the first !FileSplit, !TextInputFormat ignores the content up to the first newline.
+ key-value pairs. !InputFormat must also handle records that may be split on the !FileSplit boundary. For example [http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/TextInputFormat.java?view=markup TextInputFormat] will read the last line of the !FileSplit past the split boundary and, when reading other than the first !FileSplit, !TextInputFormat ignores the content up to the first newline.
  
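The split-boundary rule described above for !TextInputFormat can be illustrated with a small standalone sketch. It does not use the Hadoop classes; the class name `SplitLineReaderSketch` and the method `readSplit` are hypothetical, and the byte-array input stands in for a file. The sketch assumes that every split other than the first skips content up to the first newline, and that every split keeps reading a line that starts at or before its end offset, even when the line finishes past the boundary.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitLineReaderSketch {

    // Returns the lines owned by the split covering [start, start + length).
    // Hypothetical helper for illustration, not the Hadoop implementation.
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;

        // Any split other than the first ignores content up to and including
        // the first newline; the previous split is responsible for that line.
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline itself
        }

        List<String> lines = new ArrayList<>();
        // Read every line that starts at or before the split end. The last
        // such line may extend past the boundary and is still read in full.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 10;
        for (int start = 0; start < data.length; start += splitSize) {
            System.out.println("split at " + start + ": " + readSplit(data, start, splitSize));
        }
    }
}
```

With these two rules every line is read by exactly one split, even though the splits themselves are cut at arbitrary byte offsets.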
  It is not necessary for the !InputFormat to
  generate both meaningful keys ''and'' values. For example the
@@ -30, +30 @@

  offsets.
  
  As key-value pairs are read from the !RecordReader they are
- passed to the configured [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/Mapper.html Mapper]. The user supplied Mapper does
+ passed to the configured [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/Mapper.html Mapper]. The user supplied Mapper does
- whatever it wants with the input pair and calls [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/OutputCollector.html#collect(org.apache.hadoop.io.WritableComparable,%20org.apache.hadoop.io.Writable) OutputCollector.collect] with key-value pairs of its own choosing. The output it
+ whatever it wants with the input pair and calls [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/OutputCollector.html#collect(org.apache.hadoop.io.WritableComparable,%20org.apache.hadoop.io.Writable) OutputCollector.collect] with key-value pairs of its own choosing. The output it
  generates must use one key class and one value class.  This is because
- the Map output will be written into a [http://wiki.apache.org/lucene-hadoop/SequenceFile SequenceFile]
+ the Map output will be written into a SequenceFile
  which has per-file type information and all the records must
  have the same type (use subclassing if you want to output
  different data structures). The Map input and output key-value
@@ -41, +41 @@

  
  When Mapper output is collected it is partitioned, which means
  that it will be written to the output specified by the
- [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/Partitioner.html Partitioner]. The default [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/lib/HashPartitioner.html HashPartitioner] uses the
- hashcode function on the key's class (which means that this hashcode function must be good in order to achieve an even workload across the reduce tasks).  See [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/MapTask.java?view=markup        MapTask] for details.
+ [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/Partitioner.html Partitioner]. The default [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/HashPartitioner.html HashPartitioner] uses the
+ hashcode function on the key's class (which means that this hashcode function must be good in order to achieve an even workload across the reduce tasks).  See [http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/MapTask.java?view=markup          MapTask] for details.
  
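The hash-based partitioning described above can be sketched without any Hadoop dependency. The class name `HashPartitionSketch` and the method `partition` are hypothetical; the arithmetic (mask off the sign bit of the hashcode, then take the remainder by the number of reduce tasks) is an assumption about how a hash partitioner of this kind usually works, so see !HashPartitioner itself for the real code.

```java
public class HashPartitionSketch {

    // Picks the reduce task (partition) for a key.
    // Hypothetical helper: masking with Integer.MAX_VALUE clears the sign bit
    // so that a negative hashCode() never yields a negative partition index.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "cherry", "apple"};
        int numReduceTasks = 3;
        for (String key : keys) {
            // Equal keys always land in the same partition, which is what
            // lets a single reduce task see every value for a given key.
            System.out.println(key + " -> partition " + partition(key, numReduceTasks));
        }
    }
}
```

A key class whose hashCode returns the same value for every instance would send all map output to one reduce task, which is why the quality of the hashcode function matters for an even workload.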
  N input files will generate M map tasks to be run and each map
  task will generate as many output files as there are reduce
@@ -80, +80 @@

  == Reduce ==
  When a reduce task starts, its input is scattered in many files across all the nodes where map tasks ran. If run in
  distributed mode these need to be first copied to the local
- filesystem in a ''copy phase'' (see [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/ReduceTaskRunner.java?view=markup ReduceTaskRunner]).
+ filesystem in a ''copy phase'' (see [http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/ReduceTaskRunner.java?view=markup ReduceTaskRunner]).
  
  Once all the data is available locally it is appended to one
  file in an ''append phase''. The file is then merge sorted so that the key-value pairs for
  a given key are contiguous (''sort phase''). This makes the actual reduce operation simple: the file is
  read sequentially and the values are passed to the reduce method
  with an iterator reading the input file until the next key
- value is encountered. See [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/ReduceTask.java?view=markup          ReduceTask] for details.
+ value is encountered. See [http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/ReduceTask.java?view=markup    ReduceTask] for details.
  
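The sort-then-scan idea behind the reduce phase can be sketched in plain Java. Nothing here comes from !ReduceTask; the class name `SortedReduceSketch`, the method `reduceSorted`, and the choice of summing integer values as the reduce function are all assumptions made for illustration. An in-memory list stands in for the merged, sorted file.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SortedReduceSketch {

    // Sorts the pairs by key, then walks them once, handing each run of
    // equal keys to a (hypothetical) reduce step that sums the values.
    static Map<String, Integer> reduceSorted(List<Map.Entry<String, Integer>> pairs) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(pairs);
        sorted.sort(Map.Entry.comparingByKey()); // stands in for the sort phase

        Map<String, Integer> output = new LinkedHashMap<>();
        int i = 0;
        while (i < sorted.size()) {
            String key = sorted.get(i).getKey();
            int sum = 0;
            // Consume values until the next key is encountered; because the
            // input is sorted, all values for this key are contiguous.
            while (i < sorted.size() && sorted.get(i).getKey().equals(key)) {
                sum += sorted.get(i).getValue();
                i++;
            }
            output.put(key, sum);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                new AbstractMap.SimpleEntry<>("b", 1),
                new AbstractMap.SimpleEntry<>("a", 2),
                new AbstractMap.SimpleEntry<>("b", 3));
        System.out.println(reduceSorted(pairs));
    }
}
```

Because equal keys are adjacent after sorting, the scan never has to hold more than one key's running state at a time.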
  At the end, the output will consist of one output file per executed reduce
  task. The format of the files can be specified with
- [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/JobConf.html#setOutputFormat(java.lang.Class) JobConf.setOutputFormat]. If !SequentialOutputFormat is used then the output key and value
+ [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputFormat(java.lang.Class) JobConf.setOutputFormat]. If !SequentialOutputFormat is used then the output key and value
  classes must also be specified.
