
cutting Thu, 31 Jul 2008 14:05:54 -0700
Modified: hadoop/core/branches/branch-0.18/docs/mapred_tutorial.html
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.18/docs/mapred_tutorial.html?rev=681496&r1=681495&r2=681496&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.18/docs/mapred_tutorial.html (original)
+++ hadoop/core/branches/branch-0.18/docs/mapred_tutorial.html Thu Jul 31 14:05:00 2008
@@ -310,7 +310,7 @@
 <a href="#Example%3A+WordCount+v2.0">Example: WordCount v2.0</a>
 <ul class="minitoc">
 <li>
-<a href="#Source+Code-N10DD5">Source Code</a>
+<a href="#Source+Code-N10DF5">Source Code</a>
 </li>
 <li>
 <a href="#Sample+Runs">Sample Runs</a>
@@ -1116,7 +1116,24 @@
 <br>
         
 </p>
-<a name="N104EC"></a><a name="Walk-through"></a>
+<p> Applications can specify a comma-separated list of paths that
+        will be present in the current working directory of the task
+        using the option <span class="codefrag">-files</span>. The <span class="codefrag">-libjars</span>
+        option allows applications to add jars to the classpaths of the maps
+        and reduces. The <span class="codefrag">-archives</span> option allows them to pass archives
+        as arguments; these are unzipped/unjarred, and a link with the name of the
+        jar/zip is created in the current working directory of tasks. More
+        details about the command line options are available in the
+        <a href="commands_manual.html">Commands manual</a>.
+</p>
+<p>Running the <span class="codefrag">wordcount</span> example with 
+        <span class="codefrag">-libjars</span> and <span class="codefrag">-files</span>:<br>
+        
+<span class="codefrag"> hadoop jar hadoop-examples.jar wordcount -files cachefile.txt 
+        -libjars mylib.jar input output </span> 
+        
+</p>
+<a name="N1050C"></a><a name="Walk-through"></a>
 <h3 class="h4">Walk-through</h3>
 <p>The <span class="codefrag">WordCount</span> application is quite straight-forward.</p>
 <p>The <span class="codefrag">Mapper</span> implementation (lines 14-26), via the 
@@ -1226,7 +1243,7 @@
 </div>
     
     
-<a name="N105A3"></a><a name="Map%2FReduce+-+User+Interfaces"></a>
+<a name="N105C3"></a><a name="Map%2FReduce+-+User+Interfaces"></a>
 <h2 class="h3">Map/Reduce - User Interfaces</h2>
 <div class="section">
 <p>This section provides a reasonable amount of detail on every user-facing 
@@ -1245,12 +1262,12 @@
 <p>Finally, we will wrap up by discussing some useful features of the
       framework such as the <span class="codefrag">DistributedCache</span>, 
       <span class="codefrag">IsolationRunner</span> etc.</p>
-<a name="N105DC"></a><a name="Payload"></a>
+<a name="N105FC"></a><a name="Payload"></a>
 <h3 class="h4">Payload</h3>
 <p>Applications typically implement the <span class="codefrag">Mapper</span> and 
         <span class="codefrag">Reducer</span> interfaces to provide the <span class="codefrag">map</span> and 
         <span class="codefrag">reduce</span> methods. These form the core of the job.</p>
-<a name="N105F1"></a><a name="Mapper"></a>
+<a name="N10611"></a><a name="Mapper"></a>
 <h4>Mapper</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Mapper.html">
@@ -1306,7 +1323,7 @@
           <a href="api/org/apache/hadoop/io/compress/CompressionCodec.html">
           CompressionCodec</a> to be used via the <span class="codefrag">JobConf</span>.
           </p>
-<a name="N10667"></a><a name="How+Many+Maps%3F"></a>
+<a name="N10687"></a><a name="How+Many+Maps%3F"></a>
 <h5>How Many Maps?</h5>
 <p>The number of maps is usually driven by the total size of the 
             inputs, that is, the total number of blocks of the input files.</p>
@@ -1319,7 +1336,7 @@
             <a href="api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)">
             setNumMapTasks(int)</a> (which only provides a hint to the framework) 
             is used to set it even higher.</p>
-<a name="N1067F"></a><a name="Reducer"></a>
+<a name="N1069F"></a><a name="Reducer"></a>
 <h4>Reducer</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Reducer.html">
@@ -1342,18 +1359,18 @@
 <p>
 <span class="codefrag">Reducer</span> has 3 primary phases: shuffle, sort and reduce.
           </p>
-<a name="N106AF"></a><a name="Shuffle"></a>
+<a name="N106CF"></a><a name="Shuffle"></a>
 <h5>Shuffle</h5>
 <p>Input to the <span class="codefrag">Reducer</span> is the sorted output of the
             mappers. In this phase the framework fetches the relevant partition 
             of the output of all the mappers, via HTTP.</p>
-<a name="N106BC"></a><a name="Sort"></a>
+<a name="N106DC"></a><a name="Sort"></a>
 <h5>Sort</h5>
 <p>The framework groups <span class="codefrag">Reducer</span> inputs by keys (since 
             different mappers may have output the same key) in this stage.</p>
 <p>The shuffle and sort phases occur simultaneously; while 
             map-outputs are being fetched they are merged.</p>
-<a name="N106CB"></a><a name="Secondary+Sort"></a>
+<a name="N106EB"></a><a name="Secondary+Sort"></a>
 <h5>Secondary Sort</h5>
 <p>If equivalence rules for grouping the intermediate keys are 
               required to be different from those for grouping keys before 
@@ -1364,7 +1381,7 @@
               JobConf.setOutputKeyComparatorClass(Class)</a> can be used to 
               control how intermediate keys are grouped, these can be used in 
               conjunction to simulate <em>secondary sort on values</em>.</p>
-<a name="N106E4"></a><a name="Reduce"></a>
+<a name="N10704"></a><a name="Reduce"></a>
 <h5>Reduce</h5>
 <p>In this phase the 
             <a href="api/org/apache/hadoop/mapred/Reducer.html#reduce(K2, java.util.Iterator, org.apache.hadoop.mapred.OutputCollector, org.apache.hadoop.mapred.Reporter)">
@@ -1380,7 +1397,7 @@
             progress, set application-level status messages and update 
             <span class="codefrag">Counters</span>, or just indicate that they are alive.</p>
 <p>The output of the <span class="codefrag">Reducer</span> is <em>not sorted</em>.</p>
-<a name="N10712"></a><a name="How+Many+Reduces%3F"></a>
+<a name="N10732"></a><a name="How+Many+Reduces%3F"></a>
 <h5>How Many Reduces?</h5>
 <p>The right number of reduces seems to be <span class="codefrag">0.95</span> or 
             <span class="codefrag">1.75</span> multiplied by (&lt;<em>no. of nodes</em>&gt; * 
@@ -1395,7 +1412,7 @@
 <p>The scaling factors above are slightly less than whole numbers to 
             reserve a few reduce slots in the framework for speculative-tasks and
             failed tasks.</p>
-<a name="N10737"></a><a name="Reducer+NONE"></a>
+<a name="N10757"></a><a name="Reducer+NONE"></a>
 <h5>Reducer NONE</h5>
 <p>It is legal to set the number of reduce-tasks to <em>zero</em> if 
             no reduction is desired.</p>
@@ -1405,7 +1422,7 @@
             setOutputPath(Path)</a>. The framework does not sort the 
             map-outputs before writing them out to the <span class="codefrag">FileSystem</span>.
             </p>
-<a name="N10752"></a><a name="Partitioner"></a>
+<a name="N10772"></a><a name="Partitioner"></a>
 <h4>Partitioner</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Partitioner.html">
@@ -1419,7 +1436,7 @@
 <p>
 <a href="api/org/apache/hadoop/mapred/lib/HashPartitioner.html">
           HashPartitioner</a> is the default <span class="codefrag">Partitioner</span>.</p>
-<a name="N10771"></a><a name="Reporter"></a>
+<a name="N10791"></a><a name="Reporter"></a>
 <h4>Reporter</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Reporter.html">
@@ -1438,7 +1455,7 @@
           </p>
 <p>Applications can also update <span class="codefrag">Counters</span> using the 
           <span class="codefrag">Reporter</span>.</p>
-<a name="N1079B"></a><a name="OutputCollector"></a>
+<a name="N107BB"></a><a name="OutputCollector"></a>
 <h4>OutputCollector</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/OutputCollector.html">
@@ -1449,7 +1466,7 @@
 <p>Hadoop Map/Reduce comes bundled with a 
         <a href="api/org/apache/hadoop/mapred/lib/package-summary.html">
         library</a> of generally useful mappers, reducers, and partitioners.</p>
-<a name="N107B6"></a><a name="Job+Configuration"></a>
+<a name="N107D6"></a><a name="Job+Configuration"></a>
 <h3 class="h4">Job Configuration</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/JobConf.html">
@@ -1507,7 +1524,7 @@
         <a href="api/org/apache/hadoop/conf/Configuration.html#set(java.lang.String, java.lang.String)">set(String, String)</a>/<a href="api/org/apache/hadoop/conf/Configuration.html#get(java.lang.String, java.lang.String)">get(String, String)</a>
         to set/get arbitrary parameters needed by applications. However, use the 
         <span class="codefrag">DistributedCache</span> for large amounts of (read-only) data.</p>
-<a name="N10848"></a><a name="Task+Execution+%26+Environment"></a>
+<a name="N10868"></a><a name="Task+Execution+%26+Environment"></a>
 <h3 class="h4">Task Execution &amp; Environment</h3>
 <p>The <span class="codefrag">TaskTracker</span> executes the <span class="codefrag">Mapper</span>/ 
         <span class="codefrag">Reducer</span>  <em>task</em> as a child process in a separate jvm.
@@ -1739,7 +1756,7 @@
         <a href="native_libraries.html#Loading+native+libraries+through+DistributedCache">
         native_libraries.html</a>
 </p>
-<a name="N109E8"></a><a name="Job+Submission+and+Monitoring"></a>
+<a name="N10A08"></a><a name="Job+Submission+and+Monitoring"></a>
 <h3 class="h4">Job Submission and Monitoring</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/JobClient.html">
@@ -1800,7 +1817,7 @@
 <p>Normally the user creates the application, describes various facets 
         of the job via <span class="codefrag">JobConf</span>, and then uses the 
         <span class="codefrag">JobClient</span> to submit the job and monitor its progress.</p>
-<a name="N10A48"></a><a name="Job+Control"></a>
+<a name="N10A68"></a><a name="Job+Control"></a>
 <h4>Job Control</h4>
 <p>Users may need to chain Map/Reduce jobs to accomplish complex
           tasks which cannot be done via a single Map/Reduce job. This is fairly
@@ -1836,7 +1853,7 @@
             </li>
           
 </ul>
-<a name="N10A72"></a><a name="Job+Input"></a>
+<a name="N10A92"></a><a name="Job+Input"></a>
 <h3 class="h4">Job Input</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/InputFormat.html">
@@ -1884,7 +1901,7 @@
         appropriate <span class="codefrag">CompressionCodec</span>. However, it must be noted that
         compressed files with the above extensions cannot be <em>split</em> and 
         each compressed file is processed in its entirety by a single mapper.</p>
-<a name="N10ADC"></a><a name="InputSplit"></a>
+<a name="N10AFC"></a><a name="InputSplit"></a>
 <h4>InputSplit</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/InputSplit.html">
@@ -1898,7 +1915,7 @@
           FileSplit</a> is the default <span class="codefrag">InputSplit</span>. It sets 
           <span class="codefrag">map.input.file</span> to the path of the input file for the
           logical split.</p>
-<a name="N10B01"></a><a name="RecordReader"></a>
+<a name="N10B21"></a><a name="RecordReader"></a>
 <h4>RecordReader</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/RecordReader.html">
@@ -1910,7 +1927,7 @@
           for processing. <span class="codefrag">RecordReader</span> thus assumes the 
           responsibility of processing record boundaries and presents the tasks 
           with keys and values.</p>
-<a name="N10B24"></a><a name="Job+Output"></a>
+<a name="N10B44"></a><a name="Job+Output"></a>
 <h3 class="h4">Job Output</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/OutputFormat.html">
@@ -1935,7 +1952,7 @@
 <p>
 <span class="codefrag">TextOutputFormat</span> is the default 
         <span class="codefrag">OutputFormat</span>.</p>
-<a name="N10B4D"></a><a name="Task+Side-Effect+Files"></a>
+<a name="N10B6D"></a><a name="Task+Side-Effect+Files"></a>
 <h4>Task Side-Effect Files</h4>
 <p>In some applications, component tasks need to create and/or write to
           side-files, which differ from the actual job-output files.</p>
@@ -1974,7 +1991,7 @@
 <p>The entire discussion holds true for maps of jobs with 
            reducer=NONE (i.e. 0 reduces) since output of the map, in that case, 
            goes directly to HDFS.</p>
-<a name="N10B95"></a><a name="RecordWriter"></a>
+<a name="N10BB5"></a><a name="RecordWriter"></a>
 <h4>RecordWriter</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/RecordWriter.html">
@@ -1982,9 +1999,9 @@
           pairs to an output file.</p>
 <p>RecordWriter implementations write the job outputs to the 
           <span class="codefrag">FileSystem</span>.</p>
-<a name="N10BAC"></a><a name="Other+Useful+Features"></a>
+<a name="N10BCC"></a><a name="Other+Useful+Features"></a>
 <h3 class="h4">Other Useful Features</h3>
-<a name="N10BB2"></a><a name="Counters"></a>
+<a name="N10BD2"></a><a name="Counters"></a>
 <h4>Counters</h4>
 <p>
 <span class="codefrag">Counters</span> represent global counters, defined either by 
@@ -2001,7 +2018,7 @@
           in the <span class="codefrag">map</span> and/or 
           <span class="codefrag">reduce</span> methods. These counters are then globally 
           aggregated by the framework.</p>
-<a name="N10BE1"></a><a name="DistributedCache"></a>
+<a name="N10C01"></a><a name="DistributedCache"></a>
 <h4>DistributedCache</h4>
 <p>
 <a href="api/org/apache/hadoop/filecache/DistributedCache.html">
@@ -2072,7 +2089,7 @@
           <span class="codefrag">mapred.job.classpath.{files|archives}</span>. Similarly the
           cached files that are symlinked into the working directory of the
           task can be used to distribute native libraries and load them.</p>
-<a name="N10C64"></a><a name="Tool"></a>
+<a name="N10C84"></a><a name="Tool"></a>
 <h4>Tool</h4>
 <p>The <a href="api/org/apache/hadoop/util/Tool.html">Tool</a> 
           interface supports the handling of generic Hadoop command-line options.
@@ -2112,7 +2129,7 @@
             </span>
           
 </p>
-<a name="N10C96"></a><a name="IsolationRunner"></a>
+<a name="N10CB6"></a><a name="IsolationRunner"></a>
 <h4>IsolationRunner</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/IsolationRunner.html">
@@ -2136,7 +2153,7 @@
 <p>
 <span class="codefrag">IsolationRunner</span> will run the failed task in a single 
           jvm, which can be in the debugger, over precisely the same input.</p>
-<a name="N10CC9"></a><a name="Profiling"></a>
+<a name="N10CE9"></a><a name="Profiling"></a>
 <h4>Profiling</h4>
 <p>Profiling is a utility to get a representative sample (2 or 3) of
           built-in Java profiler output for a sample of maps and reduces. </p>
@@ -2169,7 +2186,7 @@
           <span class="codefrag">-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s</span>
           
 </p>
-<a name="N10CFD"></a><a name="Debugging"></a>
+<a name="N10D1D"></a><a name="Debugging"></a>
 <h4>Debugging</h4>
 <p>The Map/Reduce framework provides a facility to run user-provided 
           scripts for debugging. When a map/reduce task fails, the user can run 
@@ -2180,14 +2197,14 @@
 <p> In the following sections we discuss how to submit a debug script
           along with the job. To submit a debug script, it first has to be
           distributed. Then the script has to be supplied in the Configuration. </p>
-<a name="N10D09"></a><a name="How+to+distribute+script+file%3A"></a>
+<a name="N10D29"></a><a name="How+to+distribute+script+file%3A"></a>
 <h5> How to distribute script file: </h5>
 <p>
           The user has to use 
           <a href="mapred_tutorial.html#DistributedCache">DistributedCache</a>
           mechanism to <em>distribute</em> and <em>symlink</em> the
           debug script file.</p>
-<a name="N10D1D"></a><a name="How+to+submit+script%3A"></a>
+<a name="N10D3D"></a><a name="How+to+submit+script%3A"></a>
 <h5> How to submit script: </h5>
 <p> A quick way to submit a debug script is to set values for the 
           properties "mapred.map.task.debug.script" and 
@@ -2211,17 +2228,17 @@
 <span class="codefrag">$script $stdout $stderr $syslog $jobconf $program </span>  
           
 </p>
-<a name="N10D3F"></a><a name="Default+Behavior%3A"></a>
+<a name="N10D5F"></a><a name="Default+Behavior%3A"></a>
 <h5> Default Behavior: </h5>
 <p> For pipes, a default script is run to process core dumps under
           gdb; it prints the stack trace and gives info about running threads. </p>
-<a name="N10D4A"></a><a name="JobControl"></a>
+<a name="N10D6A"></a><a name="JobControl"></a>
 <h4>JobControl</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/jobcontrol/package-summary.html">
           JobControl</a> is a utility which encapsulates a set of Map/Reduce jobs
           and their dependencies.</p>
-<a name="N10D57"></a><a name="Data+Compression"></a>
+<a name="N10D77"></a><a name="Data+Compression"></a>
 <h4>Data Compression</h4>
 <p>Hadoop Map/Reduce provides facilities for the application-writer to
           specify compression for both intermediate map-outputs and the
@@ -2235,7 +2252,7 @@
           codecs for reasons of both performance (zlib) and non-availability of
           Java libraries (lzo). More details on their usage and availability are
           available <a href="native_libraries.html">here</a>.</p>
-<a name="N10D77"></a><a name="Intermediate+Outputs"></a>
+<a name="N10D97"></a><a name="Intermediate+Outputs"></a>
 <h5>Intermediate Outputs</h5>
 <p>Applications can control compression of intermediate map-outputs
             via the 
@@ -2244,7 +2261,7 @@
             <span class="codefrag">CompressionCodec</span> to be used via the
             <a href="api/org/apache/hadoop/mapred/JobConf.html#setMapOutputCompressorClass(java.lang.Class)">
             JobConf.setMapOutputCompressorClass(Class)</a> api.</p>
-<a name="N10D8C"></a><a name="Job+Outputs"></a>
+<a name="N10DAC"></a><a name="Job+Outputs"></a>
 <h5>Job Outputs</h5>
 <p>Applications can control compression of job-outputs via the
             <a href="api/org/apache/hadoop/mapred/FileOutputFormat.html#setCompressOutput(org.apache.hadoop.mapred.JobConf,%20boolean)">
@@ -2264,7 +2281,7 @@
 </div>
 
     
-<a name="N10DBB"></a><a name="Example%3A+WordCount+v2.0"></a>
+<a name="N10DDB"></a><a name="Example%3A+WordCount+v2.0"></a>
 <h2 class="h3">Example: WordCount v2.0</h2>
 <div class="section">
 <p>Here is a more complete <span class="codefrag">WordCount</span> which uses many of the
@@ -2274,7 +2291,7 @@
       <a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
       <a href="quickstart.html#Fully-Distributed+Operation">fully-distributed</a> 
       Hadoop installation.</p>
-<a name="N10DD5"></a><a name="Source+Code-N10DD5"></a>
+<a name="N10DF5"></a><a name="Source+Code-N10DF5"></a>
 <h3 class="h4">Source Code</h3>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
           
@@ -3484,7 +3501,7 @@
 </tr>
         
 </table>
-<a name="N11537"></a><a name="Sample+Runs"></a>
+<a name="N11557"></a><a name="Sample+Runs"></a>
 <h3 class="h4">Sample Runs</h3>
 <p>Sample text-files as input:</p>
 <p>
@@ -3652,7 +3669,7 @@
 <br>
         
 </p>
-<a name="N1160B"></a><a name="Highlights"></a>
+<a name="N1162B"></a><a name="Highlights"></a>
 <h3 class="h4">Highlights</h3>
 <p>The second version of <span class="codefrag">WordCount</span> improves upon the 
         previous one by using some features offered by the Map/Reduce framework:
svn commit: r681496 [2/3] - in /hadoop/core/branches/branch-0.18: build.xml docs/changes.html docs/commands_manual.html docs/commands_manual.pdf docs/mapred_tutorial.html docs/mapred_tutorial.pdf
