Author: ddas
Date: Wed Jun 11 04:39:55 2008
New Revision: 666624

URL: http://svn.apache.org/viewvc?rev=666624&view=rev
Log:
Merge -r 666619:666620 from trunk onto 0.18 branch. Fixes HADOOP-3096.

Modified:
    hadoop/core/branches/branch-0.18/CHANGES.txt
    hadoop/core/branches/branch-0.18/docs/mapred_tutorial.html
    hadoop/core/branches/branch-0.18/docs/mapred_tutorial.pdf
    hadoop/core/branches/branch-0.18/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml
    hadoop/core/branches/branch-0.18/src/docs/src/documentation/content/xdocs/site.xml

Modified: hadoop/core/branches/branch-0.18/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.18/CHANGES.txt?rev=666624&r1=666623&r2=666624&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.18/CHANGES.txt (original)
+++ hadoop/core/branches/branch-0.18/CHANGES.txt Wed Jun 11 04:39:55 2008
@@ -282,6 +282,9 @@
     HADOOP-3379. Documents stream.non.zero.exit.status.is.failure for Streaming.
     (Amareshwari Sriramadasu via ddas)
 
+    HADOOP-3096. Improves documentation about the Task Execution Environment in 
+    the Map-Reduce tutorial. (Amareshwari Sriramadasu via ddas)
+
   OPTIMIZATIONS
 
     HADOOP-3274. The default constructor of BytesWritable creates empty 

Modified: hadoop/core/branches/branch-0.18/docs/mapred_tutorial.html
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.18/docs/mapred_tutorial.html?rev=666624&r1=666623&r2=666624&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.18/docs/mapred_tutorial.html (original)
+++ hadoop/core/branches/branch-0.18/docs/mapred_tutorial.html Wed Jun 11 04:39:55 2008
@@ -301,7 +301,7 @@
 <a href="#Example%3A+WordCount+v2.0">Example: WordCount v2.0</a>
 <ul class="minitoc">
 <li>
-<a href="#Source+Code-N10C87">Source Code</a>
+<a href="#Source+Code-N10D77">Source Code</a>
 </li>
 <li>
 <a href="#Sample+Runs">Sample Runs</a>
@@ -1542,42 +1542,170 @@
 </p>
 <p>Users/admins can also specify the maximum virtual memory 
         of the launched child-task using <span 
class="codefrag">mapred.child.ulimit</span>.</p>
-<p>When the job starts, the localized job directory
-        <span class="codefrag"> 
${mapred.local.dir}/taskTracker/jobcache/$jobid/</span>
-        has the following directories: </p>
+<p>The task tracker has a local directory,
+        <span class="codefrag"> ${mapred.local.dir}/taskTracker/</span> to 
create localized
+        cache and localized job. It can define multiple local directories 
+        (spanning multiple disks) and then each filename is assigned to a
+        semi-random local directory. When the job starts, task tracker 
+        creates a localized job directory relative to the local directory
+        specified in the configuration. Thus the task tracker directory 
+        structure looks like the following: </p>
 <ul>
         
-<li> A job-specific shared directory, created at location
-        <span 
class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ </span>.
-        This directory is exposed to the users through 
-        <span class="codefrag">job.local.dir </span>. The tasks can use this 
space as scratch
-        space and share files among them. The directory can accessed through 
-        api <a 
href="api/org/apache/hadoop/mapred/JobConf.html#getJobLocalDir()">
-        JobConf.getJobLocalDir()</a>. It is available as System property also.
-        So,users can call <span 
class="codefrag">System.getProperty("job.local.dir")</span>;
-        </li>
-        
-<li>A jars directory, which has the job jar file and expanded jar </li>
+<li>
+<span class="codefrag">${mapred.local.dir}/taskTracker/archive/</span> :
+        The distributed cache. This directory holds the localized distributed
+        cache. Thus localized distributed cache is shared among all
+        the tasks and jobs </li>
         
-<li>A job.xml file, the generic job configuration </li>
+<li>
+<span class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/</span> 
:
+        The localized job directory 
+        <ul>
         
-<li>Each task has directory <span class="codefrag">task-id</span> which again 
has the 
-        following structure
+<li>
+<span 
class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/work/</span> 
+        : The job-specific shared directory. The tasks can use this space as 
+        scratch space and share files among them. This directory is exposed
+        to the users through the configuration property  
+        <span class="codefrag">job.local.dir</span>. The directory can be accessed through 
+        api <a 
href="api/org/apache/hadoop/mapred/JobConf.html#getJobLocalDir()">
+        JobConf.getJobLocalDir()</a>. It is available as System property also.
+        So, users (streaming etc.) can call 
+        <span class="codefrag">System.getProperty("job.local.dir")</span> to 
access the 
+        directory.</li>
+        
+<li>
+<span 
class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/jars/</span>
+        : The jars directory, which has the job jar file and expanded jar.
+        The <span class="codefrag">job.jar</span> is the application's jar 
file that is
+        automatically distributed to each machine. It is expanded in jars
+        directory before the tasks for the job start. The job.jar location
+        is accessible to the application through the api
+        <a href="api/org/apache/hadoop/mapred/JobConf.html#getJar()"> 
+        JobConf.getJar() </a>. To access the unjarred directory,
+        JobConf.getJar().getParent() can be called.</li>
+        
+<li>
+<span 
class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/job.xml</span>
+        : The job.xml file, the generic job configuration, localized for 
+        the job. </li>
+        
+<li>
+<span 
class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid</span>
+        : The task directory for each task attempt. Each task directory
+        again has the following structure :
         <ul>
         
-<li>A job.xml file, task localized job configuration </li>
+<li>
+<span 
class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/job.xml</span>
+        : A job.xml file, task localized job configuration, Task localization
+        means that properties have been set that are specific to
+        this particular task within the job. The properties localized for 
+        each task are described below.</li>
+        
+<li>
+<span 
class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/output</span>
+        : A directory for intermediate output files. This contains the
+        temporary map reduce data generated by the framework
+        such as map output files etc. </li>
+        
+<li>
+<span 
class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/work</span>
+        : The current working directory of the task. </li>
+        
+<li>
+<span 
class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/work/tmp</span>
+        : The temporary directory for the task. 
+        (User can specify the property <span 
class="codefrag">mapred.child.tmp</span> to set
+        the value of temporary directory for map and reduce tasks. This 
+        defaults to <span class="codefrag">./tmp</span>. If the value is not 
an absolute path,
+        it is prepended with task's working directory. Otherwise, it is
+        directly assigned. The directory will be created if it doesn't exist.
+        Then, the child java tasks are executed with option
+        <span class="codefrag">-Djava.io.tmpdir='the absolute path of the tmp 
dir'</span>.
+        And pipes and streaming are set with the environment variable,
+        <span class="codefrag">TMPDIR='the absolute path of the tmp 
dir'</span>). This 
+        directory is created, if <span 
class="codefrag">mapred.child.tmp</span> has the value
+        <span class="codefrag">./tmp</span> 
+</li>
         
-<li>A directory for intermediate output files</li>
+</ul>
         
-<li>The working directory of the task. 
-        And work directory has a temporary directory 
-        to create temporary files</li>
+</li>
         
 </ul>
         
 </li>
         
 </ul>
+<p>The following properties are localized in the job configuration 
+         for each task's execution: </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+          
+<tr>
+<th colspan="1" rowspan="1">Name</th><th colspan="1" rowspan="1">Type</th><th 
colspan="1" rowspan="1">Description</th>
+</tr>
+          
+<tr>
+<td colspan="1" rowspan="1">mapred.job.id</td><td colspan="1" 
rowspan="1">String</td><td colspan="1" rowspan="1">The job id</td>
+</tr>
+          
+<tr>
+<td colspan="1" rowspan="1">mapred.jar</td><td colspan="1" 
rowspan="1">String</td>
+              <td colspan="1" rowspan="1">job.jar location in job 
directory</td>
+</tr>
+          
+<tr>
+<td colspan="1" rowspan="1">job.local.dir</td><td colspan="1" rowspan="1"> 
String</td>
+              <td colspan="1" rowspan="1"> The job specific shared scratch 
space</td>
+</tr>
+          
+<tr>
+<td colspan="1" rowspan="1">mapred.tip.id</td><td colspan="1" rowspan="1"> 
String</td>
+              <td colspan="1" rowspan="1"> The task id</td>
+</tr>
+          
+<tr>
+<td colspan="1" rowspan="1">mapred.task.id</td><td colspan="1" rowspan="1"> 
String</td>
+              <td colspan="1" rowspan="1"> The task attempt id</td>
+</tr>
+          
+<tr>
+<td colspan="1" rowspan="1">mapred.task.is.map</td><td colspan="1" 
rowspan="1"> boolean </td>
+              <td colspan="1" rowspan="1">Is this a map task</td>
+</tr>
+          
+<tr>
+<td colspan="1" rowspan="1">mapred.task.partition</td><td colspan="1" 
rowspan="1"> int </td>
+              <td colspan="1" rowspan="1">The id of the task within the 
job</td>
+</tr>
+          
+<tr>
+<td colspan="1" rowspan="1">map.input.file</td><td colspan="1" rowspan="1"> 
String</td>
+              <td colspan="1" rowspan="1"> The filename that the map is 
reading from</td>
+</tr>
+          
+<tr>
+<td colspan="1" rowspan="1">map.input.start</td><td colspan="1" rowspan="1"> 
long</td>
+              <td colspan="1" rowspan="1"> The offset of the start of the map 
input split</td>
+</tr>
+          
+<tr>
+<td colspan="1" rowspan="1">map.input.length </td><td colspan="1" 
rowspan="1">long </td>
+              <td colspan="1" rowspan="1">The number of bytes in the map input 
split</td>
+</tr>
+          
+<tr>
+<td colspan="1" rowspan="1">mapred.work.output.dir</td><td colspan="1" 
rowspan="1"> String </td>
+              <td colspan="1" rowspan="1">The task's temporary output 
directory</td>
+</tr>
+        
+</table>
+<p>The standard output (stdout) and error (stderr) streams of the task 
+        are read by the TaskTracker and logged to 
+        <span class="codefrag">${HADOOP_LOG_DIR}/userlogs</span>
+</p>
 <p>The <a href="#DistributedCache">DistributedCache</a> can also be used
         as a rudimentary software distribution mechanism for use in the map 
         and/or reduce tasks. It can be used to distribute both jars and 
@@ -1597,7 +1725,7 @@
         loaded via <a 
href="http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#loadLibrary(java.lang.String)">
         System.loadLibrary</a> or <a 
href="http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#load(java.lang.String)">
         System.load</a>.</p>
-<a name="N108FB"></a><a name="Job+Submission+and+Monitoring"></a>
+<a name="N109EB"></a><a name="Job+Submission+and+Monitoring"></a>
 <h3 class="h4">Job Submission and Monitoring</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/JobClient.html">
@@ -1658,7 +1786,7 @@
 <p>Normally the user creates the application, describes various facets 
         of the job via <span class="codefrag">JobConf</span>, and then uses 
the 
         <span class="codefrag">JobClient</span> to submit the job and monitor 
its progress.</p>
-<a name="N1095B"></a><a name="Job+Control"></a>
+<a name="N10A4B"></a><a name="Job+Control"></a>
 <h4>Job Control</h4>
 <p>Users may need to chain map-reduce jobs to accomplish complex
           tasks which cannot be done via a single map-reduce job. This is 
fairly
@@ -1694,7 +1822,7 @@
             </li>
           
 </ul>
-<a name="N10985"></a><a name="Job+Input"></a>
+<a name="N10A75"></a><a name="Job+Input"></a>
 <h3 class="h4">Job Input</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/InputFormat.html">
@@ -1742,7 +1870,7 @@
         appropriate <span class="codefrag">CompressionCodec</span>. However, 
it must be noted that
         compressed files with the above extensions cannot be <em>split</em> 
and 
         each compressed file is processed in its entirety by a single 
mapper.</p>
-<a name="N109EF"></a><a name="InputSplit"></a>
+<a name="N10ADF"></a><a name="InputSplit"></a>
 <h4>InputSplit</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/InputSplit.html">
@@ -1756,7 +1884,7 @@
           FileSplit</a> is the default <span 
class="codefrag">InputSplit</span>. It sets 
           <span class="codefrag">map.input.file</span> to the path of the 
input file for the
           logical split.</p>
-<a name="N10A14"></a><a name="RecordReader"></a>
+<a name="N10B04"></a><a name="RecordReader"></a>
 <h4>RecordReader</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/RecordReader.html">
@@ -1768,7 +1896,7 @@
           for processing. <span class="codefrag">RecordReader</span> thus 
assumes the 
           responsibility of processing record boundaries and presents the 
tasks 
           with keys and values.</p>
-<a name="N10A37"></a><a name="Job+Output"></a>
+<a name="N10B27"></a><a name="Job+Output"></a>
 <h3 class="h4">Job Output</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/OutputFormat.html">
@@ -1793,7 +1921,7 @@
 <p>
 <span class="codefrag">TextOutputFormat</span> is the default 
         <span class="codefrag">OutputFormat</span>.</p>
-<a name="N10A60"></a><a name="Task+Side-Effect+Files"></a>
+<a name="N10B50"></a><a name="Task+Side-Effect+Files"></a>
 <h4>Task Side-Effect Files</h4>
 <p>In some applications, component tasks need to create and/or write to
           side-files, which differ from the actual job-output files.</p>
@@ -1832,7 +1960,7 @@
 <p>The entire discussion holds true for maps of jobs with 
            reducer=NONE (i.e. 0 reduces) since output of the map, in that 
case, 
            goes directly to HDFS.</p>
-<a name="N10AA8"></a><a name="RecordWriter"></a>
+<a name="N10B98"></a><a name="RecordWriter"></a>
 <h4>RecordWriter</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/RecordWriter.html">
@@ -1840,9 +1968,9 @@
           pairs to an output file.</p>
 <p>RecordWriter implementations write the job outputs to the 
           <span class="codefrag">FileSystem</span>.</p>
-<a name="N10ABF"></a><a name="Other+Useful+Features"></a>
+<a name="N10BAF"></a><a name="Other+Useful+Features"></a>
 <h3 class="h4">Other Useful Features</h3>
-<a name="N10AC5"></a><a name="Counters"></a>
+<a name="N10BB5"></a><a name="Counters"></a>
 <h4>Counters</h4>
 <p>
 <span class="codefrag">Counters</span> represent global counters, defined 
either by 
@@ -1856,7 +1984,7 @@
           Reporter.incrCounter(Enum, long)</a> in the <span 
class="codefrag">map</span> and/or 
           <span class="codefrag">reduce</span> methods. These counters are 
then globally 
           aggregated by the framework.</p>
-<a name="N10AF0"></a><a name="DistributedCache"></a>
+<a name="N10BE0"></a><a name="DistributedCache"></a>
 <h4>DistributedCache</h4>
 <p>
 <a href="api/org/apache/hadoop/filecache/DistributedCache.html">
@@ -1890,7 +2018,7 @@
           <a 
href="api/org/apache/hadoop/filecache/DistributedCache.html#createSymlink(org.apache.hadoop.conf.Configuration)">
           DistributedCache.createSymlink(Configuration)</a> api. Files 
           have <em>execution permissions</em> set.</p>
-<a name="N10B2E"></a><a name="Tool"></a>
+<a name="N10C1E"></a><a name="Tool"></a>
 <h4>Tool</h4>
 <p>The <a href="api/org/apache/hadoop/util/Tool.html">Tool</a> 
           interface supports the handling of generic Hadoop command-line 
options.
@@ -1930,7 +2058,7 @@
             </span>
           
 </p>
-<a name="N10B60"></a><a name="IsolationRunner"></a>
+<a name="N10C50"></a><a name="IsolationRunner"></a>
 <h4>IsolationRunner</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/IsolationRunner.html">
@@ -1954,7 +2082,7 @@
 <p>
 <span class="codefrag">IsolationRunner</span> will run the failed task in a 
single 
           jvm, which can be in the debugger, over precisely the same input.</p>
-<a name="N10B93"></a><a name="Debugging"></a>
+<a name="N10C83"></a><a name="Debugging"></a>
 <h4>Debugging</h4>
 <p>Map/Reduce framework provides a facility to run user-provided 
           scripts for debugging. When map/reduce task fails, user can run 
@@ -1965,7 +2093,7 @@
 <p> In the following sections we discuss how to submit debug script
           along with the job. For submitting debug script, first it has to
           distributed. Then the script has to supplied in Configuration. </p>
-<a name="N10B9F"></a><a name="How+to+distribute+script+file%3A"></a>
+<a name="N10C8F"></a><a name="How+to+distribute+script+file%3A"></a>
 <h5> How to distribute script file: </h5>
 <p>
           To distribute  the debug script file, first copy the file to the dfs.
@@ -1988,7 +2116,7 @@
           <a 
href="api/org/apache/hadoop/filecache/DistributedCache.html#createSymlink(org.apache.hadoop.conf.Configuration)">
           DistributedCache.createSymLink(Configuration) </a> api.
           </p>
-<a name="N10BB8"></a><a name="How+to+submit+script%3A"></a>
+<a name="N10CA8"></a><a name="How+to+submit+script%3A"></a>
 <h5> How to submit script: </h5>
 <p> A quick way to submit debug script is to set values for the 
           properties "mapred.map.task.debug.script" and 
@@ -2012,17 +2140,17 @@
 <span class="codefrag">$script $stdout $stderr $syslog $jobconf $program 
</span>  
           
 </p>
-<a name="N10BDA"></a><a name="Default+Behavior%3A"></a>
+<a name="N10CCA"></a><a name="Default+Behavior%3A"></a>
 <h5> Default Behavior: </h5>
 <p> For pipes, a default script is run to process core dumps under
           gdb, prints stack trace and gives info about running threads. </p>
-<a name="N10BE5"></a><a name="JobControl"></a>
+<a name="N10CD5"></a><a name="JobControl"></a>
 <h4>JobControl</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/jobcontrol/package-summary.html">
           JobControl</a> is a utility which encapsulates a set of Map-Reduce 
jobs
           and their dependencies.</p>
-<a name="N10BF2"></a><a name="Data+Compression"></a>
+<a name="N10CE2"></a><a name="Data+Compression"></a>
 <h4>Data Compression</h4>
 <p>Hadoop Map-Reduce provides facilities for the application-writer to
           specify compression for both intermediate map-outputs and the
@@ -2036,7 +2164,7 @@
           codecs for reasons of both performance (zlib) and non-availability of
           Java libraries (lzo). More details on their usage and availability 
are
           available <a href="native_libraries.html">here</a>.</p>
-<a name="N10C12"></a><a name="Intermediate+Outputs"></a>
+<a name="N10D02"></a><a name="Intermediate+Outputs"></a>
 <h5>Intermediate Outputs</h5>
 <p>Applications can control compression of intermediate map-outputs
             via the 
@@ -2057,7 +2185,7 @@
             <a 
href="api/org/apache/hadoop/mapred/JobConf.html#setMapOutputCompressionType(org.apache.hadoop.io.SequenceFile.CompressionType)">
             
JobConf.setMapOutputCompressionType(SequenceFile.CompressionType)</a> 
             api.</p>
-<a name="N10C3E"></a><a name="Job+Outputs"></a>
+<a name="N10D2E"></a><a name="Job+Outputs"></a>
 <h5>Job Outputs</h5>
 <p>Applications can control compression of job-outputs via the
             <a 
href="api/org/apache/hadoop/mapred/OutputFormatBase.html#setCompressOutput(org.apache.hadoop.mapred.JobConf,%20boolean)">
@@ -2077,7 +2205,7 @@
 </div>
 
     
-<a name="N10C6D"></a><a name="Example%3A+WordCount+v2.0"></a>
+<a name="N10D5D"></a><a name="Example%3A+WordCount+v2.0"></a>
 <h2 class="h3">Example: WordCount v2.0</h2>
 <div class="section">
 <p>Here is a more complete <span class="codefrag">WordCount</span> which uses 
many of the
@@ -2087,7 +2215,7 @@
       <a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
       <a 
href="quickstart.html#Fully-Distributed+Operation">fully-distributed</a> 
       Hadoop installation.</p>
-<a name="N10C87"></a><a name="Source+Code-N10C87"></a>
+<a name="N10D77"></a><a name="Source+Code-N10D77"></a>
 <h3 class="h4">Source Code</h3>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
           
@@ -3297,7 +3425,7 @@
 </tr>
         
 </table>
-<a name="N113E9"></a><a name="Sample+Runs"></a>
+<a name="N114D9"></a><a name="Sample+Runs"></a>
 <h3 class="h4">Sample Runs</h3>
 <p>Sample text-files as input:</p>
 <p>
@@ -3465,7 +3593,7 @@
 <br>
         
 </p>
-<a name="N114BD"></a><a name="Highlights"></a>
+<a name="N115AD"></a><a name="Highlights"></a>
 <h3 class="h4">Highlights</h3>
 <p>The second version of <span class="codefrag">WordCount</span> improves upon 
the 
         previous one by using some features offered by the Map-Reduce 
framework:
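
Note on the mapred.child.tmp text added in the hunk at -1542 above: the
documented rule is that a value which is not an absolute path is prepended
with the task's working directory, and an absolute value is used as given.
The following is a minimal standalone sketch of that rule only. It is not
the TaskTracker code; the class name ChildTmpDir, the method resolve, and
the example paths are hypothetical, and it assumes a Unix-like file system
and a Java version that provides java.nio.file.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration of the documented mapred.child.tmp rule.
public class ChildTmpDir {

    // childTmp:    the value of mapred.child.tmp (documented default "./tmp")
    // taskWorkDir: the task's current working directory
    static Path resolve(String childTmp, String taskWorkDir) {
        Path tmp = Paths.get(childTmp);
        if (!tmp.isAbsolute()) {
            // Relative value: prepend the task's working directory.
            tmp = Paths.get(taskWorkDir).resolve(tmp);
        }
        // Absolute value: assigned directly. normalize() only drops "." segments.
        return tmp.normalize();
    }

    public static void main(String[] args) {
        // Example working directory; an assumption, not a path taken from a real node.
        String workDir = args.length > 0 ? args[0] : "/local/work";
        System.out.println("-Djava.io.tmpdir=" + resolve("./tmp", workDir));
        System.out.println("TMPDIR=" + resolve("/var/tmp/hadoop", workDir));
    }
}
```

The two printed lines mirror the two ways the diff says the resolved path is
handed to the child: as a JVM option for java tasks and as an environment
variable for pipes and streaming.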

