Author: ddas
Date: Mon Oct  6 07:38:38 2008
New Revision: 702166

URL: http://svn.apache.org/viewvc?rev=702166&view=rev
Log:
Merge -r 702163:702164 from trunk onto 0.19 branch. Fixes HADOOP-4301.

Modified:
    hadoop/core/branches/branch-0.19/CHANGES.txt
    hadoop/core/branches/branch-0.19/docs/changes.html
    hadoop/core/branches/branch-0.19/docs/hadoop-default.html
    hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html
    hadoop/core/branches/branch-0.19/docs/mapred_tutorial.pdf
    hadoop/core/branches/branch-0.19/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml
    hadoop/core/branches/branch-0.19/src/docs/src/documentation/content/xdocs/site.xml

Modified: hadoop/core/branches/branch-0.19/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/CHANGES.txt?rev=702166&r1=702165&r2=702166&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/CHANGES.txt (original)
+++ hadoop/core/branches/branch-0.19/CHANGES.txt Mon Oct  6 07:38:38 2008
@@ -431,6 +431,9 @@
     incrementing the task attempt numbers by 1000 when the job restarts.
     (Amar Kamat via omalley)
 
+    HADOOP-4301. Adds forrest doc for the skip bad records feature.
+    (Sharad Agarwal via ddas)
+
   OPTIMIZATIONS
 
     HADOOP-3556. Removed lock contention in MD5Hash by changing the 

Modified: hadoop/core/branches/branch-0.19/docs/changes.html
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/docs/changes.html?rev=702166&r1=702165&r2=702166&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/docs/changes.html (original)
+++ hadoop/core/branches/branch-0.19/docs/changes.html Mon Oct  6 07:38:38 2008
@@ -36,7 +36,7 @@
     function collapse() {
       for (var i = 0; i < document.getElementsByTagName("ul").length; i++) {
         var list = document.getElementsByTagName("ul")[i];
-        if (list.id != 'release_0.19.0_-_unreleased_' && list.id != 
'release_0.18.1_-_2008-09-17_') {
+        if (list.id != 'release_0.19.0_-_unreleased_' && list.id != 
'release_0.18.2_-_unreleased_') {
           list.style.display = "none";
         }
       }
@@ -56,7 +56,7 @@
 </a></h2>
 <ul id="release_0.19.0_-_unreleased_">
   <li><a 
href="javascript:toggleList('release_0.19.0_-_unreleased_._incompatible_changes_')">
  INCOMPATIBLE CHANGES
-</a>&nbsp;&nbsp;&nbsp;(18)
+</a>&nbsp;&nbsp;&nbsp;(20)
     <ol id="release_0.19.0_-_unreleased_._incompatible_changes_">
       <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3595";>HADOOP-3595</a>. Remove 
deprecated methods for mapred.combine.once
 functionality, which was necessary to providing backwards
@@ -110,10 +110,15 @@
 DFS Used%: DFS used space/Present Capacity<br />(Suresh Srinivas via 
hairong)</li>
       <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3938";>HADOOP-3938</a>. Disk 
space quotas for HDFS. This is similar to namespace
 quotas in 0.18.<br />(rangadi)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4293">HADOOP-4293</a>. Make 
Configuration Writable and remove unreleased
+WritableJobConf. Configuration.write is renamed to writeXml.<br 
/>(omalley)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4281">HADOOP-4281</a>. Change 
dfsadmin to report available disk space in a format
+consistent with the web interface as defined in <a 
href="http://issues.apache.org/jira/browse/HADOOP-2816">HADOOP-2816</a>.<br 
/>(Suresh
+Srinivas via cdouglas)</li>
     </ol>
   </li>
   <li><a 
href="javascript:toggleList('release_0.19.0_-_unreleased_._new_features_')">  
NEW FEATURES
-</a>&nbsp;&nbsp;&nbsp;(39)
+</a>&nbsp;&nbsp;&nbsp;(40)
     <ol id="release_0.19.0_-_unreleased_._new_features_">
       <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3341";>HADOOP-3341</a>. Allow 
streaming jobs to specify the field separator for map
 and reduce input and output. The new configuration values are:
@@ -195,13 +200,16 @@
      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3019">HADOOP-3019</a>. A new 
library to support total order partitions.<br />(cdouglas via omalley)</li>
      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3924">HADOOP-3924</a>. Added 
a 'KILLED' job status.<br />(Subramaniam Krishnan via
 acmurthy)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-2421">HADOOP-2421</a>.  Add 
jdiff output to documentation, listing all API
+changes from the prior release.<br />(cutting)</li>
     </ol>
   </li>
   <li><a 
href="javascript:toggleList('release_0.19.0_-_unreleased_._improvements_')">  
IMPROVEMENTS
-</a>&nbsp;&nbsp;&nbsp;(68)
+</a>&nbsp;&nbsp;&nbsp;(71)
     <ol id="release_0.19.0_-_unreleased_._improvements_">
      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4205">HADOOP-4205</a>. hive: 
metastore and ql to use the refactored SerDe library.<br />(zshao)</li>
-      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4106">HADOOP-4106</a>. 
libhdfs: add time, permission and user attribute support (part 2).<br />(Pete 
Wyckoff through zshao)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4106">HADOOP-4106</a>. 
libhdfs: add time, permission and user attribute support
+(part 2).<br />(Pete Wyckoff through zshao)</li>
      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4104">HADOOP-4104</a>. 
libhdfs: add time, permission and user attribute support.<br />(Pete Wyckoff 
through zshao)</li>
      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3908">HADOOP-3908</a>. 
libhdfs: better error message if libhdfs.so doesn't exist.<br />(Pete Wyckoff 
through zshao)</li>
      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3732">HADOOP-3732</a>. Delay 
initialization of datanode block verification till
@@ -230,8 +238,6 @@
 it pluggable.<br />(Tom White and Brice Arnould via omalley)</li>
      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3756">HADOOP-3756</a>. Minor. 
Remove unused dfs.client.buffer.dir from
 hadoop-default.xml.<br />(rangadi)</li>
-      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3327">HADOOP-3327</a>. Treats 
connection and read timeouts differently in the
-shuffle and the backoff logic is dependent on the type of timeout.<br />(Jothi 
Padmanabhan via ddas)</li>
      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3747">HADOOP-3747</a>. Adds 
counter support for MultipleOutputs.<br />(Alejandro Abdelnur via ddas)</li>
      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3169">HADOOP-3169</a>. 
LeaseChecker daemon should not be started in DFSClient
 constructor. (TszWo (Nicholas), SZE via hairong)
@@ -321,6 +327,13 @@
 connection is closed and also remove an undesirable exception when
 a client is stopped while there is no pending RPC request.<br />(hairong)</li>
      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4227">HADOOP-4227</a>. Remove 
the deprecated class org.apache.hadoop.fs.ShellCommand.<br />(szetszwo)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4006">HADOOP-4006</a>. Clean 
up FSConstants and move some of the constants to
+better places.<br />(Sanjay Radia via rangadi)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4279">HADOOP-4279</a>. Trace 
the seeds of random sequences in append unit tests to
+make intermittent failures reproducible.<br />(szetszwo via cdouglas)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4209">HADOOP-4209</a>. Remove 
the change to the format of task attempt id by
+incrementing the task attempt numbers by 1000 when the job restarts.<br 
/>(Amar Kamat via omalley)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4301">HADOOP-4301</a>. Adds 
forrest doc for the skip bad records feature.<br />(Sharad Agarwal via 
ddas)</li>
     </ol>
   </li>
   <li><a 
href="javascript:toggleList('release_0.19.0_-_unreleased_._optimizations_')">  
OPTIMIZATIONS
@@ -347,7 +360,7 @@
     </ol>
   </li>
   <li><a 
href="javascript:toggleList('release_0.19.0_-_unreleased_._bug_fixes_')">  BUG 
FIXES
-</a>&nbsp;&nbsp;&nbsp;(88)
+</a>&nbsp;&nbsp;&nbsp;(108)
     <ol id="release_0.19.0_-_unreleased_._bug_fixes_">
       <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3563";>HADOOP-3563</a>.  
Refactor the distributed upgrade code so that it is
 easier to identify datanode and namenode related code.<br />(dhruba)</li>
@@ -511,11 +524,71 @@
 query.<br />(Raghotham Murthy via dhruba)</li>
       <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4090";>HADOOP-4090</a>. The 
hive scripts pick up hadoop from HADOOP_HOME
 and then the path.<br />(Raghotham Murthy via dhruba)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4242">HADOOP-4242</a>. Remove 
extra ";" in FSDirectory that blocks compilation
+in some IDEs.<br />(szetszwo via omalley)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4249">HADOOP-4249</a>. Fix 
eclipse path to include the hsqldb.jar.<br />(szetszwo via
+omalley)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4247">HADOOP-4247</a>. Move 
InputSampler into org.apache.hadoop.mapred.lib, so that
+examples.jar doesn't depend on tools.jar.<br />(omalley)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4269">HADOOP-4269</a>. Fix 
the deprecation of LineReader by extending the new class
+into the old name and deprecating it. Also update the tests to test the
+new class.<br />(cdouglas via omalley)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4280">HADOOP-4280</a>. Fix 
conversions between seconds in C and milliseconds in
+Java for access times for files.<br />(Pete Wyckoff via rangadi)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4254">HADOOP-4254</a>. 
-setSpaceQuota command does not convert "TB" extension to
+terabytes properly. Implementation now uses StringUtils for parsing this.<br 
/>(Raghu Angadi)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4259">HADOOP-4259</a>. 
Findbugs should run over tools.jar also.<br />(cdouglas via
+omalley)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4275">HADOOP-4275</a>. Move 
public method isJobValidName from JobID to a private
+method in JobTracker.<br />(omalley)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4173">HADOOP-4173</a>. Fix 
+failures in TestProcfsBasedProcessTree and
+TestTaskTrackerMemoryManager tests. ProcfsBasedProcessTree and
+memory management in TaskTracker are disabled on Windows.<br />(Vinod K V via 
rangadi)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4189">HADOOP-4189</a>. Fixes 
the history blocksize &amp; intertracker protocol version
+issues introduced as part of <a 
href="http://issues.apache.org/jira/browse/HADOOP-3245">HADOOP-3245</a>.<br 
/>(Amar Kamat via ddas)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4190">HADOOP-4190</a>. Fixes 
the backward compatibility issue with Job History
+introduced by <a 
href="http://issues.apache.org/jira/browse/HADOOP-3245">HADOOP-3245</a> and <a 
href="http://issues.apache.org/jira/browse/HADOOP-2403">HADOOP-2403</a>.<br 
/>(Amar Kamat via ddas)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4237">HADOOP-4237</a>. Fixes 
the TestStreamingBadRecords.testNarrowDown testcase.<br />(Sharad Agarwal via 
ddas)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4274">HADOOP-4274</a>. 
Capacity scheduler accidentally modifies the underlying
+data structures when browsing the job lists.<br />(Hemanth Yamijala via 
omalley)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4309">HADOOP-4309</a>. Fix 
eclipse-plugin compilation.<br />(cdouglas)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4232">HADOOP-4232</a>. Fix 
race condition in JVM reuse when multiple slots become
+free.<br />(ddas via acmurthy)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4302">HADOOP-4302</a>. Fix a 
race condition in TestReduceFetch that can yield false
+negatives.<br />(cdouglas)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3942">HADOOP-3942</a>. Update 
distcp documentation to include features introduced in
+<a href="http://issues.apache.org/jira/browse/HADOOP-3873">HADOOP-3873</a>, <a 
href="http://issues.apache.org/jira/browse/HADOOP-3939">HADOOP-3939</a>. (Tsz 
Wo (Nicholas), SZE via cdouglas)
+</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4257">HADOOP-4257</a>. The 
DFS client should pick only one datanode as the candidate
+to initiate lease recovery.  (Tsz Wo (Nicholas), SZE via dhruba)
+</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4319">HADOOP-4319</a>. 
fuse-dfs dfs_read function returns as many bytes as it is
+told to read unless end-of-file is reached.<br />(Pete Wyckoff via 
dhruba)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4246">HADOOP-4246</a>. Ensure 
we have the correct lower bound on the number of
+retries for fetching map-outputs; also fixed the case where the reducer
+automatically killed itself when too many unique map-outputs could not be
+fetched for small jobs.<br />(Amareshwari Sri Ramadasu via acmurthy)</li>
     </ol>
   </li>
 </ul>
-<h2><a href="javascript:toggleList('release_0.18.1_-_2008-09-17_')">Release 
0.18.1 - 2008-09-17
+<h2><a href="javascript:toggleList('release_0.18.2_-_unreleased_')">Release 
0.18.2 - Unreleased
 </a></h2>
+<ul id="release_0.18.2_-_unreleased_">
+  <li><a 
href="javascript:toggleList('release_0.18.2_-_unreleased_._bug_fixes_')">  BUG 
FIXES
+</a>&nbsp;&nbsp;&nbsp;(3)
+    <ol id="release_0.18.2_-_unreleased_._bug_fixes_">
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4116";>HADOOP-4116</a>. 
Balancer should provide better resource management.<br />(hairong)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-3614";>HADOOP-3614</a>. Fix a 
bug that Datanode may use an old GenerationStamp to get
+meta file.<br />(szetszwo)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4314";>HADOOP-4314</a>. 
Simulated datanodes should not include blocks that are still
+being written in their block report.<br />(Raghu Angadi)</li>
+    </ol>
+  </li>
+</ul>
+<h2><a href="javascript:toggleList('older')">Older Releases</a></h2>
+<ul id="older">
+<h3><a href="javascript:toggleList('release_0.18.1_-_2008-09-17_')">Release 
0.18.1 - 2008-09-17
+</a></h3>
 <ul id="release_0.18.1_-_2008-09-17_">
   <li><a 
href="javascript:toggleList('release_0.18.1_-_2008-09-17_._improvements_')">  
IMPROVEMENTS
 </a>&nbsp;&nbsp;&nbsp;(1)
@@ -540,8 +613,6 @@
     </ol>
   </li>
 </ul>
-<h2><a href="javascript:toggleList('older')">Older Releases</a></h2>
-<ul id="older">
 <h3><a href="javascript:toggleList('release_0.18.0_-_2008-08-19_')">Release 
0.18.0 - 2008-08-19
 </a></h3>
 <ul id="release_0.18.0_-_2008-08-19_">
@@ -1085,6 +1156,21 @@
     </ol>
   </li>
 </ul>
+<h3><a href="javascript:toggleList('release_0.17.3_-_unreleased_')">Release 
0.17.3 - Unreleased
+</a></h3>
+<ul id="release_0.17.3_-_unreleased_">
+  <li><a 
href="javascript:toggleList('release_0.17.3_-_unreleased_._bug_fixes_')">  BUG 
FIXES
+</a>&nbsp;&nbsp;&nbsp;(4)
+    <ol id="release_0.17.3_-_unreleased_._bug_fixes_">
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4277";>HADOOP-4277</a>. 
Checksum verification was mistakenly disabled for
+LocalFileSystem.<br />(Raghu Angadi)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4271";>HADOOP-4271</a>. 
Checksum input stream can sometimes return invalid
+data to the user.<br />(Ning Li via rangadi)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4318";>HADOOP-4318</a>. DistCp 
should use absolute paths for cleanup.<br />(szetszwo)</li>
+      <li><a 
href="http://issues.apache.org/jira/browse/HADOOP-4326";>HADOOP-4326</a>. 
ChecksumFileSystem does not override create(...) correctly.<br />(szetszwo)</li>
+    </ol>
+  </li>
+</ul>
 <h3><a href="javascript:toggleList('release_0.17.2_-_2008-08-11_')">Release 
0.17.2 - 2008-08-11
 </a></h3>
 <ul id="release_0.17.2_-_2008-08-11_">

Modified: hadoop/core/branches/branch-0.19/docs/hadoop-default.html
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/docs/hadoop-default.html?rev=702166&r1=702165&r2=702166&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/docs/hadoop-default.html (original)
+++ hadoop/core/branches/branch-0.19/docs/hadoop-default.html Mon Oct  6 
07:38:38 2008
@@ -442,12 +442,15 @@
 </tr>
 <tr>
 <td><a 
name="mapred.tasktracker.taskmemorymanager.monitoring-interval">mapred.tasktracker.taskmemorymanager.monitoring-interval</a></td><td>5000</td><td>The
 interval, in milliseconds, for which the tasktracker waits
-   between two cycles of monitoring its tasks' memory usage.</td>
+   between two cycles of monitoring its tasks' memory usage. Used only if
+   tasks' memory management is enabled via mapred.tasktracker.tasks.maxmemory.
+   </td>
 </tr>
 <tr>
 <td><a 
name="mapred.tasktracker.procfsbasedprocesstree.sleeptime-before-sigkill">mapred.tasktracker.procfsbasedprocesstree.sleeptime-before-sigkill</a></td><td>5000</td><td>The
 time, in milliseconds, the tasktracker waits for sending a
   SIGKILL to a process that has overrun memory limits, after it has been sent
-  a SIGTERM.</td>
+  a SIGTERM. Used only if tasks' memory management is enabled via
+  mapred.tasktracker.tasks.maxmemory.</td>
 </tr>
 <tr>
 <td><a name="mapred.map.tasks">mapred.map.tasks</a></td><td>2</td><td>The 
default number of map tasks per job.  Typically set
@@ -467,15 +470,10 @@
   </td>
 </tr>
 <tr>
-<td><a 
name="mapred.jobtracker.job.history.block.size">mapred.jobtracker.job.history.block.size</a></td><td>0</td><td>The
 block size of the job history file. Since the job recovery
+<td><a 
name="mapred.jobtracker.job.history.block.size">mapred.jobtracker.job.history.block.size</a></td><td>3145728&gt;</td><td>The
 block size of the job history file. Since the job recovery
                uses job history, its important to dump job history to disk as 
-               soon as possible.
-  </td>
-</tr>
-<tr>
-<td><a 
name="mapred.jobtracker.job.history.buffer.size">mapred.jobtracker.job.history.buffer.size</a></td><td>4096</td><td>The
 buffer size for the job history file. Since the job 
-               recovery uses job history, its important to frequently flush 
the 
-               job history to disk. This will minimize the loss in recovery.
+               soon as possible. Note that this is an expert-level parameter.
+               The default value is set to 3 MB.
   </td>
 </tr>
 <tr>
@@ -914,7 +912,9 @@
        tasks. Any task scheduled on this tasktracker is guaranteed and 
constrained
         to use a share of this amount. Any task exceeding its share will be 
        killed. If set to -1, this functionality is disabled, and 
-       mapred.task.maxmemory is ignored.
+       mapred.task.maxmemory is ignored. Further, it will be enabled only on 
the
+       systems where org.apache.hadoop.util.ProcfsBasedProcessTree is 
available,
+       i.e., at present only on Linux.
   </td>
 </tr>
 <tr>
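
Note on the memory-management settings above: the monitoring interval and
the SIGKILL delay are consulted only when memory management is switched on
via mapred.tasktracker.tasks.maxmemory, and the feature itself works only
where org.apache.hadoop.util.ProcfsBasedProcessTree is available (at
present, Linux). A minimal sketch of the relationship, assuming the values
are set programmatically on a Configuration (in a real cluster these are
tasktracker-side settings normally placed in hadoop-site.xml; the 2 GB
figure is an arbitrary example, not a recommendation from this patch):

    import org.apache.hadoop.conf.Configuration;

    public class MemoryManagementConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Master switch: total memory, in bytes, available for tasks on a
        // tasktracker. -1 (the default) disables memory management, in
        // which case the two intervals below are ignored.
        conf.setLong("mapred.tasktracker.tasks.maxmemory",
                     2L * 1024 * 1024 * 1024);
        // Poll tasks' memory usage every 5000 ms (used only when the
        // master switch above is enabled).
        conf.setInt("mapred.tasktracker.taskmemorymanager.monitoring-interval",
                    5000);
        // Wait 5000 ms after SIGTERM before sending SIGKILL to a task
        // that has overrun its memory limit.
        conf.setInt(
            "mapred.tasktracker.procfsbasedprocesstree.sleeptime-before-sigkill",
            5000);
        // Job-history block size (bytes), also documented above; 3 MB default.
        conf.setLong("mapred.jobtracker.job.history.block.size", 3145728);
      }
    }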

Modified: hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html?rev=702166&r1=702165&r2=702166&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html (original)
+++ hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html Mon Oct  6 
07:38:38 2008
@@ -319,6 +319,9 @@
 <li>
 <a href="#Data+Compression">Data Compression</a>
 </li>
+<li>
+<a href="#Skipping+Bad+Records">Skipping Bad Records</a>
+</li>
 </ul>
 </li>
 </ul>
@@ -327,7 +330,7 @@
 <a href="#Example%3A+WordCount+v2.0">Example: WordCount v2.0</a>
 <ul class="minitoc">
 <li>
-<a href="#Source+Code-N10F30">Source Code</a>
+<a href="#Source+Code-N10F78">Source Code</a>
 </li>
 <li>
 <a href="#Sample+Runs">Sample Runs</a>
@@ -2542,10 +2545,81 @@
             <a 
href="api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html#setOutputCompressionType(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.io.SequenceFile.CompressionType)">
             SequenceFileOutputFormat.setOutputCompressionType(JobConf, 
             SequenceFile.CompressionType)</a> api.</p>
+<a name="N10F14"></a><a name="Skipping+Bad+Records"></a>
+<h4>Skipping Bad Records</h4>
+<p>Hadoop provides an optional mode of execution in which bad 
+          records are detected and skipped in further attempts. 
+          Applications can control various settings via 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html">
+          SkipBadRecords</a>.</p>
+<p>This feature can be used when map/reduce tasks crash 
+          deterministically on certain input, typically because of bugs in 
+          the map/reduce function. The usual course is to fix these bugs, 
+          but sometimes that is not possible; the bug may be in a 
+          third-party library whose source code is not available. In such 
+          cases the task never runs to completion even with multiple 
+          attempts, and the complete data for that task is lost.</p>
+<p>With this feature, only a small portion of data surrounding the 
+          bad record is lost. This may be acceptable for some applications, 
+          for example those doing statistical analysis on very large data 
+          sets. By default this feature is disabled. To turn it on, refer to 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setMapperMaxSkipRecords(org.apache.hadoop.conf.Configuration, long)">
+          SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setReducerMaxSkipGroups(org.apache.hadoop.conf.Configuration, long)">
+          SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>.
+          </p>
+<p>Skipping mode is triggered after a certain number of task 
+          failures; see 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setAttemptsToStartSkipping(org.apache.hadoop.conf.Configuration, int)">
+          SkipBadRecords.setAttemptsToStartSkipping(Configuration, int)</a>.
+          </p>
+<p>In skipping mode, the map/reduce task maintains the range of 
+          records currently being processed at all times. To maintain this 
+          range, the framework relies on the processed record counter; see 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#COUNTER_MAP_PROCESSED_RECORDS">
+          SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS</a> and 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#COUNTER_REDUCE_PROCESSED_GROUPS">
+          SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS</a>. 
+          Based on this counter, the framework knows how many records have 
+          been processed successfully by the mapper/reducer. Before passing 
+          input to the map/reduce function, the task reports this record 
+          range to the TaskTracker. If the task crashes, the TaskTracker 
+          knows the last reported range, and on further attempts that 
+          range is skipped.
+          </p>
+<p>The number of records skipped around a single bad record depends 
+          on how frequently the application increments the processed 
+          counter. It is recommended to increment the counter after 
+          processing every single record. However, in some applications 
+          this may be difficult because they batch up their processing; in 
+          that case, the framework might skip additional records 
+          surrounding the bad record. Users can reduce the number of 
+          records skipped by specifying an acceptable value using 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setMapperMaxSkipRecords(org.apache.hadoop.conf.Configuration, long)">
+          SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setReducerMaxSkipGroups(org.apache.hadoop.conf.Configuration, long)">
+          SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>. 
+          The framework tries to narrow the skipped range by employing a 
+          binary-search-like algorithm during task re-executions: the 
+          skipped range is divided into two halves and only one half is 
+          executed. Based on the subsequent failure, the framework figures 
+          out which half contains the bad record. This re-execution 
+          continues until the acceptable skipped value is met or all task 
+          attempts are exhausted. To increase the number of task attempts, 
+          use 
+          <a href="api/org/apache/hadoop/mapred/JobConf.html#setMaxMapAttempts(int)">
+          JobConf.setMaxMapAttempts(int)</a> and 
+          <a href="api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceAttempts(int)">
+          JobConf.setMaxReduceAttempts(int)</a>.
+          </p>
+<p>Skipped records are written to HDFS in SequenceFile format, which 
+          can be used for later analysis. The output path for skipped 
+          records can be changed via 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setSkipOutputPath(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path)">
+          SkipBadRecords.setSkipOutputPath(JobConf, Path)</a>.
+          </p>
 </div>
 
     
-<a name="N10F16"></a><a name="Example%3A+WordCount+v2.0"></a>
+<a name="N10F5E"></a><a name="Example%3A+WordCount+v2.0"></a>
 <h2 class="h3">Example: WordCount v2.0</h2>
 <div class="section">
 <p>Here is a more complete <span class="codefrag">WordCount</span> which uses 
many of the
@@ -2555,7 +2629,7 @@
       <a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
       <a 
href="quickstart.html#Fully-Distributed+Operation">fully-distributed</a> 
       Hadoop installation.</p>
-<a name="N10F30"></a><a name="Source+Code-N10F30"></a>
+<a name="N10F78"></a><a name="Source+Code-N10F78"></a>
 <h3 class="h4">Source Code</h3>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
           
@@ -3765,7 +3839,7 @@
 </tr>
         
 </table>
-<a name="N11692"></a><a name="Sample+Runs"></a>
+<a name="N116DA"></a><a name="Sample+Runs"></a>
 <h3 class="h4">Sample Runs</h3>
 <p>Sample text-files as input:</p>
 <p>
@@ -3933,7 +4007,7 @@
 <br>
         
 </p>
-<a name="N11766"></a><a name="Highlights"></a>
+<a name="N117AE"></a><a name="Highlights"></a>
 <h3 class="h4">Highlights</h3>
 <p>The second version of <span class="codefrag">WordCount</span> improves upon 
the 
         previous one by using some features offered by the Map/Reduce 
framework:
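
To make the Skipping Bad Records documentation in the hunk above concrete,
here is a hedged sketch of a driver and mapper using the 0.19-era mapred
API. The class names (SkippingExample, ParseMapper), the paths, and the
specific limits (one skipped record per bad record, skipping after two
failures, eight map attempts) are invented for illustration and are not
part of this patch:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkippingExample {

      // Hypothetical mapper wrapping a parser that may crash on bad input.
      public static class ParseMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, LongWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output,
                        Reporter reporter) throws IOException {
          // ... hand the record to the fragile third-party library here ...
          output.collect(value, new LongWritable(1));
          // Increment the processed-record counter after every record so the
          // framework can narrow the skipped range as tightly as possible.
          reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
              SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(SkippingExample.class);
        job.setMapperClass(ParseMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Enable skipping: accept losing at most one record per bad record.
        SkipBadRecords.setMapperMaxSkipRecords(job, 1);
        // Enter skipping mode after two ordinary task failures.
        SkipBadRecords.setAttemptsToStartSkipping(job, 2);
        // The mapper increments the processed counter itself, so turn off
        // the framework's automatic incrementing.
        SkipBadRecords.setAutoIncrMapperProcCount(job, false);
        // Allow extra attempts so the binary-search narrowing has room to work.
        job.setMaxMapAttempts(8);
        // Keep skipped records (written as SequenceFiles) for later analysis.
        SkipBadRecords.setSkipOutputPath(job, new Path("skip_output"));

        JobClient.runJob(job);
      }
    }

Because the counter here is incremented per record, at most the configured
skip width is lost around each bad record; an application that batches its
processing would leave auto-increment on and accept a wider skipped range.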

