Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "FAQ" page has been changed by ChristophSchmitz.
http://wiki.apache.org/hadoop/FAQ?action=diff&rev1=95&rev2=96

--------------------------------------------------

  $ bin/hadoop-daemon.sh start datanode
  $ bin/hadoop-daemon.sh start tasktracker
  }}}
- 
  If you are using the dfs.include/mapred.include functionality, you will also need to add the node to the dfs.include/mapred.include file, then issue {{{hadoop dfsadmin -refreshNodes}}} and {{{hadoop mradmin -refreshNodes}}} so that the NameNode and JobTracker are aware of the newly added node.
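  For example, a minimal sketch (the conf/dfs.include and conf/mapred.include paths are assumptions; use whatever files your dfs.hosts and mapred.hosts properties actually point to):
  {{{
  # Hypothetical include-file paths; adjust to your configuration.
  $ echo "newnode.example.com" >> conf/dfs.include
  $ echo "newnode.example.com" >> conf/mapred.include
  $ bin/hadoop dfsadmin -refreshNodes
  $ bin/hadoop mradmin -refreshNodes
  }}}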
  
  == Is there an easy way to see the status and health of a cluster? ==
@@ -92, +91 @@

   * 
[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html|Hadoop
 Pipes]], a [[http://www.swig.org/|SWIG]]-compatible  C++ API (non-JNI) to 
write map-reduce jobs.
  
  == How do I submit extra content (jars, static files, etc.) for my job to use at runtime? ==
- 
  The [[http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html|distributed cache]] feature is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework will copy the necessary files from a URL (either hdfs: or http:) onto the slave node before any tasks for the job are executed on that node. The files are only copied once per job and so should not be modified by the application.
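  A minimal sketch of using it from Java (the namenode address and file name below are made-up examples, not part of the FAQ):
  {{{
  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;
  
  public class CacheSetup {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Ship a read-only lookup file (hypothetical path) to the task nodes.
      DistributedCache.addCacheFile(
          new URI("hdfs://namenode:9000/data/lookup.dat"), conf);
      // ... configure and submit the job with this conf; inside a task,
      // DistributedCache.getLocalCacheFiles(conf) lists the local copies.
    }
  }
  }}}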
  
  For streaming, see the HadoopStreaming wiki for more information.
@@ -101, +99 @@

  
  == How do I get my MapReduce Java program to read the cluster's configuration and not just the defaults? ==
  The configuration property files ({core|mapred|hdfs}-site.xml) available in the various '''conf/''' directories of your Hadoop installation need to be on the '''CLASSPATH''' of your Java application for them to be found and applied. Another way of ensuring that no set configuration gets overridden by any job is to mark those properties as final; for example:
+ 
  {{{
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>400</value>
    <final>true</final>
  </property>
  }}}
- 
  Administrators commonly set configuration properties as final, as noted in the [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/conf/Configuration.html|Configuration]] API docs.
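  If changing the CLASSPATH is inconvenient, the site files can also be added to the Configuration explicitly in code; a minimal sketch (the /etc/hadoop/conf path is an assumption; substitute wherever your cluster's conf/ directory actually lives):
  {{{
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  
  public class ClusterConf {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Hypothetical locations; point these at your cluster's *-site.xml files.
      conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
      conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
      conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));
      System.out.println(conf.get("fs.default.name"));
    }
  }
  }}}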
  
  A better alternative would be to have a service serve up the cluster's configuration to you upon request, in code. [[https://issues.apache.org/jira/browse/HADOOP-5670|HADOOP-5670]] may be of some interest in this regard.
@@ -122, +120 @@

  
  With ''speculative execution'' '''on''', two instances of the same TIP (running simultaneously) could try to open and write to the same file (path) on HDFS. Hence the application writer will have to pick names that are unique per task-attempt (e.g. using the complete task-attempt id, task_200709221812_0001_m_000000_0), not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write to files directly via reduce tasks.)
  
- To get around this the framework helps the application-writer out by 
maintaining a special '''${mapred.output.dir}/_${taskid}''' sub-dir for each 
task-attempt on hdfs where the output of the reduce task-attempt goes. On 
successful completion of the task-attempt the files in the 
${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to 
${mapred.output.dir}. Of course, the framework discards the sub-directory of 
unsuccessful task-attempts. This is completely transparent to the application.
+ To get around this, the framework helps the application writer by maintaining a special '''${mapred.output.dir}/_${taskid}''' sub-directory on HDFS for each reduce task-attempt, where the output of that task-attempt goes. On successful completion of the task-attempt, the files in ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}; the framework discards the sub-directories of unsuccessful task-attempts. This is completely transparent to the application.
  
  The application writer can take advantage of this by creating any side-files required in ${mapred.output.dir} during execution of the reduce task, and the framework will move them out similarly - thus you don't have to pick unique paths per task-attempt.
  
- Fine-print: the value of ${mapred.output.dir} during execution of a 
particular task-attempt is actually ${mapred.output.dir}/_{$taskid}, not the 
value set by 
[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)|JobConf.setOutputPath]].
 ''So, just create any hdfs files you want in ${mapred.output.dir} from your 
reduce task to take advantage of this feature.''
+ Fine-print: the value of ${mapred.output.dir} during execution of a particular ''reduce'' task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)|JobConf.setOutputPath]]. ''So, just create any HDFS files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.''
+ 
+ For ''map'' task attempts, the automatic substitution of ${mapred.output.dir}/_${taskid} for ${mapred.output.dir} does not take place. You can still access the map task attempt directory, though, by using FileOutputFormat.getWorkOutputPath(TaskInputOutputContext). Files created there will be dealt with as described above.
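  For illustration, a minimal sketch of a map task creating a side-file in its work output directory, using the newer org.apache.hadoop.mapreduce API (the class and file names are made up):
  {{{
  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  
  public class SideFileMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      // Resolves to ${mapred.output.dir}/_${taskid} for this attempt.
      Path workDir = FileOutputFormat.getWorkOutputPath(context);
      Path sideFile = new Path(workDir, "side-data.txt"); // hypothetical name
      FileSystem fs = sideFile.getFileSystem(context.getConfiguration());
      FSDataOutputStream out = fs.create(sideFile);
      out.writeUTF("side output of " + context.getTaskAttemptID());
      out.close();
    }
  }
  }}}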
  
  The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 
reduces) since output of the map, in that case, goes directly to hdfs.
  
@@ -281, +281 @@

  = Platform Specific =
  == Mac OS X ==
  === Building on Mac OS X 10.6 ===
- 
  Be aware that Apache Hadoop 0.22 and earlier require Apache Forrest to build the documentation. As of Snow Leopard, Apple no longer ships the Java 1.5 that Apache Forrest requires. Java 1.5 can be restored either by copying /System/Library/Frameworks/JavaVM.framework/Versions/1.5 and 1.5.0 from a 10.5 machine, or by using a utility like Pacifist to install it from an official Apple package. http://chxor.chxo.com/post/183013153/installing-java-1-5-on-snow-leopard provides some step-by-step directions.
- 
  
  == Solaris ==
  === Why do files and directories show up as owned by DrWho, and/or why are user names missing/weird? ===
