[Hadoop Wiki] Update of "FAQ" by SomeOtherAccount

Apache Wiki Fri, 22 Oct 2010 08:43:01 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "FAQ" page has been changed by SomeOtherAccount.
http://wiki.apache.org/hadoop/FAQ?action=diff&rev1=81&rev2=82

--------------------------------------------------

  $ bin/hadoop-daemon.sh start tasktracker
  }}}
  
- == Is there an easy way to see the status and health of my cluster? ==
+ == Is there an easy way to see the status and health of a cluster? ==
  
  There are web-based interfaces to both the JobTracker (MapReduce master) and 
NameNode (HDFS master) which display status pages about the state of the entire 
system. By default, these are located at http://job.tracker.addr:50030/ and 
http://name.node.addr:50070/.
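
A minimal sketch of a reachability check for the two status pages, assuming `curl` is installed. Only the default ports (50030 and 50070) and the placeholder host names come from the text above; the helper names `status_url` and `check_status` are hypothetical and not part of Hadoop.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not shipped with Hadoop.

# Build the status-page URL for a given host and port.
status_url() {
  printf 'http://%s:%s/\n' "$1" "$2"
}

# Print UP or DOWN depending on whether the page answers within 5 seconds.
# Assumes curl is available on the PATH.
check_status() {
  url=$(status_url "$1" "$2")
  if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
    echo "UP   $url"
  else
    echo "DOWN $url"
  fi
}

# Usage (placeholder hosts from the FAQ; substitute real addresses):
#   check_status job.tracker.addr 50030   # JobTracker (MapReduce master)
#   check_status name.node.addr 50070     # NameNode (HDFS master)
```

A check like this only shows that the web interface responds; the status pages themselves remain the place to read the actual state of the cluster.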
  
@@ -87, +87 @@

  
  If you find something that you wish were done better, and know how to fix it, 
read HowToContribute, and contribute a patch.
  
- == I am seeing connection refused in my logs.  How do I troubleshoot this? ==
+ == I am seeing connection refused in the logs.  How do I troubleshoot this? ==
  
  See ConnectionRefused .
  
  = MapReduce =
  
- == Do I have to write my application in Java? ==
+ == Do I have to write my job in Java? ==
  
  No.  There are several ways to incorporate non-Java code.
  
@@ -105, +105 @@

  
  The distributed cache is used to distribute large read-only files that are 
needed by map/reduce jobs to the cluster. The framework will copy the necessary 
files from a url (either hdfs: or http:) on to the slave node before any tasks 
for the job are executed on that node. The files are only copied once per job 
and so should not be modified by the application.
  
- == Can I create/write-to hdfs files directly from my map/reduce tasks? ==
+ == Can I create/write-to hdfs files directly from map/reduce tasks? ==
  
  Yes. (Clearly, you want this since you need to create/write-to files other 
than the output-file written out by 
[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/OutputCollector.html|OutputCollector]].)
  
@@ -125, +125 @@

  
  The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 
reduces) since output of the map, in that case, goes directly to hdfs.
  
- == How do I get each of my maps to work on one complete input-file and not allow the framework to split-up my files? ==
+ == How do I get each of a job's maps to work on one complete input-file and not allow the framework to split-up the files? ==
  
  Essentially a job's input is represented by the 
[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html|InputFormat]] (interface) / [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html|FileInputFormat]] (base class).
  
@@ -181, +181 @@

  hadoop job -kill JOBID
  }}}
  
- == How do I limit the number of concurrent tasks my job may have running total at a time? ==
+ == How do I limit the number of concurrent tasks a job may have running total at a time? ==
  
+ See LimitingTaskSlotUsage.
- Typically when this question is asked, it is because a job is referencing something external to Hadoop that has some sort of limit on it, such as reading or writing from a database.  In Hadoop terms, we call this a 'side-effect'.
- 
- One of the general assumptions of the framework is that there are not any side-effects. All tasks are expected to be restartable and a side-effect typically goes against the grain of this rule.
- 
- If a task absolutely must break the rules, there are a few things one can do:
- 
- * Deploy ZooKeeper and use it as a persistent lock to keep track of how many tasks are running concurrently
- * Use a scheduler with a maximum task-per-queue feature and submit the job to that queue
- 
- == How do I limit the number of concurrent tasks my job may have running on a given node at a time? ==
- 
- The CapacityScheduler in 0.21 has a feature whereby one may use RAM-per-task to limit how many slots a given task takes.  By careful use of this feature, one may limit how many concurrent tasks on a given node a job may take.
  
  = HDFS =
