Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for 
change notification.

The following page has been changed by Arun C Murthy:
http://wiki.apache.org/lucene-hadoop/FAQ

The comment on the change is:
Made each FAQ a link to facilitate easy sharing via emails etc.

------------------------------------------------------------------------------
  = Hadoop FAQ =
  
- == 1. What is Hadoop? ==
+ [[BR]]
+ [[Anchor(1)]]
+ '''1. [#1 What is Hadoop?]'''
  
  [http://lucene.apache.org/hadoop/ Hadoop] is a distributed computing platform 
written in Java.  It incorporates features similar to those of the 
[http://en.wikipedia.org/wiki/Google_File_System Google File System] and of 
[http://en.wikipedia.org/wiki/MapReduce MapReduce].  For some details, see 
HadoopMapReduce.
  
+ [[BR]]
+ [[Anchor(2)]]
- == 2. What platform does Hadoop run on? ==
+ '''2. [#2 What platform does Hadoop run on?]'''
  
    1. Java 1.5.x or higher, preferably from Sun
    2. Linux and Windows are the supported operating systems, but BSD and Mac 
OS/X are known to work. (Windows requires the installation of 
[http://www.cygwin.com/ Cygwin]).
  
+ 
+ [[Anchor(2.1)]]
- === 2.1 Building / Testing Hadoop on Windows ===
+ ''2.1 [#2.1 Building / Testing Hadoop on Windows]''
  
  The Hadoop build on Windows can be run from inside a Windows (not cygwin) 
command prompt window.
  
@@ -31, +37 @@

  
  other targets work similarly. I just wanted to document this because I spent 
some time trying to figure out why the ant build would not run from a cygwin 
command prompt window. If you are building/testing on Windows, and haven't 
figured it out yet, this should get you started.
  
+ 
+ [[BR]]
+ [[Anchor(3)]]
- == 3. How well does Hadoop scale? ==
+ '''3. [#3 How well does Hadoop scale?]'''
  
  Hadoop has been demonstrated on clusters of up to 2000 nodes.  Sort 
performance on 900 nodes is good (sorting 9TB of data on 900 nodes takes around 
2.25 hours) and [attachment:sort900-20070815.png improving] using these 
non-default configuration values:
  
@@ -50, +59 @@

    * `tasktracker.http.threads = 50`
    * `mapred.child.java.opts = -Xmx1024m`
  
+ 
+ [[BR]]
+ [[Anchor(4)]]
- == 4. Do I have to write my application in Java? ==
+ '''4. [#4 Do I have to write my application in Java?]'''
  
  No.  There are several ways to incorporate non-Java code.  
    * HadoopStreaming permits any shell command to be used as a map or reduce 
function. 
    * [http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/c%2B%2B/libhdfs 
libhdfs], a JNI-based C API for talking to hdfs (only).
    * 
[http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/pipes/package-summary.html
 Hadoop Pipes], a [http://www.swig.org/ SWIG]-compatible  C++ API (non-JNI) to 
write map-reduce jobs.
  
+ 
+ [[BR]]
+ [[Anchor(5)]]
- == 5. How can I help to make Hadoop better? ==
+ '''5. [#5 How can I help to make Hadoop better?]'''
  
  If you have trouble figuring how to use Hadoop, then, once you've figured 
something out (perhaps with the help of the 
[http://lucene.apache.org/hadoop/mailing_lists.html mailing lists]), pass that 
knowledge on to others by adding something to this wiki.
  
  If you find something that you wish were done better, and know how to fix it, 
read HowToContribute, and contribute a patch.
  
+ 
+ [[BR]]
+ [[Anchor(6)]]
- == 6. If I add new data-nodes to the cluster will HDFS move the blocks to the 
newly added nodes in order to balance disk space utilization between the nodes? 
==
+ '''6. [#6 If I add new data-nodes to the cluster will HDFS move the blocks to 
the newly added nodes in order to balance disk space utilization between the 
nodes?]'''
  
  No, HDFS will not move blocks to new nodes automatically. However, newly 
created files will likely have their blocks placed on the new nodes.
  
@@ -72, +90 @@

   2. A simpler way, with no interruption of service, is to turn up the replication of files, wait for transfers to stabilize, and then turn the replication back down (see the sketch after this list).
   3. Yet another way to re-balance blocks is to turn off the data-node that is full, wait until its blocks have been replicated elsewhere, and then bring it back online. The over-replicated blocks will be removed at random from different nodes, so the blocks really are rebalanced rather than merely removed from the current node.
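+ 
+ Below is a minimal sketch of the second approach: a small Java client that raises and later lowers the replication factor of a file so that HDFS places the extra block copies on the new, under-utilized nodes. The class name, file path and replication factors are illustrative only and not part of any Hadoop tool:
+ 
+ {{{
+ // Illustrative only: temporarily raise a file's replication, then lower it.
+ import org.apache.hadoop.conf.Configuration;
+ import org.apache.hadoop.fs.FileSystem;
+ import org.apache.hadoop.fs.Path;
+ 
+ public class RaiseReplication {
+   public static void main(String[] args) throws Exception {
+     Configuration conf = new Configuration();  // picks up the cluster's hadoop-site.xml
+     FileSystem fs = FileSystem.get(conf);      // the cluster's hdfs
+     Path file = new Path(args[0]);             // an existing hdfs file to re-balance
+     fs.setReplication(file, (short) 5);        // extra copies spread onto the new nodes
+     // ... wait for the transfers to stabilize (e.g. watch the name-node web UI) ...
+     fs.setReplication(file, (short) 3);        // excess replicas are then removed
+   }
+ }
+ }}}
+ 
+ The same can be done for a whole directory tree by applying `setReplication` to each file under it.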
  
+ 
+ [[BR]]
+ [[Anchor(7)]]
- == 7. What is the purpose of the secondary name-node? ==
+ '''7. [#7 What is the purpose of the secondary name-node?]'''
  
  The term "secondary name-node" is somewhat misleading.
  It is not a name-node in the sense that data-nodes cannot connect to the 
secondary name-node,
@@ -91, +112 @@

  You will also need to restart the whole cluster in this case.
  
  
+ [[BR]]
+ [[Anchor(8)]]
- == 8. What is the Distributed Cache used for? ==
+ '''8. [#8 What is the Distributed Cache used for?]'''
  
  The distributed cache is used to distribute large read-only files needed by map/reduce jobs to the cluster's nodes. The framework copies the necessary files from a url (either hdfs: or http:) onto the slave node before any tasks for the job are executed on that node. The files are copied only once per job and so should not be modified by the application.
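+ 
+ For illustration, a job might register a read-only lookup file with the cache as below; the file path is hypothetical and the rest of the job set-up is omitted:
+ 
+ {{{
+ import java.net.URI;
+ import org.apache.hadoop.filecache.DistributedCache;
+ import org.apache.hadoop.mapred.JobConf;
+ 
+ public class CacheExample {
+   public static void main(String[] args) throws Exception {
+     JobConf job = new JobConf(CacheExample.class);
+     // The file must already be in hdfs; it is copied to each slave node
+     // once per job, before any task of the job runs on that node.
+     DistributedCache.addCacheFile(new URI("/user/me/lookup.dat"), job);
+     // ... set the mapper, reducer and input/output paths, then submit the job ...
+   }
+ }
+ }}}
+ 
+ Inside a map or reduce task the local copies can then be located with `DistributedCache.getLocalCacheFiles(job)`.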
  
+ 
+ [[BR]]
+ [[Anchor(9)]]
- == 9. Can I write create/write-to hdfs files directly from my map/reduce 
tasks? ==
+ '''9. [#9 Can I create/write-to hdfs files directly from my map/reduce tasks?]'''
  
  Yes. (Clearly, you want this since you need to create/write-to files other 
than the output-file written out by 
[http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/OutputCollector.html
 OutputCollector].)
  
@@ -113, +139 @@

  
  To get around this the framework helps the application-writer out by 
maintaining a special '''${mapred.output.dir}/_${taskid}''' sub-dir for each 
task-attempt on hdfs where the output of the reduce task-attempt goes. On 
successful completion of the task-attempt the files in the 
${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to 
${mapred.output.dir}. Of course, the framework discards the sub-directory of 
unsuccessful task-attempts. This is completely transparent to the application.
  
- The app-writer can take advantage of this by creating any side-files required 
in ${mapred.output.dir} during execution of his reduce-task, and the framework 
will move them out similarly - thus you don't have to pick unique paths per 
task-attempt.
+ The application-writer can take advantage of this by creating any side-files 
required in ${mapred.output.dir} during execution of his reduce-task, and the 
framework will move them out similarly - thus you don't have to pick unique 
paths per task-attempt.
  
  Fine-print: the value of ${mapred.output.dir} during execution of a particular task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path) JobConf.setOutputPath]. ''So, just create any hdfs files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.''
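+ 
+ A minimal sketch of a reducer (old `org.apache.hadoop.mapred` API, generic form) that writes such a side-file; the side-file name is hypothetical:
+ 
+ {{{
+ import java.io.IOException;
+ import java.util.Iterator;
+ import org.apache.hadoop.fs.FSDataOutputStream;
+ import org.apache.hadoop.fs.FileSystem;
+ import org.apache.hadoop.fs.Path;
+ import org.apache.hadoop.io.Text;
+ import org.apache.hadoop.mapred.JobConf;
+ import org.apache.hadoop.mapred.MapReduceBase;
+ import org.apache.hadoop.mapred.OutputCollector;
+ import org.apache.hadoop.mapred.Reducer;
+ import org.apache.hadoop.mapred.Reporter;
+ 
+ public class SideFileReducer extends MapReduceBase
+     implements Reducer<Text, Text, Text, Text> {
+   private FSDataOutputStream side;
+ 
+   public void configure(JobConf job) {
+     try {
+       // Inside the task this resolves to ${mapred.output.dir}/_${taskid};
+       // the framework promotes its contents if the attempt succeeds.
+       Path outDir = new Path(job.get("mapred.output.dir"));
+       FileSystem fs = FileSystem.get(job);
+       side = fs.create(new Path(outDir, "side-file.txt"));  // hypothetical name
+     } catch (IOException e) {
+       throw new RuntimeException(e);
+     }
+   }
+ 
+   public void reduce(Text key, Iterator<Text> values,
+                      OutputCollector<Text, Text> output, Reporter reporter)
+       throws IOException {
+     while (values.hasNext()) {
+       output.collect(key, values.next());    // the normal job output
+     }
+     side.writeBytes(key.toString() + "\n");  // the extra side output
+   }
+ 
+   public void close() throws IOException {
+     side.close();
+   }
+ }
+ }}}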
  
