Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "TestFaqPage" page has been changed by SomeOtherAccount.
http://wiki.apache.org/hadoop/TestFaqPage?action=diff&rev1=1&rev2=2

--------------------------------------------------

- = Hadoop FAQ =
+ #pragma section-numbers on
  
+ '''Hadoop FAQ'''
- [[#A1|General]]
- [[#A2|MapReduce]]
- [[#A3|HDFS]]
- [[#A4|Platform Specific]]
  
- <<BR>> <<Anchor(1)>> [[#A1|General]]
+ <<TableOfContents(3)>>
  
- <<BR>> <<Anchor(1.1)>> '''1. [[#A1.1|What is Hadoop?]]'''
+ = General =
+ 
+ == What is Hadoop? ==
  
  [[http://hadoop.apache.org/core/|Hadoop]] is a distributed computing platform 
written in Java.  It incorporates features similar to those of the 
[[http://en.wikipedia.org/wiki/Google_File_System|Google File System]] and of 
[[http://en.wikipedia.org/wiki/MapReduce|MapReduce]].  For some details, see 
HadoopMapReduce.
  
- <<BR>> <<Anchor(1.2)>> '''2. [[#A1.2|What platform does Hadoop run on?]]'''
+ == What platform does Hadoop run on? ==
  
   1. Java 1.6.x or higher, preferably from Sun; see HadoopJavaVersions.
   1. Linux and Windows are the supported operating systems, but BSD, Mac OS/X, 
and OpenSolaris are known to work. (Windows requires the installation of 
[[http://www.cygwin.com/|Cygwin]]).
  
- <<BR>> <<Anchor(1.3)>> '''3. [[#A1.3|How well does Hadoop scale?]]'''
+ == How well does Hadoop scale? ==
  
  Hadoop has been demonstrated on clusters of up to 4000 nodes.  Sort 
performance on 900 nodes is good (sorting 9TB of data on 900 nodes takes around 
1.8 hours) and [[attachment:sort900-20080115.png|improving]] using these 
non-default configuration values:
  
@@ -38, +37 @@

   * `tasktracker.http.threads = 50`
   * `mapred.child.java.opts = -Xmx1024m`
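 
 For reference, a sketch of how such values can be set in the MapReduce 
configuration file (conf/mapred-site.xml on 0.20-era releases, 
conf/hadoop-site.xml on earlier ones; the file name is an assumption about 
your release, adjust as needed):
 
 {{{
   <property>
     <name>tasktracker.http.threads</name>
     <value>50</value>
   </property>
   <property>
     <name>mapred.child.java.opts</name>
     <value>-Xmx1024m</value>
   </property>
 }}}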
  
- <<BR>> <<Anchor(1.4)>> '''4. [[#A1.4|What kind of hardware scales best for 
Hadoop?]]'''
+ == What kind of hardware scales best for Hadoop? ==
  
  The short answer is dual-processor/dual-core machines with 4-8GB of RAM using 
ECC memory. Machines should be moderately high-end commodity hardware to be 
most cost-effective: typically 1/2 to 2/3 the cost of normal production 
application servers, but not desktop-class machines. This tends to be $2-5K 
per machine. For a more detailed discussion, see the MachineScaling page.
  
- <<BR>> <<Anchor(1.5)>> '''5. [[#A1.5|How does GridGain compare to Hadoop?]]'''
+ == How does GridGain compare to Hadoop? ==
  
  !GridGain does not support data-intensive jobs. For more details, see 
HadoopVsGridGain.
  
- <<BR>> <<Anchor(1.6)>> '''6. [[#A1.6|I have a new node I want to add to a 
running Hadoop cluster; how do I start services on just one node?]]'''
+ == I have a new node I want to add to a running Hadoop cluster; how do I 
start services on just one node? ==
  
  This also applies when a machine has crashed and rebooted, etc., and you need 
it to rejoin the cluster. You do not need to shut down and/or restart the 
entire cluster in this case.
  
@@ -59, +58 @@

  $ bin/hadoop-daemon.sh start datanode
  $ bin/hadoop-daemon.sh start tasktracker
  }}}
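 Run both commands on the node that is being added (or is rejoining); the rest 
of the cluster keeps running.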
- <<BR>> <<Anchor(1.7)>> '''7. [[#A1.7|Is there an easy way to see the status 
and health of my cluster?]]'''
+ 
+ == Is there an easy way to see the status and health of my cluster? ==
  
  There are web-based interfaces to both the JobTracker (MapReduce master) and 
NameNode (HDFS master) which display status pages about the state of the entire 
system. By default, these are located at http://job.tracker.addr:50030/ and 
http://name.node.addr:50070/.
  
@@ -71, +71 @@

  $ bin/hadoop dfsadmin -report
  }}}
  
- <<BR>> <<Anchor(1.8)>> '''8. [[#A1.8|How much network bandwidth might I need 
between racks in a medium size (40-80 node) Hadoop cluster?]]'''
+ == How much network bandwidth might I need between racks in a medium size 
(40-80 node) Hadoop cluster? ==
  
  The true answer depends on the types of jobs you're running. As a 
back-of-the-envelope calculation, one might figure something like this:
  
@@ -81, +81 @@

  
  So, the simple answer is that 4-6Gbps is most likely just fine for most 
practical jobs. If you want to be extra safe, many inexpensive switches can 
operate in a "stacked" configuration where the bandwidth between them is 
essentially backplane speed. That should scale you to 96 nodes with plenty of 
headroom. Many inexpensive gigabit switches also have one or two 10GigE ports 
which can be used effectively to connect to each other or to a 10GE core.
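  
  As a purely illustrative sketch (all numbers here are assumptions, not 
measurements): a rack of 40 nodes with 1GbE NICs could in theory ask for 
40Gbps of uplink, but only a fraction of shuffle and replication traffic 
crosses the rack boundary at any instant, so the sustained inter-rack load is 
usually a small multiple of 1Gbps; this is why 4-6Gbps of uplink tends to be 
enough.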
  
- <<BR>> <<Anchor(1.9)>> '''9. [[#A1.9|How can I help to make Hadoop 
better?]]'''
+ == How can I help to make Hadoop better? ==
  
  If you have trouble figuring out how to use Hadoop, then, once you've figured 
something out (perhaps with the help of the 
[[http://hadoop.apache.org/core/mailing_lists.html|mailing lists]]), pass that 
knowledge on to others by adding something to this wiki.
  
  If you find something that you wish were done better, and know how to fix it, 
read HowToContribute, and contribute a patch.
  
- <<BR>> <<Anchor(2)>> [[#A2|MapReduce]]
+ = MapReduce =
  
- <<BR>> <<Anchor(2.1)>> '''1. [[#A2.1|Do I have to write my application in 
Java?]]'''
+ == Do I have to write my application in Java? ==
  
  No.  There are several ways to incorporate non-Java code.
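  
  One common route is HadoopStreaming, which runs any executable as the mapper 
and reducer. A minimal sketch (the jar path and the mapper/reducer programs 
are assumptions for your release and system):
  
 {{{
 $ bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
     -input myInputDir -output myOutputDir \
     -mapper /bin/cat -reducer /usr/bin/wc
 }}}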
  
@@ -252, +252 @@

  bin/hadoop dfs -ls 'in*'
  }}}
  
- <<BR>> <<Anchor(3.8)>> '''8. [[#A3.8|Can I have multiple files in HDFS use 
different block sizes?]]'''
+ == Can I have multiple files in HDFS use different block sizes? ==
  
  Yes. HDFS provides an API to specify the block size when you create a file. <<BR>> 
See 
[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)|FileSystem.create(Path,
 overwrite, bufferSize, replication, blockSize, progress)]]
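  
  A minimal sketch of using that overload (the path, sizes, and replication 
below are arbitrary example values):
  
 {{{
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FSDataOutputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 
 public class BlockSizeExample {
   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     FileSystem fs = FileSystem.get(conf);
     long blockSize = 32L * 1024 * 1024;               // 32MB, for this file only
     FSDataOutputStream out = fs.create(
         new Path("/tmp/example.dat"),                 // arbitrary example path
         true,                                         // overwrite
         conf.getInt("io.file.buffer.size", 4096),     // bufferSize
         (short) 3,                                    // replication
         blockSize);                                   // per-file block size
     out.writeBytes("this file uses its own block size\n");
     out.close();
   }
 }
 }}}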
  
- <<BR>> <<Anchor(3.9)>> '''9. [[#A3.9|Does HDFS make block boundaries between 
records?]]'''
+ == Does HDFS make block boundaries between records? ==
  
  No, HDFS does not provide a record-oriented API and therefore is not aware of 
records or the boundaries between them.
  
- <<BR>> <<Anchor(3.10)>> '''10. [[#A3.10|HWhat happens when two clients try to 
write into the same HDFS file?]]'''
+ == What happens when two clients try to write into the same HDFS file? ==
  
  HDFS supports exclusive writes only. <<BR>> When the first client contacts 
the name-node to open the file for writing, the name-node grants a lease to the 
client to create this file.  When the second client tries to open the same file 
for writing, the name-node  will see that the lease for the file is already 
granted to another client, and will reject the open request for the second 
client.
  
- <<BR>> <<Anchor(3.11)>> '''11. [[#A3.11|How to limit Data node's disk 
usage?]]'''
+ == How to limit Data node's disk usage? ==
  
  Use the dfs.datanode.du.reserved configuration value in 
$HADOOP_HOME/conf/hdfs-site.xml to limit disk usage.
  
@@ -278, +278 @@

    </property>
  }}}
  
- <<BR>> <<Anchor(3.12)>> '''12. [[#A3.12|On an individual data node, how do 
you balance the blocks on the disk?]]'''
+ == On an individual data node, how do you balance the blocks on the disk? ==
  
  Hadoop currently has no way to do this automatically.  To do it manually:
  
@@ -286, +286 @@

   2. Use the UNIX mv command to move the individual block and meta file pairs 
from one directory to another on each host (a sketch follows this list)
   3. Restart HDFS
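  
  A sketch of step 2 (the block ID, generation stamp, and dfs.data.dir paths 
are made-up examples; the real names come from your own data directories):
  
 {{{
 $ mv /data1/dfs/data/current/blk_1234567890 /data2/dfs/data/current/
 $ mv /data1/dfs/data/current/blk_1234567890_1001.meta /data2/dfs/data/current/
 }}}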
  
+ == What does "file could only be replicated to 0 nodes, instead of 1" mean? ==
- 
- <<BR>> <<Anchor(3.13)>> '''13. [[#A3.13|What does "file could only be 
replicated to 0 nodes, instead of 1" mean?]]'''
  
  The NameNode does not have any available DataNodes.  This can have a wide 
variety of causes.  Check the DataNode logs, the NameNode logs, network 
connectivity, ...
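  
  A quick first check is `bin/hadoop dfsadmin -report` (see above), which shows 
how many DataNodes the NameNode currently considers live.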
  
+ = Platform Specific =
+ == Windows ==
  
+ === Building / Testing Hadoop on Windows ===
- <<BR>> <<Anchor(4)>> [[#A4|Platform Specific]]
- <<BR>> <<Anchor(4.1)>> [[#A4.1|Windows]]
- 
- <<BR>> <<Anchor(4.1.1)>> '''1. [[#A4.1.1|Building / Testing Hadoop on 
Windows]]'''
  
  The Hadoop build on Windows can be run from inside a Windows (not Cygwin) 
command prompt window.
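  
  For example, assuming the Ant-based build of contemporary releases (the 
target names are from the stock build.xml and may differ in your version):
  
 {{{
 > ant jar
 > ant test-core
 }}}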
  
