Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by cfellows:
http://wiki.apache.org/lucene-hadoop/MachineScaling

------------------------------------------------------------------------------
- == Machine Scaling ==
+ Among the software questions for setting up and running Hadoop, there are a few other questions that relate to hardware scaling:
+ 
+  1. What are the optimum machine configurations for running a Hadoop cluster?
+  1. Should I use a smaller number of high-end/high-performance machines or a larger number of "commodity" machines?
+  1. How does the Hadoop/parallel distributed processing community define "commodity"?
+ 
+ '''Note:''' The initial section of this page will focus on datanodes.
+ 
+ In answer to questions 1 and 2 above, the possible hardware options can be grouped into three rough categories:
+ 
+  A. Database-class machine with many (>10) fast SAS drives, >10 GB of memory, and two or four quad-core CPUs. Approximate cost: $20K.
+  A. Generic production machine with 2 x 250 GB SATA drives, 4-12 GB of RAM, and two dual-core CPUs (e.g. a Dell 1950). Cost is about $2-5K.
+  A. Low-end "beige box" machine with 2 SATA drives of variable size, 4 GB of RAM, and a single dual-core CPU. Cost is about $1K.
+ 
+ For a $50K budget, most users would take 25 x (B) over 50 x (C) because of the simpler and smaller administration burden, even though the cost/performance would be nominally about the same. Most users would avoid 2 x (A) like the plague. (A configuration sketch for a category (B) datanode is given at the end of this page.)
+ 
+ For question 3, "commodity" hardware is best defined as standardized, easily available components that can be purchased from multiple distributors/retailers. Even given this definition, there is a range of quality that can be purchased for your cluster. As mentioned above, users generally avoid the low-end, cheap solutions. The primary motivation for avoiding low-end solutions is "real" cost: cheap parts mean a greater number of failures, which require more maintenance and therefore more cost. Many users spend $2K-$5K per machine. For a longer discussion of "scaling out", see: http://jcole.us/blog/archives/2007/06/10/scaling-out-and-up-a-compromise/
+ 
+ '''More specifics:'''
+ 
+ Hadoop benefits greatly from ECC memory, which is not low-end. Multi-core boxes tend to give more computation per dollar, per watt, and per unit of operational maintenance. But the highest-clockrate processors tend not to be cost-effective, nor are the very largest drives. So moderately high-end commodity hardware is the most cost-effective choice for Hadoop today.
+ 
+ Some users use cast-off machines that were not reliable enough for other applications. These machines originally cost about 2/3 of what normal production boxes cost and achieve almost exactly 1/2 as much. Production boxes are typically dual-CPU machines with dual-core CPUs.
+ 
+ '''RAM:'''
+ 
+ Many users find that most Hadoop applications consume very little memory. Users tend to have 4-8 GB per machine, with 2 GB probably being too little.
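+ 
+ '''Configuration sketch:'''
+ 
+ The sketch below is a minimal, hypothetical example of how a category (B) datanode (2 SATA drives, 4 cores, 4-8 GB of RAM) might be reflected in Hadoop's configuration knobs, using the 0.x-era property names. The directory paths, slot counts, and heap size are illustrative assumptions, not recommendations. In practice these properties would live in the cluster's hadoop-site.xml; the Java form is used here only to show the hardware-to-configuration mapping compactly.
+ 
+ {{{
+ // Hypothetical sketch: mapping a mid-range "category B" node (2 SATA drives,
+ // 4 cores, 4-8 GB RAM) onto Hadoop configuration properties. Paths and values
+ // are illustrative assumptions, not tested recommendations.
+ import org.apache.hadoop.mapred.JobConf;
+ 
+ public class MidRangeNodeConfig {
+     public static JobConf create() {
+         JobConf conf = new JobConf();
+         // Two SATA drives: give HDFS a data directory on each spindle so
+         // block I/O is spread across both disks.
+         conf.set("dfs.data.dir", "/disk1/hdfs/data,/disk2/hdfs/data");
+         // Spread MapReduce intermediate output across the same two spindles.
+         conf.set("mapred.local.dir", "/disk1/mapred/local,/disk2/mapred/local");
+         // Dual x dual-core CPUs (4 cores): roughly one map slot per core,
+         // with fewer reduce slots.
+         conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
+         conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
+         // 4-8 GB of RAM: a modest per-task child heap keeps
+         // (map slots + reduce slots) x heap well under physical memory.
+         conf.set("mapred.child.java.opts", "-Xmx512m");
+         return conf;
+     }
+ }
+ }}}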