Re: Which hardware to choose

2012-10-03 Thread J. Rottinghuis
Of course it all depends...
But something like this could work:

Leave 1-2 GB for the kernel, pagecache, tools, overhead etc.
Plan 3-4 GB for Datanode and Tasktracker each

Plan 2.5-3 GB per slot. Depending on the kinds of jobs, you may need more
or less memory per slot.
Have 2-3 times as many mappers as reducers (depending on the kinds of jobs
you run).

As Michael pointed out, the ratio of cores (hyperthreads) per disk matters.

With those initial rules of thumb you'd arrive somewhere between
10 mappers + 5 reducers
and
9 mappers + 4 reducers
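
A quick sketch of that arithmetic for a 48 GB node (the exact reserve values are assumptions picked from within the ranges above):

```python
# Back-of-envelope slot estimate per node, following the rules of thumb
# above. All figures in GB; the reserve values are assumptions chosen
# from within the stated 1-2 / 3-4 / 2.5-3 GB ranges.

def estimate_slots(total_ram_gb, os_reserve=2.0, daemon_gb=3.5,
                   gb_per_slot=3.0, map_to_reduce_ratio=2):
    """Return (mappers, reducers) for one node."""
    # RAM left after the kernel/pagecache reserve and the DataNode and
    # TaskTracker JVMs (one daemon_gb each)
    usable = total_ram_gb - os_reserve - 2 * daemon_gb
    slots = int(usable // gb_per_slot)
    reducers = slots // (map_to_reduce_ratio + 1)
    mappers = slots - reducers
    return mappers, reducers

print(estimate_slots(48))  # -> (9, 4) for a 48 GB node
```

Nudging the reserves toward the low ends of the ranges gets you to the 10 + 5 figure instead.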

Try, test, measure, adjust, rinse, repeat.

Cheers,

Joep

On Tue, Oct 2, 2012 at 8:42 PM, Alexander Pivovarov apivova...@gmail.com wrote:

 All configs are per node.
 No HBase, only Hive and Pig installed




Re: Which hardware to choose

2012-10-03 Thread Michael Segel
Well... 

If you're not running HBase, you're less harmed by minimal swapping, so you 
could push the number of slots and oversubscribe. 
The only thing I would have to suggest is that you monitor your system closely 
as you adjust the number of slots.
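
One minimal way to do that monitoring on Linux, assuming /proc/meminfo is readable (a sketch, not a Hadoop-specific tool): run it periodically while jobs execute and watch for growth.

```python
# Minimal swap-usage check for Linux, reading /proc/meminfo directly.
# Sustained growth in the reported number while jobs run suggests the
# slot count is too high for the node's RAM.

def swap_used_mb():
    """Return swap currently in use, in MB."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])  # values are reported in kB
    return (info["SwapTotal"] - info["SwapFree"]) // 1024

print(swap_used_mb())
```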

You have to admit though, it's fun to tune the cluster. :-)

On Oct 3, 2012, at 12:09 PM, J. Rottinghuis jrottingh...@gmail.com wrote:

 Of course it all depends...
 But something like this could work:
 
 Leave 1-2 GB for the kernel, pagecache, tools, overhead etc.
 Plan 3-4 GB for Datanode and Tasktracker each
 
 Plan 2.5-3 GB per slot. Depending on the kinds of jobs, you may need more
 or less memory per slot.
 Have 2-3 times as many mappers as reducers (depending on the kinds of jobs
 you run).
 
 As Michael pointed out, the ratio of cores (hyperthreads) per disk matters.
 
 With those initial rules of thumb you'd arrive somewhere between
 10 mappers + 5 reducers
 and
 9 mappers + 4 reducers
 
 Try, test, measure, adjust, rinse, repeat.
 
 Cheers,
 
 Joep
 
 



Re: Which hardware to choose

2012-10-02 Thread Alexander Pivovarov
Not sure

The following options are available:
Integrated ICH10R on motherboard
LSI® 6Gb SAS2008 daughtercard
Dell PERC H200
Dell PERC H700
LSI MegaRAID® SAS 9260-8i

http://www.dell.com/us/enterprise/p/poweredge-c2100/pd

On Tue, Oct 2, 2012 at 10:59 AM, Oleg Ruchovets oruchov...@gmail.com wrote:

 Great ,

 Thank you for the such detailed information,

 By the way what type of Disk Controller do you use?

 Thanks
 Oleg.





Re: Which hardware to choose

2012-10-02 Thread hadoopman
Only 24 map and 8 reduce tasks for 38 data nodes? Are you sure that's 
right? That sounds VERY low for a cluster that size.


We have only 10 C2100s and are running, I believe, 140 map and 70 reduce 
slots so far with pretty decent performance.




On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:

38 data nodes + 2 Name Nodes
  
Data Node:
Dell PowerEdge C2100 series
2 x XEON x5670
48 GB RAM ECC  (12x4GB 1333MHz)
12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
Intel Gigabit ET Dual port PCIe x4
Redundant Power Supply
Hadoop CDH3
max map tasks 24
max reduce tasks 8




Re: Which hardware to choose

2012-10-02 Thread hadoopman

Had to ask :D


On 10/02/2012 07:19 PM, Russell Jurney wrote:

I believe he means per node.

Russell Jurney http://datasyndrome.com





Re: Which hardware to choose

2012-10-02 Thread Michael Segel
I think he's saying that it's 24 maps / 8 reducers per node, and at 48 GB that could 
be too many mappers. 
Especially if they want to run HBase. 

On Oct 2, 2012, at 8:14 PM, hadoopman hadoop...@gmail.com wrote:

 Only 24 map and 8 reduce tasks for 38 data nodes?  are you sure that's right? 
  Sounds VERY low for a cluster that size.
 
 We have only 10 c2100's and are running I believe 140 map and 70 reduce slots 
 so far with pretty decent performance.
 
 
 
 On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
 38 data nodes + 2 Name Nodes
   
 Data Node:
 Dell PowerEdge C2100 series
 2 x XEON x5670
 48 GB RAM ECC  (12x4GB 1333MHz)
 12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
 Intel Gigabit ET Dual port PCIe x4
 Redundant Power Supply
 Hadoop CDH3
 max map tasks 24
 max reduce tasks 8
 
 



Re: Which hardware to choose

2012-10-02 Thread Marcos Ortiz

What would be a reasonable number on this hardware?

On 10/02/2012 09:40 PM, Michael Segel wrote:

I think he's saying that it's 24 maps / 8 reducers per node, and at 48 GB that could 
be too many mappers.
Especially if they want to run HBase.




10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS 
INFORMATICAS...
CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci


--

Marcos Luis Ortíz Valmaseda
*Data Engineer & Sr. System Administrator at UCI*
about.me/marcosortiz http://about.me/marcosortiz
My Blog http://marcosluis2186.posterous.com
Tumblr's blog http://marcosortiz.tumblr.com/
@marcosluis2186 http://twitter.com/marcosluis2186




Re: Which hardware to choose

2012-10-02 Thread Alexander Pivovarov
All configs are per node.
No HBase, only Hive and Pig installed

On Tue, Oct 2, 2012 at 9:40 PM, Michael Segel michael_se...@hotmail.com wrote:

 I think he's saying that it's 24 maps / 8 reducers per node, and at 48 GB that
 could be too many mappers.
 Especially if they want to run HBase.





Re: Which hardware to choose

2012-10-01 Thread Alexander Pivovarov
Hi Oleg,

Cloudera and Dell set up the following cluster for my company.
The company receives 1.5 TB of raw data per day.

38 data nodes + 2 Name Nodes

Data Node:
Dell PowerEdge C2100 series
2 x XEON x5670
48 GB RAM ECC  (12x4GB 1333MHz)
12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
Intel Gigabit ET Dual port PCIe x4
Redundant Power Supply
Hadoop CDH3
max map tasks 24
max reduce tasks 8

Name Node and Secondary Name Node are similar, but with:
96 GB RAM (not sure why)
6 x 600 GB 15K RPM SAS (Serial Attached SCSI)
RAID10


another config is here
page 298
http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA298&lpg=PA298&dq=hadoop+jbod&source=bl&ots=i7xVQBPb_w&sig=8mhq-MtpkRcTiRB1ioKciMxIasg&hl=en&sa=X&ei=AGtqUMK6D8T10gHD4ICQAQ&ved=0CEMQ6AEwAg#v=onepage&q=hadoop%20jbod&f=false


you probably need just 1 computer with 10 x 2 TB SATA HDD
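
As a rough check on that sizing, assuming the HDFS default replication factor of 3 and an assumed ~25% allowance for MapReduce scratch space:

```python
# Back-of-envelope HDFS capacity check for a 5-6 TB POC.
# Replication factor 3 is the HDFS default; the ~25% allowance for
# MapReduce scratch space is an assumption.
raw_tb = 6                      # upper end of the 5-6 TB estimate
replication = 3
scratch_overhead = 0.25
needed_tb = raw_tb * replication * (1 + scratch_overhead)
print(needed_tb)  # 22.5 TB -> roughly 10-12 x 2 TB disks in total
```

That lands close to the 10 x 2 TB figure, though with little headroom for growth.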



On Mon, Oct 1, 2012 at 6:02 PM, Oleg Ruchovets oruchov...@gmail.com wrote:

 Hi,
   We are at a very early stage of our Hadoop project and want to do a POC.

 We have ~ 5-6 terabytes of raw data and we are going to execute some
 aggregations.

 We plan to use 8 - 10 machines.

 Questions:

   1) Which hardware should we use:
 a) How many disks, and which disks are better to use?
 b) How much RAM?
 c) How many CPUs?


 2) Please share best practices and tips / tricks related to utilizing
 hardware for Hadoop projects.

 Thanks in advance,
 Oleg.