Re: Which hardware to choose
Of course it all depends... But something like this could work:

Leave 1-2 GB for the kernel, page cache, tools, overhead, etc.
Plan 3-4 GB each for the DataNode and the TaskTracker.
Plan 2.5-3 GB per slot. Depending on the kinds of jobs, you may need more or less memory per slot.
Have 2-3 times as many mappers as reducers (depending on the kinds of jobs you run).
As Michael pointed out, the ratio of cores (hyperthreads) per disk matters.

With those initial rules of thumb you'd arrive somewhere between 10 mappers + 5 reducers and 9 mappers + 4 reducers.

Try, test, measure, adjust, rinse, repeat.

Cheers,
Joep

On Tue, Oct 2, 2012 at 8:42 PM, Alexander Pivovarov apivova...@gmail.com wrote:
All configs are per node. No HBase, only Hive and Pig installed.
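The rules of thumb in Joep's message can be worked through as arithmetic. The sketch below is only an illustration: the function names `slot_budget` and `split_slots` are made up for this example, the 48 GB figure comes from the data node spec quoted in the thread, and the 2:1 mapper-to-reducer split is the low end of the 2-3x range. It is not how Hadoop itself sizes anything.

```python
import math


def slot_budget(total_gb, os_gb, daemon_gb, per_slot_gb):
    """Number of task slots that fit in RAM after fixed overheads.

    daemon_gb is reserved twice: once for the DataNode and once for
    the TaskTracker, as suggested in the message above.
    """
    available = total_gb - os_gb - 2 * daemon_gb
    return math.floor(available / per_slot_gb)


def split_slots(slots, map_to_reduce_ratio=2):
    """Split a slot count into (mappers, reducers) at roughly ratio:1."""
    reducers = slots // (map_to_reduce_ratio + 1)
    return slots - reducers, reducers


# Larger overheads combined with small slots: 48 - 2 - 2*4 = 38 GB at 2.5 GB per slot.
high = split_slots(slot_budget(48, os_gb=2, daemon_gb=4, per_slot_gb=2.5))
# Smaller overheads combined with large slots: 48 - 1 - 2*3 = 41 GB at 3 GB per slot.
low = split_slots(slot_budget(48, os_gb=1, daemon_gb=3, per_slot_gb=3))

print(high, low)
```

With these inputs the two cases come out at 10 mappers + 5 reducers and 9 mappers + 4 reducers, which matches the range given in the message. Other combinations of the same ranges give slightly different counts, so treat the result as a starting point for testing rather than a target.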
Re: Which hardware to choose
Well... If you're not running HBase, you're less harmed by minimal swapping, so you could push the number of slots and oversubscribe.

The only thing I would suggest is that you monitor your system closely as you adjust the number of slots.

You have to admit though, it's fun to tune the cluster. :-)

On Oct 3, 2012, at 12:09 PM, J. Rottinghuis jrottingh...@gmail.com wrote:
Of course it all depends... But something like this could work:
Leave 1-2 GB for the kernel, page cache, tools, overhead, etc.
Plan 3-4 GB each for the DataNode and the TaskTracker.
Plan 2.5-3 GB per slot. Depending on the kinds of jobs, you may need more or less memory per slot.
Have 2-3 times as many mappers as reducers (depending on the kinds of jobs you run).
As Michael pointed out, the ratio of cores (hyperthreads) per disk matters.
With those initial rules of thumb you'd arrive somewhere between 10 mappers + 5 reducers and 9 mappers + 4 reducers.
Try, test, measure, adjust, rinse, repeat.
Cheers,
Joep
Re: Which hardware to choose
Not sure the following options are available:

Integrated ICH10R on motherboard
LSI® 6Gb SAS2008 daughtercard
Dell PERC H200
Dell PERC H700
LSI MegaRAID® SAS 9260-8i

http://www.dell.com/us/enterprise/p/poweredge-c2100/pd

On Tue, Oct 2, 2012 at 10:59 AM, Oleg Ruchovets oruchov...@gmail.com wrote:
Great, thank you for such detailed information.
By the way, what type of disk controller do you use?
Thanks,
Oleg.
Re: Which hardware to choose
Only 24 map and 8 reduce tasks for 38 data nodes? Are you sure that's right? Sounds VERY low for a cluster that size.

We have only 10 C2100s and are running, I believe, 140 map and 70 reduce slots so far with pretty decent performance.

On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
38 data nodes + 2 Name Nodes
Data Node: Dell PowerEdge C2100 series
2 x XEON x5670
48 GB RAM ECC (12x4GB 1333MHz)
12 x 2 TB 7200 RPM SATA HDD (with hot swap) JBOD
Intel Gigabit ET Dual port PCIe x4
Redundant Power Supply
Hadoop CDH3
max map tasks 24
max reduce tasks 8
Re: Which hardware to choose
Had to ask :D

On 10/02/2012 07:19 PM, Russell Jurney wrote:
I believe he means per node.
Russell Jurney http://datasyndrome.com
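The per-node versus cluster-wide confusion above is easier to see with the numbers side by side. This is a small sketch with hypothetical helper names; it assumes the 24/8 figures are per node (as later confirmed in the thread) and that the 140/70 figures quoted for the 10-node cluster are cluster-wide totals, which the message only states as "I believe".

```python
def cluster_slots(nodes, maps_per_node, reduces_per_node):
    """Cluster-wide (map, reduce) slot totals from per-node settings."""
    return nodes * maps_per_node, nodes * reduces_per_node


def per_node_slots(nodes, total_maps, total_reduces):
    """Average per-node (map, reduce) slots from cluster-wide totals."""
    return total_maps / nodes, total_reduces / nodes


# 38 data nodes at 24 map + 8 reduce slots each.
print(cluster_slots(38, 24, 8))
# 10 nodes with an assumed cluster-wide total of 140 map + 70 reduce slots.
print(per_node_slots(10, 140, 70))
```

Read this way, the 38-node cluster has 912 map and 304 reduce slots in total, and the 10-node cluster averages 14 map and 7 reduce slots per node.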
Re: Which hardware to choose
I think he's saying that it's 24 maps and 8 reducers per node, and at 48 GB that could be too many mappers. Especially if they want to run HBase.

On Oct 2, 2012, at 8:14 PM, hadoopman hadoop...@gmail.com wrote:
Only 24 map and 8 reduce tasks for 38 data nodes? Are you sure that's right? Sounds VERY low for a cluster that size.
We have only 10 C2100s and are running, I believe, 140 map and 70 reduce slots so far with pretty decent performance.
Re: Which hardware to choose
What is a reasonable number for this hardware?

On 10/02/2012 09:40 PM, Michael Segel wrote:
I think he's saying that it's 24 maps and 8 reducers per node, and at 48 GB that could be too many mappers. Especially if they want to run HBase.

--
Marcos Luis Ortíz Valmaseda
Data Engineer, Sr. System Administrator at UCI
about.me/marcosortiz
Re: Which hardware to choose
All configs are per node. No HBase, only Hive and Pig installed.

On Tue, Oct 2, 2012 at 9:40 PM, Michael Segel michael_se...@hotmail.com wrote:
I think he's saying that it's 24 maps and 8 reducers per node, and at 48 GB that could be too many mappers. Especially if they want to run HBase.
Re: Which hardware to choose
Hi Oleg,

Cloudera and Dell set up the following cluster for my company. The company receives 1.5 TB of raw data per day.

38 data nodes + 2 Name Nodes

Data Node: Dell PowerEdge C2100 series
2 x XEON x5670
48 GB RAM ECC (12x4GB 1333MHz)
12 x 2 TB 7200 RPM SATA HDD (with hot swap) JBOD
Intel Gigabit ET Dual port PCIe x4
Redundant Power Supply
Hadoop CDH3
max map tasks 24
max reduce tasks 8

The Name Node and Secondary Name Node are similar, but with 96 GB RAM (not sure why) and 6 x 600 GB 15K RPM Serial SCSI disks in RAID10.

Another config is here, page 298:
http://books.google.com/books?id=Wu_xeGdU4G8Cpg=PA298lpg=PA298dq=hadoop+jbodsource=blots=i7xVQBPb_wsig=8mhq-MtpkRcTiRB1ioKciMxIasghl=ensa=Xei=AGtqUMK6D8T10gHD4ICQAQved=0CEMQ6AEwAg#v=onepageq=hadoop%20jbodf=false

You probably need just 1 computer with 10 x 2 TB SATA HDD.

On Mon, Oct 1, 2012 at 6:02 PM, Oleg Ruchovets oruchov...@gmail.com wrote:
Hi,
We are at a very early stage of our Hadoop project and want to do a POC. We have ~5-6 terabytes of raw data and we are going to execute some aggregations. We plan to use 8-10 machines.
Questions:
1) Which hardware should we use:
a) How many disks, and which disks are better to use?
b) How much RAM?
c) How many CPUs?
2) Please share best practices and tips/tricks related to utilising hardware for Hadoop projects.
Thanks in advance,
Oleg.
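A rough storage estimate helps put the disk counts in this message in context. The sketch below is illustrative only: the helper names are invented, and the replication factor of 3 is an assumption (it is the HDFS default, but the thread does not say what this cluster uses). It ignores compression, intermediate job output, and filesystem overhead.

```python
def raw_capacity_tb(nodes, disks_per_node, disk_tb):
    """Total raw disk capacity in TB."""
    return nodes * disks_per_node * disk_tb


def usable_capacity_tb(raw_tb, replication=3):
    """Capacity left for unique data when every block is stored `replication` times."""
    return raw_tb / replication


# 38 data nodes, each with 12 x 2 TB disks.
cluster_raw = raw_capacity_tb(38, 12, 2)
cluster_usable = usable_capacity_tb(cluster_raw)

# Days of intake the cluster could hold at 1.5 TB of raw data per day.
days_of_intake = cluster_usable / 1.5

# POC: upper end of the 5-6 TB estimate, stored three times,
# compared with one machine holding 10 x 2 TB disks.
poc_raw_needed = 6 * 3
single_box_raw = raw_capacity_tb(1, 10, 2)

print(cluster_raw, cluster_usable, int(days_of_intake))
print(poc_raw_needed, single_box_raw)
```

Under these assumptions the POC data set needs about 18 TB of raw disk even at 3x replication, which is within the 20 TB of the single machine suggested above. A single machine cannot place replicas on separate nodes, so the comparison is about disk volume only.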