Re: Test Hadoop performance on EC2
Sorry for the previous post; I hadn't finished it. Please skip it.

Hi all,

I've run some experiments with Hadoop on Amazon EC2. I would like to share the results, and any feedback would be appreciated.

Environment:
- Xen VM (Amazon EC2 instance ami-ee53b687)
- 1.7GHz Xeon CPU, 1.75GB of RAM, 160GB of local disk, and 250Mb/s of network bandwidth (small instance)
- Hadoop 0.17.0
- Storage: HDFS
- Test example: wordcount

Experiment 1: (fixed # of instances (8), varying data size (2MB~512MB), # of maps: 8, # of reduces: 8)

Data Size(MB) | Time(s)
512 | 124
256 | 70
128 | 41
...
8 | 22
4 | 17
2 | 21

The purpose is to observe the lowest framework overhead for wordcount. As a result, when the data size is between 2MB and 16MB, the time is around 20 seconds. May I conclude that the lowest framework overhead for wordcount is 20s?

Experiment 2: (varying # of instances (2~32), varying data size (128MB~2GB), # of maps: (2-32), # of reduces: (2-32))

Data Size(MB) | Map | Reduce | Time(s)
2048 | 32 | 32 | 140
1024 | 16 | 16 | 120
512 | 8 | 8 | 124
256 | 4 | 4 | 127
128 | 2 | 2 | 119

The purpose is to observe whether the time stays similar when each instance is allocated the same amount of data. As a result, when the data size is between 128MB and 1024MB, the time is around 120 seconds. The time is 140s when the data size is 2048MB. I think the reason is that more data to process causes more overhead.

Experiment 3: (varying # of instances (2~16), fixed data size (128MB), # of maps: (2-16), # of reduces: (2-16))

Data Size(MB) | Map | Reduce | Time(s)
128 | 16 | 16 | 31
128 | 8 | 8 | 41
128 | 4 | 4 | 69
128 | 2 | 2 | 119

The purpose is to observe how the result changes when more and more instances are added for a fixed data size. As a result, each time the number of instances doubles, the time gets smaller but is not halved. There is always the framework overhead, even given infinite instances.

In fact, I did more experiments, but I'm only posting some of the results. Interestingly, I discovered a formula for wordcount from my experiment results. That is:

Time(s) ~= 20 + ((DataSize - 8MB) * 1.6 / (# of instances))

I've checked the formula against all my experiment results and almost all of them match. Maybe it's coincidental or I have something wrong. Anyway, I just want to share my experience, and any feedback would be appreciated.

--
Best Regards,
Shawn
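The formula can be checked mechanically against the rows posted in the three tables. What follows is only a minimal Python sketch, not part of the original experiments: the names predict_seconds, MEASUREMENTS and residuals are made up for illustration, and it assumes that in Experiments 2 and 3 the number of instances equals the number of maps in each row (the post gives only the ranges). Rows elided with "..." are left out, and rows that appear in more than one table are listed once.

```python
def predict_seconds(data_mb, instances):
    """Formula from the post: 20 + ((DataSize - 8MB) * 1.6 / # of instances)."""
    return 20 + (data_mb - 8) * 1.6 / instances


# (data size in MB, instances, measured seconds), copied from the posted tables.
# Assumption: in Experiments 2 and 3 the instance count equals the map count.
MEASUREMENTS = [
    # Experiment 1 (8 instances)
    (512, 8, 124), (256, 8, 70), (128, 8, 41), (8, 8, 22), (4, 8, 17), (2, 8, 21),
    # Experiment 2 (rows not already listed above)
    (2048, 32, 140), (1024, 16, 120), (256, 4, 127), (128, 2, 119),
    # Experiment 3 (rows not already listed above)
    (128, 16, 31), (128, 4, 69),
]


def residuals(rows=MEASUREMENTS):
    """Return (data_mb, instances, measured, predicted, measured - predicted) per row."""
    out = []
    for data_mb, instances, measured in rows:
        predicted = predict_seconds(data_mb, instances)
        out.append((data_mb, instances, measured, predicted, measured - predicted))
    return out


if __name__ == "__main__":
    for data_mb, instances, measured, predicted, diff in residuals():
        print(f"{data_mb:5d} MB on {instances:2d} instances: "
              f"measured {measured:3d}s, predicted {predicted:6.1f}s, diff {diff:+6.1f}s")
```

On these rows the largest gap is the 2048MB run (140s measured against about 122s predicted), which is the same outlier the post points out; the other rows are within about 8s of the formula.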
Re: Hadoop performance on EC2?
What does ganglia show for load and network? You should also be able to see gc stats (count and time). Might help as well.

fyi, running

> hadoop-ec2 proxy

will both set up a socks tunnel and list available urls you can cut/paste into your browser. one of the urls is for the ganglia interface.

On Apr 11, 2008, at 2:01 PM, Nate Carlson wrote:

> On Wed, 9 Apr 2008, Chris K Wensel wrote:
>> make sure all nodes are running in the same 'availability zone',
>> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347
>
> check!
>
>> and that you are using the new xen kernels.
>> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1353&categoryID=101
>> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1354&categoryID=101
>
> check!
>
>> also, make sure each node is addressing its peers via the ec2 private addresses, not the public ones.
>
> check!
>
>> there is a patch in jira for the ec2/contrib scripts that address these issues.
>> https://issues.apache.org/jira/browse/HADOOP-2410
>>
>> if you use those scripts, you will be able to see a ganglia display showing utilization on the machines.
>>
>> 8/7 map/reducers sounds like a lot.
>
> Reduced - I dropped it to 3/2 for testing.
>
> I am using these scripts now, and am still seeing very poor performance on EC2 compared to my development environment. ;(
>
> I'll be capturing some more extensive stats over the weekend, and see if I can glean anything useful...
>
> | nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
> | depriving some poor village of its idiot since 1981|

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
Re: Hadoop performance on EC2?
On Wed, 9 Apr 2008, Chris K Wensel wrote:
> make sure all nodes are running in the same 'availability zone',
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347

check!

> and that you are using the new xen kernels.
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1353&categoryID=101
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1354&categoryID=101

check!

> also, make sure each node is addressing its peers via the ec2 private addresses, not the public ones.

check!

> there is a patch in jira for the ec2/contrib scripts that address these issues.
> https://issues.apache.org/jira/browse/HADOOP-2410
>
> if you use those scripts, you will be able to see a ganglia display showing utilization on the machines.
>
> 8/7 map/reducers sounds like a lot.

Reduced - I dropped it to 3/2 for testing.

I am using these scripts now, and am still seeing very poor performance on EC2 compared to my development environment. ;(

I'll be capturing some more extensive stats over the weekend, and see if I can glean anything useful...

| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
| depriving some poor village of its idiot since 1981|
Re: Hadoop performance on EC2?
On Thu, 10 Apr 2008, Ted Dziuba wrote:
> I have seen EC2 be slower than a comparable system in development, but not by the factors that you're experiencing. One thing about EC2 that has concerned me - you are not guaranteed that your "/mnt" disk is an uncontested spindle. Early on, this was the case, but Amazon made no promises.

Interesting! My understanding was that it was. We were using S3 for storage before, and switched to HDFS, and saw similar performance on both for our needs.. we're more CPU intensive than I/O intensive.

> Also, and this may be a stupid question, are you sure that you're using the same JVM in EC2 and development? GCJ is much slower than Sun's JVM.

Yeah - our code actually requires Sun's Java6u5 JVM.. it won't run on gcj. ;)

| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
| depriving some poor village of its idiot since 1981|
Re: Hadoop performance on EC2?
I have seen EC2 be slower than a comparable system in development, but not by the factors that you're experiencing. One thing about EC2 that has concerned me - you are not guaranteed that your "/mnt" disk is an uncontested spindle. Early on, this was the case, but Amazon made no promises.

Also, and this may be a stupid question, are you sure that you're using the same JVM in EC2 and development? GCJ is much slower than Sun's JVM.

Ted

Nate Carlson wrote:
> On Thu, 10 Apr 2008, Ted Dunning wrote:
>> Are you trying to read from mySQL?
>
> No, we're outputting to MySQL. I've also verified that the MySQL server is hardly seeing any load, isn't waiting on slow queries, etc.
>
>> If so, it isn't very surprising that you could get lower performance with more readers.
>
> Indeed!
>
> | nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
> | depriving some poor village of its idiot since 1981|
Re: Hadoop performance on EC2?
On Thu, 10 Apr 2008, Ted Dunning wrote:
> Are you trying to read from mySQL?

No, we're outputting to MySQL. I've also verified that the MySQL server is hardly seeing any load, isn't waiting on slow queries, etc.

> If so, it isn't very surprising that you could get lower performance with more readers.

Indeed!

| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
| depriving some poor village of its idiot since 1981|
Re: Hadoop performance on EC2?
Are you trying to read from mySQL?

If so, it isn't very surprising that you could get lower performance with more readers.

On 4/9/08 7:07 PM, "Nate Carlson" <[EMAIL PROTECTED]> wrote:

> Hey all,
>
> We've got a job that we're running in both a development environment, and
> out on EC2. I've been rather displeased with the performance on EC2, and
> was curious if the results that we've been seeing are similar to other
> people's, or if I've got something misconfigured. ;) In both
> environments, the load on the master node is around 1-1.5, and the load on
> the slave nodes is around 8-10. I have also tried cranking up the JVM
> memory on the EC2 nodes (since we got RAM to blow), with very little
> performance difference.
>
> Basically, the job takes about 3.5 hours on development, but takes 15
> hours on EC2. With the portion that takes all the time, it is not
> dependent on any external hosts - just the MySQL server on the master
> node. I benchmarked the VCPU's between our dev and EC2, and they are
> about equivalent.. I would expect EC2 to take 1.5x as long, since there is
> one less CPU per slave, but it's taking much longer than that.
>
> Appreciate any tips!
>
> Similarities between the environments:
> - 1 master node, 2 slave nodes
> - 1 mapper and reducer on the master, 8 mappers and 7 reducers on the
>   slaves
> - Hadoop 0.16.2
> - Local HDFS storage (we were using S3 on amazon before, and I switched to
>   local storage)
> - MySQL database running on the master node
> - Xen VM's in both environments (our own Xen for dev, Amazon's for EC2)
> - Debian Etch 64-bit OS; 64-bit JVM
>
> Development master node configuration:
> - 4x VCPU's (Xeon E5335 2GHz)
> - 3GB memory
> - 4GB swap
>
> Development slave nodes configuration:
> - 3x VCPU's (Xeon E5335 2GHz)
> - 2GB memory
> - 4GB swap
>
> EC2 Configuration ("Large" instance type):
> - 2x VCPU's (Opteron 2GHz)
> - 8GB memory
> - 4GB swap
> - All nodes running in the same availability zone
>
> | nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
> | depriving some poor village of its idiot since 1981|
Re: Hadoop performance on EC2?
a few things..

make sure all nodes are running in the same 'availability zone',
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347

and that you are using the new xen kernels.
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1353&categoryID=101
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1354&categoryID=101

also, make sure each node is addressing its peers via the ec2 private addresses, not the public ones.

there is a patch in jira for the ec2/contrib scripts that address these issues.
https://issues.apache.org/jira/browse/HADOOP-2410

if you use those scripts, you will be able to see a ganglia display showing utilization on the machines.

8/7 map/reducers sounds like a lot.

ymmv

On Apr 9, 2008, at 7:07 PM, Nate Carlson wrote:

> Hey all,
>
> We've got a job that we're running in both a development environment, and out on EC2. I've been rather displeased with the performance on EC2, and was curious if the results that we've been seeing are similar to other people's, or if I've got something misconfigured. ;) In both environments, the load on the master node is around 1-1.5, and the load on the slave nodes is around 8-10. I have also tried cranking up the JVM memory on the EC2 nodes (since we got RAM to blow), with very little performance difference.
>
> Basically, the job takes about 3.5 hours on development, but takes 15 hours on EC2. With the portion that takes all the time, it is not dependent on any external hosts - just the MySQL server on the master node. I benchmarked the VCPU's between our dev and EC2, and they are about equivalent.. I would expect EC2 to take 1.5x as long, since there is one less CPU per slave, but it's taking much longer than that.
>
> Appreciate any tips!
>
> Similarities between the environments:
> - 1 master node, 2 slave nodes
> - 1 mapper and reducer on the master, 8 mappers and 7 reducers on the slaves
> - Hadoop 0.16.2
> - Local HDFS storage (we were using S3 on amazon before, and I switched to local storage)
> - MySQL database running on the master node
> - Xen VM's in both environments (our own Xen for dev, Amazon's for EC2)
> - Debian Etch 64-bit OS; 64-bit JVM
>
> Development master node configuration:
> - 4x VCPU's (Xeon E5335 2GHz)
> - 3GB memory
> - 4GB swap
>
> Development slave nodes configuration:
> - 3x VCPU's (Xeon E5335 2GHz)
> - 2GB memory
> - 4GB swap
>
> EC2 Configuration ("Large" instance type):
> - 2x VCPU's (Opteron 2GHz)
> - 8GB memory
> - 4GB swap
> - All nodes running in the same availability zone
>
> | nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
> | depriving some poor village of its idiot since 1981|

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
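As a rough way to check the private-address point in this reply, the sketch below flags peer addresses that are not in a private range. It is only an illustration, not something from the thread or the contrib scripts: the function names and the sample addresses are hypothetical, and it assumes EC2 private addresses fall in ranges that Python's ipaddress module reports as private (is_private also covers loopback and link-local addresses).

```python
import ipaddress
import socket


def is_private_address(ip_text):
    """True if the literal IP address lies in a private (non-public) range."""
    return ipaddress.ip_address(ip_text).is_private


def public_peers(hostnames):
    """Resolve each hostname and return the (hostname, ip) pairs that are public."""
    flagged = []
    for name in hostnames:
        # IPv4 lookup; a literal IP address is returned unchanged.
        ip_text = socket.gethostbyname(name)
        if not is_private_address(ip_text):
            flagged.append((name, ip_text))
    return flagged


if __name__ == "__main__":
    # Hypothetical peer list; in practice it would come from the cluster's
    # own list of slave hostnames.
    print(public_peers(["10.251.27.12", "8.8.8.8"]))
```

Any entry this returns would be a peer that the node reaches through a public address.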
RE: Hadoop performance on EC2?
We used the small instances and the difference was around 5x-8x, depending on what we tried to run. I'm really surprised that large instances have such bad performance characteristics.

D.

--
Attributor-publish with confidence
We are still hiring developers

-Original Message-
From: Nate Carlson [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 09, 2008 7:07 PM
To: core-user@hadoop.apache.org
Subject: Hadoop performance on EC2?

Hey all,

We've got a job that we're running in both a development environment, and out on EC2. I've been rather displeased with the performance on EC2, and was curious if the results that we've been seeing are similar to other people's, or if I've got something misconfigured. ;) In both environments, the load on the master node is around 1-1.5, and the load on the slave nodes is around 8-10. I have also tried cranking up the JVM memory on the EC2 nodes (since we got RAM to blow), with very little performance difference.

Basically, the job takes about 3.5 hours on development, but takes 15 hours on EC2. With the portion that takes all the time, it is not dependent on any external hosts - just the MySQL server on the master node. I benchmarked the VCPU's between our dev and EC2, and they are about equivalent.. I would expect EC2 to take 1.5x as long, since there is one less CPU per slave, but it's taking much longer than that.

Appreciate any tips!

Similarities between the environments:
- 1 master node, 2 slave nodes
- 1 mapper and reducer on the master, 8 mappers and 7 reducers on the slaves
- Hadoop 0.16.2
- Local HDFS storage (we were using S3 on amazon before, and I switched to local storage)
- MySQL database running on the master node
- Xen VM's in both environments (our own Xen for dev, Amazon's for EC2)
- Debian Etch 64-bit OS; 64-bit JVM

Development master node configuration:
- 4x VCPU's (Xeon E5335 2GHz)
- 3GB memory
- 4GB swap

Development slave nodes configuration:
- 3x VCPU's (Xeon E5335 2GHz)
- 2GB memory
- 4GB swap

EC2 Configuration ("Large" instance type):
- 2x VCPU's (Opteron 2GHz)
- 8GB memory
- 4GB swap
- All nodes running in the same availability zone

| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
| depriving some poor village of its idiot since 1981|
Hadoop performance on EC2?
Hey all,

We've got a job that we're running in both a development environment, and out on EC2. I've been rather displeased with the performance on EC2, and was curious if the results that we've been seeing are similar to other people's, or if I've got something misconfigured. ;) In both environments, the load on the master node is around 1-1.5, and the load on the slave nodes is around 8-10. I have also tried cranking up the JVM memory on the EC2 nodes (since we got RAM to blow), with very little performance difference.

Basically, the job takes about 3.5 hours on development, but takes 15 hours on EC2. With the portion that takes all the time, it is not dependent on any external hosts - just the MySQL server on the master node. I benchmarked the VCPU's between our dev and EC2, and they are about equivalent.. I would expect EC2 to take 1.5x as long, since there is one less CPU per slave, but it's taking much longer than that.

Appreciate any tips!

Similarities between the environments:
- 1 master node, 2 slave nodes
- 1 mapper and reducer on the master, 8 mappers and 7 reducers on the slaves
- Hadoop 0.16.2
- Local HDFS storage (we were using S3 on amazon before, and I switched to local storage)
- MySQL database running on the master node
- Xen VM's in both environments (our own Xen for dev, Amazon's for EC2)
- Debian Etch 64-bit OS; 64-bit JVM

Development master node configuration:
- 4x VCPU's (Xeon E5335 2GHz)
- 3GB memory
- 4GB swap

Development slave nodes configuration:
- 3x VCPU's (Xeon E5335 2GHz)
- 2GB memory
- 4GB swap

EC2 Configuration ("Large" instance type):
- 2x VCPU's (Opteron 2GHz)
- 8GB memory
- 4GB swap
- All nodes running in the same availability zone

| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
| depriving some poor village of its idiot since 1981|
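The 1.5x expectation in this message can be worked through with the numbers given. This is a small Python sketch for illustration only: it assumes, as the estimate itself does, that runtime scales inversely with the number of VCPUs per slave, and the constant and function names are invented here.

```python
DEV_HOURS = 3.5        # job time on development, from the message
EC2_HOURS = 15.0       # job time on EC2, from the message
DEV_SLAVE_VCPUS = 3    # VCPUs per development slave
EC2_SLAVE_VCPUS = 2    # VCPUs per EC2 "Large" slave


def expected_slowdown(dev_vcpus, ec2_vcpus):
    """Slowdown expected if runtime scales inversely with VCPUs per slave."""
    return dev_vcpus / ec2_vcpus


def observed_slowdown(dev_hours, ec2_hours):
    """Slowdown actually measured between the two environments."""
    return ec2_hours / dev_hours


if __name__ == "__main__":
    expected = expected_slowdown(DEV_SLAVE_VCPUS, EC2_SLAVE_VCPUS)
    observed = observed_slowdown(DEV_HOURS, EC2_HOURS)
    print(f"expected slowdown: {expected:.2f}x ({DEV_HOURS * expected:.2f} hours)")
    print(f"observed slowdown: {observed:.2f}x ({EC2_HOURS:.2f} hours)")
    print(f"unexplained factor: {observed / expected:.2f}x")
```

Under that assumption the job would be expected to take about 5.25 hours on EC2, so the observed 15 hours leaves a gap of roughly 2.9x beyond what the VCPU count alone would explain.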