Given that an m1.small has 1 CPU, 1.7GB of RAM, and 1/8 (or less) of the host machine's I/O, and factoring in that those machines are networked as a whole, I expect it to be much, much slower than your local machine. Those machines are so under-powered that the overhead of Hadoop/HBase probably overwhelms any gain from the total number of nodes. Instead do this:
- Replace all your m1.smalls with m1.larges at a 4:1 ratio.
- Don't give ZK its own machines; in such a small environment it doesn't
make much sense. (Give them their own EBS volumes maybe.)
- Use an ensemble of only 3 peers.
- Give HBase plenty of RAM, like 4GB.

WRT your mappers, make sure you use scanner pre-fetching. In your job
setup, set hbase.client.scanner.caching to something like 30 (a sketch is
appended at the end of this message).

J-D

On Tue, Dec 15, 2009 at 9:14 AM, Something Something <[email protected]> wrote:
> Thanks J-D & Motohiko for the tips. Significant improvement in performance,
> but there's still room for improvement. In my local pseudo-distributed mode
> the 2 MapReduce jobs now run in less than 4 minutes (down from 32 mins), and
> on the cluster of 10 nodes + 5 ZK nodes they run in 11 minutes (down from
> 1 hour & 30 mins). But I would still like to get to a point where they run
> faster on the cluster than on my local machine.
>
> Here's what I did:
>
> 1) Fixed a bug in my code that was causing unnecessary writes to HBase.
> 2) Added these two lines after creating 'new HTable':
>        table.setAutoFlush(false);
>        table.setWriteBufferSize(1024*1024*12);
> 3) Added this line after Put:
>        put.setWriteToWAL(false);
> 4) Added this line (only when running on the cluster):
>        job.setNumReduceTasks(20);
>
> There are other 64-bit related improvements which I cannot try, mainly
> because Amazon charges (way) too much for 64-bit machines. It costs me over
> $25 for 15 machines for less than 3 hours, so I switched to 'm1.small'
> 32-bit machines. Of course, one of the promises of distributed
> computing is that we will be able to use "cheap commodity hardware", right
> :) So I would like to stick with 'm1.small' for now. (But I am willing to
> use about 30 machines if that's going to help.)
>
> Anyway, I have noticed that one of my Mappers is taking too long. If anyone
> would share ideas on how to improve Mapper speed, that would be greatly
> appreciated. Basically, in this Mapper I read about 50,000 rows from an
> HBase table using TableMapReduceUtil.initTableMapperJob() and do some
> complex processing on the "values" of each row. I don't write anything back
> to HBase, but I do write quite a few lines (context.write()) to HDFS. Any
> suggestions?
>
> Thanks once again for the help.
>
>
>
> 2009/12/13 <[email protected]>
>
>> Hello,
>>
>> Something Something <[email protected]> wrote:
>> > PS: One thing I have noticed is that it goes to 66% very fast and then
>> > slows down from there..
>>
>> It seems that only one reducer is working. You should increase the number
>> of reduce tasks. The default number of reduce tasks is documented in
>> hadoop/docs/mapred-default.html.
>> The default value of mapred.reduce.tasks is 1, so only one reduce task
>> runs.
>>
>> There are two ways to increase the number of reduce tasks:
>> 1. Use Job.setNumReduceTasks(int tasks) in your MapReduce job setup.
>> 2. Set a higher mapred.reduce.tasks in hadoop/conf/mapred-site.xml.
>>
>> You can get the best performance if you run 20 reduce tasks. The details
>> on the number of reduce tasks are at
>> http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Reducer
>> under "How many Reduces?", as J-D wrote. Notice that
>> JobConf.setNumReduceTasks(int) is already deprecated, so you should use
>> Job.setNumReduceTasks(int tasks) rather than JobConf.setNumReduceTasks(int).
>> --
>> Motohiko Mouri
>>
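
To make the scanner pre-fetching concrete, here is a minimal sketch of a table-scan mapper job against the 0.20-era HBase/Hadoop API. The table name, mapper body, and output path are placeholders rather than the original poster's code; Scan.setCaching(30) is the programmatic way to get the hbase.client.scanner.caching behaviour for this particular scan.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ScanCachingJob {

  // Placeholder mapper: reads rows from HBase and writes lines to HDFS.
  static class RowMapper extends TableMapper<Text, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result values, Context context)
        throws IOException, InterruptedException {
      // ... complex per-row processing would go here ...
      context.write(new Text(Bytes.toString(row.get())), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "row-processing");
    job.setJarByClass(ScanCachingJob.class);

    // Scanner pre-fetching: ask the region server for 30 rows per RPC
    // instead of the default 1, cutting round trips during the scan.
    Scan scan = new Scan();
    scan.setCaching(30);

    TableMapReduceUtil.initTableMapperJob(
        "mytable",            // source table (placeholder name)
        scan,
        RowMapper.class,
        Text.class,           // mapper output key class
        NullWritable.class,   // mapper output value class
        job);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(20);   // see the reducer discussion quoted above
    FileOutputFormat.setOutputPath(job, new Path("/tmp/row-processing-out"));
    job.waitForCompletion(true);
  }
}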
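
Similarly, here is a consolidated sketch of the client-side write tuning quoted above (disabling autoFlush, enlarging the write buffer, skipping the WAL), again with made-up table and column names. setWriteToWAL(false) trades durability for speed, and with autoFlush off the buffered puts have to be flushed explicitly.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWrites {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");

    // Buffer puts on the client instead of doing one RPC per Put.
    table.setAutoFlush(false);
    table.setWriteBufferSize(1024 * 1024 * 12);  // 12MB client write buffer

    for (int i = 0; i < 50000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(i));
      // Skip the write-ahead log: faster, but edits can be lost if a
      // region server dies before its memstore is flushed.
      put.setWriteToWAL(false);
      table.put(put);
    }

    // With autoFlush off, any puts still sitting in the buffer must be
    // pushed out explicitly before the client exits.
    table.flushCommits();
  }
}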
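
Finally, a small sketch of the two routes Motohiko describes for raising the reduce-task count, shown side by side; in practice you would pick one, and 20 is just the value suggested in this thread (mapred.reduce.tasks can equally be set cluster-wide in mapred-site.xml).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCount {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    // Route 2: set the property programmatically (the same key that
    // mapred-site.xml would carry if set cluster-wide).
    conf.setInt("mapred.reduce.tasks", 20);

    Job job = new Job(conf, "my-job");
    // Route 1: the per-job API on the new (org.apache.hadoop.mapreduce)
    // Job class, preferred over the deprecated JobConf.setNumReduceTasks.
    job.setNumReduceTasks(20);
    return job;
  }
}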
