Given that an m1.small has 1 CPU, 1.7GB of RAM, and 1/8 (or less) of the
IO of the host machine, and factoring in that those machines talk to each
other over the network, I expect it to be much, much slower than your
local machine. Those machines are so under-powered that the overhead of
hadoop/hbase probably overwhelms any gain from the total number of
nodes. Instead do this:

- Replace all your m1.smalls with m1.larges at a ratio of 4:1.
- Don't give ZK its own machines; in such a small environment it
doesn't make much sense (give the peers their own EBS volumes maybe).
- Use an ensemble of only 3 peers.
- Give HBase plenty of RAM, like 4GB (rough config sketch below).
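
Roughly, that advice translates into config along these lines; this is only a
sketch, and the host names (node1, node2, node3) and heap value are
placeholders, not values from this thread:

    # conf/hbase-env.sh -- give the HBase daemons a bigger heap (value is in MB)
    export HBASE_HEAPSIZE=4000

    <!-- conf/hbase-site.xml -- a 3-peer ZK ensemble co-located with the
         region servers -->
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>node1,node2,node3</value>
    </property>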

WRT your mappers, make sure you use scanner pre-fetching: in your job
setup, set hbase.client.scanner.caching to something like 30.
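
For example, something like this in the job setup should turn pre-fetching on
(a minimal sketch; it assumes a Job object named job and the 0.20
org.apache.hadoop.hbase.mapreduce API):

    // Ask the region server for 30 rows per scanner RPC instead of the
    // default of 1; the table input format picks this up from the job conf.
    job.getConfiguration().setInt("hbase.client.scanner.caching", 30);

The same thing can be done per-scan with Scan.setCaching(30) on the Scan
passed to TableMapReduceUtil.initTableMapperJob().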

J-D

On Tue, Dec 15, 2009 at 9:14 AM, Something Something
<[email protected]> wrote:
> Thanks J-D & Motohiko for the tips.  Significant improvement in performance,
> but there's still room for improvement.  In my local pseudo-distributed mode
> the 2 MapReduce jobs now run in less than 4 minutes (down from 32 mins), and on a
> cluster of 10 nodes + 5 ZK nodes they run in 11 minutes (down from 1 hour &
> 30 mins).  But I would still like to get to a point where they run faster
> on the cluster than on my local machine.
>
> Here's what I did:
>
> 1)  Fixed a bug in my code that was causing unnecessary writes to HBase.
> 2)  Added these two lines after creating 'new HTable':
>        table.setAutoFlush(false);
>        table.setWriteBufferSize(1024*1024*12);
> 3)  Added this line after Put:
>        put.setWriteToWAL(false);
> 4)  Added this line (only when running on the cluster):
>    job.setNumReduceTasks(20);
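
Taken together, those client-side write settings look roughly like this; a
sketch only, with made-up table, row and column names, against the 0.20
client API:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedPutExample {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "mytable");
        table.setAutoFlush(false);                  // buffer puts on the client
        table.setWriteBufferSize(1024 * 1024 * 12); // flush roughly every 12MB

        Put put = new Put(Bytes.toBytes("rowkey"));
        put.setWriteToWAL(false);  // skip the WAL: faster, but rows can be lost on a crash
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), Bytes.toBytes("value"));
        table.put(put);

        table.flushCommits();                       // push whatever is still buffered
      }
    }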
>
> There are other 64-bit-related improvements that I cannot try, mainly
> because Amazon charges (way) too much for 64-bit machines.  It costs me over
> $25 for 15 machines for less than 3 hours, so I switched to 'm1.small'
> 32-bit machines.  Of course, one of the promises of distributed
> computing is that we will be able to use "cheap commodity hardware", right
> :)  So I would like to stick with 'm1.small' for now.  (But I am willing to
> use about 30 machines if that's going to help.)
>
> Anyway, I have noticed that one of my Mappers is taking too long.  If anyone
> would share ideas on how to improve Mapper speed, that would be greatly
> appreciated.  Basically, in this Mapper I read about 50,000 rows from an
> HBase table using TableMapReduceUtil.initTableMapperJob() and do some
> complex processing on the "values" of each row.  I don't write anything back to
> HBase, but I do write quite a few lines (context.write()) to HDFS.  Any
> suggestions?
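
For reference, the shape of such a table-scanning job is roughly the
following; a sketch only, where the table name, column family/qualifier and
the mapper's "complex processing" are placeholders rather than the poster's
actual code:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class ScanMapper extends TableMapper<Text, Text> {
      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context)
          throws IOException, InterruptedException {
        // "Complex processing" on the row's values would go here; the output
        // is written straight to HDFS, nothing goes back to HBase.
        byte[] v = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
        if (v != null) {
          context.write(new Text(row.get()), new Text(v));
        }
      }

      public static void setupJob(Job job) throws IOException {
        Scan scan = new Scan();
        scan.setCaching(30);  // scanner pre-fetching, per J-D's suggestion
        TableMapReduceUtil.initTableMapperJob("mytable", scan,
            ScanMapper.class, Text.class, Text.class, job);
      }
    }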
>
> Thanks once again for the help.
>
>
>
> 2009/12/13 <[email protected]>
>
>> Hello,
>>
>> Something Something <[email protected]> wrote:
>> > PS:  One thing I have noticed is that it goes to 66% very fast and then
>> > slows down from there..
>>
>> It seems that only one reducer is working. You should increase the number of
>> reduce tasks. The default number of reduce tasks is documented in
>> hadoop/docs/mapred-default.html.
>> The default value of mapred.reduce.tasks is 1, so only one reduce task
>> runs.
>>
>> There are two ways to increase the number of reduce tasks:
>> 1. Call Job.setNumReduceTasks(int tasks) in your MapReduce job.
>> 2. Set a higher mapred.reduce.tasks in hadoop/conf/mapred-site.xml.
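
Concretely, the two options look like this; 20 is just the example value used
in this thread:

    // Option 1: in the job driver, before submitting the job
    job.setNumReduceTasks(20);

    <!-- Option 2: in hadoop/conf/mapred-site.xml -->
    <property>
      <name>mapred.reduce.tasks</name>
      <value>20</value>
    </property>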
>>
>> You can get the best performance if you run 20 reduce tasks. The details on
>> choosing the number of reduce tasks are at
>> http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Reducer
>> under "How many Reduces?", as J-D wrote. Notice that
>> JobConf.setNumReduceTasks(int) is already deprecated, so you should use
>> Job.setNumReduceTasks(int tasks) rather than JobConf.setNumReduceTasks(int).
>> --
>> Motohiko Mouri
>>
>
