Btw, nothing says that ZK users (incl. hbase) _must_ run a multi-node ZK
ensemble. For coordination tasks a single ZK server (standalone mode) is
often sufficient; you just need to realize that you are sacrificing
reliability/availability.
Going from 1 -> 3 -> 5 -> 7 ZK servers in an ensemble should primarily
be driven by reliability requirements: an ensemble of 2f+1 servers
tolerates f failures, so 3 servers survive 1 failure, 5 survive 2, and
so on. See this page for details on performance studies I've done on
standalone and 3-server ZK ensembles:
http://bit.ly/4ekN8G
Patrick
Jean-Daniel Cryans wrote:
Given that m1.small has 1 CPU, 1.7GB of RAM and 1/8 (or less) of the
host machine's IO, and counting in the fact that those machines are
networked as a whole, I expect it to be much, much slower than your
local machine. Those machines are so under-powered that the overhead of
hadoop/hbase probably overwhelms any gain from the total number of
nodes. Instead do this:
- Replace all your m1.small instances with m1.large at a 4:1 ratio.
- Don't give ZK its own machines; in such a small environment it
doesn't make much sense. (Maybe give it its own EBS volumes.)
- Use an ensemble of only 3 peers.
- Give HBase plenty of RAM, like 4GB.
WRT your mappers, make sure you use scanner pre-fetching. In your job
setup, set hbase.client.scanner.caching to something like 30.
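A minimal sketch of what I mean (assuming the 0.20-era API; the job
name is a placeholder):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.mapreduce.Job;

  HBaseConfiguration conf = new HBaseConfiguration();
  // hbase.client.scanner.caching = rows returned per RPC during a scan;
  // the default of 1 means one round-trip to a region server per row.
  conf.setInt("hbase.client.scanner.caching", 30);
  Job job = new Job(conf, "my-scan-job"); // placeholder job name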
J-D
On Tue, Dec 15, 2009 at 9:14 AM, Something Something
<[email protected]> wrote:
Thanks J-D & Motohiko for the tips. Significant improvement in performance,
but there's still room for more. In my local pseudo-distributed mode
the 2 map reduce jobs now run in less than 4 minutes (down from 32 minutes),
and on the cluster of 10 nodes + 5 ZK nodes they run in 11 minutes (down
from 1 hour & 30 minutes). But I would still like to get to the point where
they run faster on the cluster than on my local machine.
Here's what I did (a combined sketch of 2-4 follows the list):
1) Fixed a bug in my code that was causing unnecessary writes to HBase.
2) Added these two lines after creating 'new HTable':
table.setAutoFlush(false);
table.setWriteBufferSize(1024*1024*12);
3) Added this line after Put:
put.setWriteToWAL(false);
4) Added this line (only when running on cluster):
job.setNumReduceTasks(20);
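Putting 2), 3) and 4) together, here is roughly what that looks like (a
sketch assuming the 0.20-era client API; the table, family, and row
names are placeholders):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  HBaseConfiguration conf = new HBaseConfiguration();
  HTable table = new HTable(conf, "mytable");
  table.setAutoFlush(false);                  // buffer Puts client-side
  table.setWriteBufferSize(1024 * 1024 * 12); // flush roughly every 12MB

  Put put = new Put(Bytes.toBytes("row1"));
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("v"));
  put.setWriteToWAL(false); // skip the write-ahead log: faster, but edits
                            // are lost if a region server crashes
  table.put(put);
  table.flushCommits();     // push any edits still in the write buffer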
There are other 64-bit related improvements which I cannot try, mainly
because Amazon charges (way) too much for 64-bit machines. It costs me over
$25 for 15 machines for less than 3 hours, so I switched to 'm1.small'
32-bit machines. Of course, one of the promises of distributed
computing is that we can use "cheap commodity hardware", right
:) So I would like to stick with 'm1.small' for now. (But I am willing to
use about 30 machines if that's going to help.)
Anyway, I have noticed that one of my Mappers is taking too long. If anyone
could share ideas on how to improve Mapper speed, that would be greatly
appreciated. Basically, in this Mapper I read about 50,000 rows from an
HBase table using TableMapReduceUtil.initTableMapperJob() and do some
complex processing on the "values" of each row. I don't write anything back
to HBase, but I do write quite a few lines (context.write()) to HDFS. Any
suggestions?
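(For reference, the wiring looks roughly like this; the table name,
mapper class, and output types below are placeholders, and it assumes
the 0.20-style TableMapReduceUtil API:)

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;

  HBaseConfiguration conf = new HBaseConfiguration();
  Job job = new Job(conf, "my-mapper-job"); // placeholder job name
  Scan scan = new Scan();
  scan.setCaching(30); // pre-fetch 30 rows per RPC instead of the default 1
  TableMapReduceUtil.initTableMapperJob(
      "mytable",       // placeholder table name
      scan,
      MyMapper.class,  // placeholder TableMapper subclass
      Text.class,      // map output key class
      Text.class,      // map output value class
      job);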
Thanks once again for the help.
2009/12/13 <[email protected]>
Hello,
Something Something <[email protected]> wrote:
PS: One thing I have noticed is that it goes to 66% very fast and then
slows down from there..
It seems that only one reducer is doing the work, so you should increase
the number of reduce tasks. The defaults are documented in
hadoop/docs/mapred-default.html: mapred.reduce.tasks defaults to 1, so
only one reduce task runs.
There are two ways to increase the number of reduce tasks (a sketch of
the first is below):
1. Call Job.setNumReduceTasks(int tasks) in your MapReduce job code.
2. Set a higher mapred.reduce.tasks in hadoop/conf/mapred-site.xml.
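For the first way, a minimal sketch (the job name is a placeholder):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  Job job = new Job(new Configuration(), "my-job"); // placeholder name
  job.setNumReduceTasks(20); // 20 reduce tasks instead of the default 1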
You can get the best performance if you run 20 reduce tasks. The details
on choosing the number of reduce tasks are at
http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Reducer
under "How many Reduces?", as J-D wrote. Notice that
JobConf.setNumReduceTasks(int) is already deprecated, so you should use
Job.setNumReduceTasks(int tasks) rather than JobConf.setNumReduceTasks(int).
--
Motohiko Mouri