hi,

I have a 4-node cluster with the following configuration:

1) master: 7.5G memory, dual-core CPU, running Hadoop NN/DN/TT/JT, the HBase Master, and an HBase Region Server
2) 2 slaves: 1.7G memory, single-core CPU each, running Hadoop DN/TT and an HBase Region Server
All the DNs on the slaves are at about 66% usage, while the DN on the master is at about 36% usage.

mapred.tasktracker.map.tasks.maximum: 12 (master), 4 (slaves)
mapred.tasktracker.reduce.tasks.maximum: 12 (master), 4 (slaves)

This is the job I am running: I read a bunch of CSV files (hundreds) recursively from a specified directory on HDFS and parse each file line by line. The first line of each file is a "column list" for that particular file. My map task parses the files line by line, and my reduce task writes the parsed results into HBase. The total file size is about 2.6GB.

CSV ==> <NamedRowOffset, Text> == (map) ==> <ImmutableBytesWritable, HbaseMapWritable<byte[], byte[]>> == (reduce) ==> <ImmutableBytesWritable, BatchUpdate>

Note: NamedRowOffset is a custom class so we can know the current file name, column names, etc. (A simplified sketch of the map/reduce classes is included below my signature.)

I tried different numbers of map and reduce tasks, and the total throughput differs. I am trying to answer:

1) What are the best numbers of map and reduce tasks for my particular scenario?
2) Besides the number of map and reduce tasks, do any other parameters matter?
3) What is the common approach to observing and fine-tuning these parameters (considering both Hadoop and HBase)?

Regards,
Yan
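
P.S. In case it helps, here is a simplified sketch of what the map and reduce classes look like. This is not the actual code: the NamedRowOffset stand-in (its fields and accessors), the "data:" column family, and the row key format are illustrative guesses; only the HBase types named above (ImmutableBytesWritable, HbaseMapWritable, BatchUpdate) and the old org.apache.hadoop.hbase.mapred TableReduce interface are the real ones in use.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.HbaseMapWritable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapred.TableReduce;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CsvToHBase {

  // Simplified stand-in for my custom key class; it carries the file name,
  // the column names from the file's first line, and the line offset.
  // (Field layout and accessors here are illustrative, not the real class.)
  public static class NamedRowOffset implements Writable {
    private String fileName = "";
    private String[] columnNames = new String[0];
    private long offset;

    public String getFileName() { return fileName; }
    public String[] getColumnNames() { return columnNames; }
    public long getOffset() { return offset; }

    public void write(DataOutput out) throws IOException {
      out.writeUTF(fileName);
      out.writeLong(offset);
      out.writeInt(columnNames.length);
      for (String c : columnNames) out.writeUTF(c);
    }

    public void readFields(DataInput in) throws IOException {
      fileName = in.readUTF();
      offset = in.readLong();
      columnNames = new String[in.readInt()];
      for (int i = 0; i < columnNames.length; i++) columnNames[i] = in.readUTF();
    }
  }

  // Map: parse one CSV line into <row key, column/value cells>.
  public static class CsvLineMapper extends MapReduceBase implements
      Mapper<NamedRowOffset, Text, ImmutableBytesWritable, HbaseMapWritable<byte[], byte[]>> {
    public void map(NamedRowOffset key, Text line,
        OutputCollector<ImmutableBytesWritable, HbaseMapWritable<byte[], byte[]>> out,
        Reporter reporter) throws IOException {
      String[] columns = key.getColumnNames();
      String[] values = line.toString().split(",");
      HbaseMapWritable<byte[], byte[]> cells = new HbaseMapWritable<byte[], byte[]>();
      for (int i = 0; i < columns.length && i < values.length; i++) {
        // "data:" is just a placeholder column family name.
        cells.put(Bytes.toBytes("data:" + columns[i]), Bytes.toBytes(values[i]));
      }
      // Row key built from file name + offset so rows are unique per file/line.
      String row = key.getFileName() + ":" + key.getOffset();
      out.collect(new ImmutableBytesWritable(Bytes.toBytes(row)), cells);
    }
  }

  // Reduce: fold the cells for one row into a single BatchUpdate for HBase.
  public static class HBaseWriteReducer extends MapReduceBase implements
      TableReduce<ImmutableBytesWritable, HbaseMapWritable<byte[], byte[]>> {
    public void reduce(ImmutableBytesWritable row,
        Iterator<HbaseMapWritable<byte[], byte[]>> values,
        OutputCollector<ImmutableBytesWritable, BatchUpdate> out,
        Reporter reporter) throws IOException {
      BatchUpdate update = new BatchUpdate(row.get());
      while (values.hasNext()) {
        for (Map.Entry<byte[], byte[]> e : values.next().entrySet()) {
          update.put(e.getKey(), e.getValue());
        }
      }
      out.collect(row, update);
    }
  }
}

(The reducer implements TableReduce so its <ImmutableBytesWritable, BatchUpdate> output can be written to the table through the old mapred-style HBase output format; the job wiring itself is omitted here.)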
