hi,

I have a 4-node cluster with the following configuration:

1) master: 7.5G memory, dual-core CPU, running Hadoop NN/DN/TT/JT, the HBase Master, and an HBase Region Server
2) 2 slaves: 1.7G memory, single-core CPU each, running Hadoop DN/TT and an HBase Region Server
All the DNs on the slaves are at about 66% usage, while the DN on the master is at about 36% usage.

mapred.tasktracker.map.tasks.maximum: 12 (master), 4 (slaves)
mapred.tasktracker.reduce.tasks.maximum: 12 (master), 4 (slaves)

This is the job I am running: I read a bunch of CSV files (hundreds) recursively from a specified directory on HDFS and parse each file line by line. The first line of each file is a "column list" for that particular file. My map task parses the files line by line, and my reduce task writes the parsed results into HBase. The total file size is about 2.6GB.

CSV ==> <NamedRowOffset, Text> == (map) ==> <ImmutableBytesWritable, HbaseMapWritable<byte[], byte[]>> == (reduce) ==> <ImmutableBytesWritable, BatchUpdate>

Note: NamedRowOffset is a custom class so we can know the current file name, column names, etc. (A simplified sketch of the map/reduce classes is included below my signature.)

I tried different numbers of map and reduce tasks, and the total throughput differs. I am trying to answer:

1) What are the best numbers of map and reduce tasks for my particular scenario?
2) Besides the number of map and reduce tasks, do any other parameters matter?
3) What is the common approach to observing and fine-tuning these parameters (considering both Hadoop and HBase)?

Regards,
Yan
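
P.S. In case it helps, here is a simplified sketch of what the map and reduce classes look like. This is not the actual code: the NamedRowOffset stand-in (its fields and accessors), the "data:" column family, and the row key format are illustrative guesses; only the HBase types named above (ImmutableBytesWritable, HbaseMapWritable, BatchUpdate) and the old org.apache.hadoop.hbase.mapred TableReduce interface are the real ones in use.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.HbaseMapWritable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapred.TableReduce;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CsvToHBase {

  // Simplified stand-in for my custom key class; it carries the file name,
  // the column names from the file's first line, and the line offset.
  // (Field layout and accessors here are illustrative, not the real class.)
  public static class NamedRowOffset implements Writable {
    private String fileName = "";
    private String[] columnNames = new String[0];
    private long offset;

    public String getFileName() { return fileName; }
    public String[] getColumnNames() { return columnNames; }
    public long getOffset() { return offset; }

    public void write(DataOutput out) throws IOException {
      out.writeUTF(fileName);
      out.writeLong(offset);
      out.writeInt(columnNames.length);
      for (String c : columnNames) out.writeUTF(c);
    }

    public void readFields(DataInput in) throws IOException {
      fileName = in.readUTF();
      offset = in.readLong();
      columnNames = new String[in.readInt()];
      for (int i = 0; i < columnNames.length; i++) columnNames[i] = in.readUTF();
    }
  }

  // Map: parse one CSV line into <row key, column/value cells>.
  public static class CsvLineMapper extends MapReduceBase implements
      Mapper<NamedRowOffset, Text, ImmutableBytesWritable, HbaseMapWritable<byte[], byte[]>> {
    public void map(NamedRowOffset key, Text line,
        OutputCollector<ImmutableBytesWritable, HbaseMapWritable<byte[], byte[]>> out,
        Reporter reporter) throws IOException {
      String[] columns = key.getColumnNames();
      String[] values = line.toString().split(",");
      HbaseMapWritable<byte[], byte[]> cells = new HbaseMapWritable<byte[], byte[]>();
      for (int i = 0; i < columns.length && i < values.length; i++) {
        // "data:" is just a placeholder column family name.
        cells.put(Bytes.toBytes("data:" + columns[i]), Bytes.toBytes(values[i]));
      }
      // Row key built from file name + offset so rows are unique per file/line.
      String row = key.getFileName() + ":" + key.getOffset();
      out.collect(new ImmutableBytesWritable(Bytes.toBytes(row)), cells);
    }
  }

  // Reduce: fold the cells for one row into a single BatchUpdate for HBase.
  public static class HBaseWriteReducer extends MapReduceBase implements
      TableReduce<ImmutableBytesWritable, HbaseMapWritable<byte[], byte[]>> {
    public void reduce(ImmutableBytesWritable row,
        Iterator<HbaseMapWritable<byte[], byte[]>> values,
        OutputCollector<ImmutableBytesWritable, BatchUpdate> out,
        Reporter reporter) throws IOException {
      BatchUpdate update = new BatchUpdate(row.get());
      while (values.hasNext()) {
        for (Map.Entry<byte[], byte[]> e : values.next().entrySet()) {
          update.put(e.getKey(), e.getValue());
        }
      }
      out.collect(row, update);
    }
  }
}

(The reducer implements TableReduce so its <ImmutableBytesWritable, BatchUpdate> output can be written to the table through the old mapred-style HBase output format; the job wiring itself is omitted here.)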
