I had my questions in the last email. Here are some of my observations:

-- 24 map tasks is the total capacity of my cluster. But when I specified 24
maps, it only launched 18 map tasks; when I specified 32 maps, it launched 24;
and when I specified 12 maps, it launched 10. According to the documentation,
the number of map tasks specified by the application is only a hint to the
framework. My question is: how is the actual number derived from this hint?
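(My hedged understanding of the mechanics: the hint is handed to the job's
InputFormat, whose getSplits() decides the real task count, one map per split;
with a file-based InputFormat the split boundaries follow files and blocks, so
the launched number can differ from the hint. A minimal sketch to check what
your InputFormat produces, assuming the old org.apache.hadoop.mapred API;
SplitCount and the hardcoded hint of 24 are mine, not from your job:)

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;

    public class SplitCount {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SplitCount.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        conf.setNumMapTasks(24);  // the hint; the framework only forwards it
        // JobClient does essentially this: one map task is launched per
        // split, so the InputFormat, not the hint, fixes the actual number.
        InputSplit[] splits =
            conf.getInputFormat().getSplits(conf, conf.getNumMapTasks());
        System.out.println("actual map tasks = " + splits.length);
      }
    }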
-- I noticed the following task summaries in the log:

Task A: File Systems
    HDFS bytes read        152,379,850
    Local bytes read       559,188,620
    Local bytes written  1,118,100,276

Task B: File Systems
    HDFS bytes read     8,725
    Local bytes written 31,316

Does this mean that Task B only read from HDFS, while Task A read a
significant amount of data from its local disk (hence better performance)?
Task A also read many bytes from HDFS; is reducing that part as much as
possible a direction for performance enhancement?

-- I also noticed the following summaries:

Task C: Map-Reduce Framework
    Combine output records 0
    Map input records      111
    Map output bytes       30,791
    Map input bytes        533
    Combine input records  0
    Map output records     111

Task D: Map-Reduce Framework
    Combine output records 0
    Map input records      1,391,167
    Map output bytes       554,233,388
    Map input bytes        152,371,658
    Combine input records  0
    Map output records     1,391,145

It seems Task D is a much heavier one than Task C. Of my 10 map tasks, 8 are
heavy ones like Task D and 2 are very light ones like Task C. Why did this
happen? Can I control it?

-- I tried different numbers of reduce tasks. This is important since the
time spent on the map phase is small (10 to 15 minutes) while the reduce
phase is the major part (5 to 8 hours). When I specified 4 reducers (the same
number as my region servers), I got the best throughput (a little under 4
hours). When I specified 6 or 8 reducers, I got a much worse result (6 to 8
hours). Should the number of reduce tasks be set exactly equal to the number
of region servers? (See the sketch after the quoted message below.)

Regards,
Yan

2009/4/6 Liu Yan <[email protected]>

> hi,
>
> I have a 4-node cluster, with the following configuration:
>
> 1) master: 7.5G memory, dual-core CPU, running Hadoop NN/DN/TT/JT, HBase
> Master and HBase Region Server
> 2) 2 slaves: 1.7G memory, single-core CPU, running Hadoop DN/TT, and HBase
> Region Server
>
> All the DNs on the slaves are at about 66% usage, while the DN on the
> master is at about 36% usage.
>
> mapred.tasktracker.map.tasks.maximum: 12 (master), 4 (slaves)
> mapred.tasktracker.reduce.tasks.maximum: 12 (master), 4 (slaves)
>
> I am doing this job: I read a bunch of CSV files (hundreds) recursively
> from a specified directory on HDFS and parse each file line by line. The
> first line of each file is a "column list" for that particular file. My
> map task parses the files line by line, and my reduce task writes the
> parsed results into HBase. The total file size is about 2.6GB.
>
> CSV ==> <NamedRowOffset, Text> == (map) ==>
> <ImmutableBytesWritable, HbaseMapWritable<byte[], byte[]>> == (reduce) ==>
> <ImmutableBytesWritable, BatchUpdate>
>
> Note: NamedRowOffset is a custom class so we can know the current file
> name, column names, etc.
>
> I tried different numbers of map tasks and reduce tasks, and the total
> throughput differed. I am trying to answer:
>
> 1) What are the best numbers of map and reduce tasks in my particular
> scenario?
> 2) Besides the numbers of map and reduce tasks, do any other parameters
> matter?
> 3) What is the common approach to observing and fine-tuning the parameters
> (considering both Hadoop and HBase)?
>
> Regards,
> Yan
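(A hedged sketch for the reducer-count question above, assuming the old
org.apache.hadoop.mapred API; ReduceTuning is a placeholder class name and
MyReduce a placeholder reducer. Unlike the map hint, the reduce count you set
is exact. A combiner is the usual lever against large map output such as the
~554 MB in Task D, but it only applies if the reduce logic is associative and
commutative, and a reduce that just writes BatchUpdates to HBase probably is
not:)

    import org.apache.hadoop.mapred.JobConf;

    public class ReduceTuning {
      public static JobConf configure() {
        JobConf conf = new JobConf(ReduceTuning.class);
        // Unlike setNumMapTasks(), this is not a hint: exactly 4 reduce
        // tasks will run, matching the number of region servers.
        conf.setNumReduceTasks(4);
        // A combiner would shrink the map output before the shuffle, but
        // only if the reduce logic can be applied to partial groups:
        // conf.setCombinerClass(MyReduce.class);
        return conf;
      }
    }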

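(For question 3 in the quoted message, one way to observe the counters above
programmatically rather than through the web UI, again assuming the old
mapred API; CounterDump is a placeholder, and this is a sketch rather than
the standard recipe:)

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class CounterDump {
      public static void dump(JobConf conf) throws Exception {
        RunningJob job = JobClient.runJob(conf);  // blocks until the job ends
        Counters counters = job.getCounters();
        // Print every counter group, e.g. "File Systems" and
        // "Map-Reduce Framework" as quoted above.
        for (Counters.Group group : counters) {
          System.out.println(group.getDisplayName());
          for (Counters.Counter c : group) {
            System.out.println("    " + c.getDisplayName()
                + " = " + c.getCounter());
          }
        }
      }
    }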