I asked my questions in the last email. Here are some of my observations:
-- 24 map tasks is the total capacity of my cluster. But when I specified 24
maps, only 18 map tasks were launched. When I specified 32 maps, 24 map tasks
were launched. When I specified 12 maps, 10 map tasks were launched. According
to the documentation, the number of map tasks specified by the application is
only a hint to the framework. My question is: how is the actual number derived
from this hint?
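
For reference, this is roughly how the hint gets set in my driver (a simplified
sketch using the old org.apache.hadoop.mapred API; the class name and input
path are placeholders, not my real ones):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class CsvLoadDriver {                                  // placeholder name
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CsvLoadDriver.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // the CSV directory on HDFS
        conf.setNumMapTasks(24);     // only a hint; the real count comes from the input splits
        conf.setNumReduceTasks(4);
        JobClient.runJob(conf);      // submit and block until the job finishes
      }
    }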

-- I noticed the following task summaries in the log:
Task A:
File Systems
HDFS bytes read 152,379,850
Local bytes read 559,188,620
Local bytes written 1,118,100,276
Task B:
File Systems
HDFS bytes read 8,725
Local bytes written 31,316

Does this mean that Task B only read from HDFS, while Task A read a
significant amount of data from its local disk (hence better performance)?
Task A also read quite a few bytes from HDFS; is reducing that part as much
as possible a good direction for performance tuning?
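
As a side note, the per-task numbers above come from the task pages in the web
UI / job log. A rough sketch of pulling the job-level aggregates
programmatically would be the following (assuming the display names from my
log, "File Systems", "HDFS bytes read", etc., are also the lookup names the
API expects, which I have not verified):

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.RunningJob;

    public class CounterDump {                                     // placeholder helper
      // Print the job-level totals of the two counters compared above.
      static void dump(RunningJob job) throws java.io.IOException {
        Counters c = job.getCounters();
        Counters.Group fs = c.getGroup("File Systems");            // group name as shown in the log
        System.out.println("HDFS bytes read:  " + fs.getCounter("HDFS bytes read"));
        System.out.println("Local bytes read: " + fs.getCounter("Local bytes read"));
      }
    }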

-- I also noticed the following summaries:
Task C:
Map-Reduce Framework
Combine output records 0
Map input records 111
Map output bytes 30,791
Map input bytes 533
Combine input records 0
Map output records 111
Task D:
Map-Reduce Framework
Combine output records 0
Map input records 1,391,167
Map output bytes 554,233,388
Map input bytes 152,371,658
Combine input records 0
Map output records 1,391,145

It seems Task D is much heavier than Task C. I have 10 map tasks; 8 of them
are similar to Task D (heavy) and 2 of them are similar to Task C (very
light). Why did this happen? Can I control it?

-- I tried different numbers of reduce tasks. This is important since the
time spent in the map phase is very small (10 to 15 minutes) while the reduce
phase takes the bulk of the time (5 to 8 hours).
When I specified 4 reducers (the same number as my region servers), I got the
best throughput (a little under 4 hours). When I specified 6 or 8 reducers, I
got a much worse result (6 to 8 hours). The question is: should the number of
reduce tasks be exactly the same as the number of region servers?
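
For completeness, a sketch of how the reducer count can be set (continuing the
JobConf from the driver sketch above; unlike the map count, the reducer count
is not a hint, the framework uses it exactly):

    conf.setNumReduceTasks(4);   // 4, 6 and 8 in the runs described above

    // Equivalent from the command line, if the driver goes through ToolRunner
    // (jar name, class name, and path are placeholders):
    //   hadoop jar csvload.jar CsvLoadDriver -D mapred.reduce.tasks=4 /csv/input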

Regards,
Yan


2009/4/6 Liu Yan <[email protected]>

> hi,
> I have a 4-node cluster, with the following configuration:
>
> 1) master: 7.5G memory, dual-core CPU, running Hadoop NN/DN/TT/JT, HBase
> Master and HBase Region Server
> 2) 2 slaves: 1.7G memory, single-core CPU, running Hadoop DN/TT, and HBase
> Region Server
>
> All the DNs on the slaves are at about 66% usage, while the DN on the master
> is at about 36% usage.
>
> mapred.tasktracker.map.tasks.maximum: 12 (master), 4 (slaves)
> mapred.tasktracker.reduce.tasks.maximum: 12 (master), 4 (slaves)
>
> I am doing this job: I read a bunch of CSV files (hundreds) recursively
> from a specified directory on HDFS and parse each file line by line. The
> first line of each file is a "column list" for that particular file. My map
> task parses the files line by line, and my reduce task writes the parsed
> result into HBase. The total file size is about 2.6 GB.
>
> CSV ==> <NamedRowOffset, Text> == (map) ==>
> <ImmutableBytesWritable, HbaseMapWritable<byte[], byte[]>> == (reduce) ==>
> <ImmutableBytesWritable, BatchUpdate>
>
> Note: NamedRowOffset is a custom class that lets us know the current file
> name, column names, etc.
>
> I tried different numbers of map tasks and reduce tasks, and the total
> throughput differs. I am trying to answer:
>
> 1) What are the best numbers of map and reduce tasks for my particular
> scenario?
> 2) Besides the numbers of map and reduce tasks, do any other parameters
> matter?
> 3) What is the common approach to observing and fine-tuning these parameters
> (considering both Hadoop and HBase)?
>
> Regards,
> Yan
>
>
