> (mapred.min.split.size can only be set to a value larger than the HDFS block size)
>
I haven't tried this with the new mapreduce API, but

-Dmapred.min.split.size=<split_size_you_want> -Dmapred.map.tasks=100000000

I think this would let you set a split size smaller than the HDFS block size :)
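The reason this should work (with the old org.apache.hadoop.mapred.FileInputFormat, at least) is that the split size is computed as max(minSplitSize, min(goalSize, blockSize)), where goalSize = totalInputSize / mapred.map.tasks. A huge mapred.map.tasks pushes goalSize toward zero, so the split size falls back to mapred.min.split.size even when that is smaller than the block size. A rough sketch of that computation (the sizes below are only example values):

public class SplitSizeSketch {
  // Assumed old-API formula: splitSize = max(minSize, min(goalSize, blockSize)).
  static long computeSplitSize(long goalSize, long minSize, long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024;          // 64MB HDFS block
    long minSize   = 32L * 1024 * 1024;          // mapred.min.split.size = 32MB
    long totalSize = 10L * 1024 * 1024 * 1024;   // 10GB of input (example only)

    // Usual case: goalSize is large, so the split size equals the block size.
    System.out.println(computeSplitSize(totalSize / 10, minSize, blockSize));        // 64MB

    // With mapred.map.tasks=100000000, goalSize is tiny, so the split size
    // collapses to mapred.min.split.size, i.e. below the block size.
    System.out.println(computeSplitSize(totalSize / 100000000, minSize, blockSize)); // 32MB
  }
}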

Koji


On 2/17/11 2:32 PM, "Jim Falgout" <jim.falg...@pervasive.com> wrote:

Generally, if you have large files, setting the block size to 128M or larger is 
helpful. You can do that on a per-file basis or set the block size for the 
whole filesystem. The larger block size cuts down on the number of map tasks 
required to cover the overall data size. I've also experimented with 
mapred.min.split.size and have usually found that the larger the split 
size, the better the overall run time. Of course there is a cutoff point, 
especially on a very large cluster, where larger split sizes will hurt overall 
scalability.
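
For example, the per-file block size can be picked when the file is written; a minimal sketch (the path, replication factor and sizes are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    long blockSize    = 128L * 1024 * 1024;                    // 128MB blocks for this file only
    short replication = 3;
    int bufferSize    = conf.getInt("io.file.buffer.size", 4096);

    // create(path, overwrite, bufferSize, replication, blockSize)
    FSDataOutputStream out = fs.create(new Path("/data/input/part-00000"),
                                       true, bufferSize, replication, blockSize);
    out.writeBytes("example record\n");
    out.close();
  }
}

(Something similar should also be possible from the shell by passing -D dfs.block.size=... to hadoop fs -put, though I haven't double-checked the property name on newer releases.)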

On tests I've run on 10- and 20-node clusters, though, setting the split size as 
high as 1GB has allowed the overall Hadoop jobs to run faster, sometimes quite a 
bit faster. You lose some locality, but it seems to be a worthwhile trade-off 
against the number of files that have to be shuffled for the reduce step.

-----Original Message-----
From: Boduo Li [mailto:birdeey...@gmail.com]
Sent: Thursday, February 17, 2011 12:01 PM
To: common-user@hadoop.apache.org
Subject: HDFS block size v.s. mapred.min.split.size

Hi,

I'm currently benchmarking Hadoop. I know two ways to control the input data 
size for each map task: by changing the HDFS block size (which requires reloading 
the data into HDFS), or by setting mapred.min.split.size.
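
(For reference, the second option is just a job-level setting; something like this in an old-API driver, where the 64MB value is only an example:)

import org.apache.hadoop.mapred.JobConf;

public class SplitConfigSnippet {
  // Ask the old-API FileInputFormat for splits of at least 64MB; the effective
  // split size is still max(minSize, min(goalSize, blockSize)).
  public static void configureSplits(JobConf conf) {
    conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);
  }
}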

For my benchmarking task, I need to change the input size for a map task 
frequently. Changing the HDFS block size and reloading the data is really painful,
but using mapred.min.split.size seems to be problematic. I ran a simple test 
to check whether Hadoop has similar performance in the following two cases:

(1) HDFS block size = 32MB, mapred.min.split.size = 64MB (mapred.min.split.size 
can only be set to a value larger than the HDFS block size)

(2) HDFS block size = 64MB, mapred.min.split.size is not set

I ran the same job under both settings. Setting (1) took 1374s to finish;
setting (2) took 1412s.

I do understand that, with a smaller HDFS block size, the I/O is more random.
But the 38-second difference seems too large to be explained by more random I/O on the input data.
Does anyone have any insight into this? Or does anyone know a better way to control 
the input size of each map task?

Thanks.

