Thank you for clarification

-----Original Message-----
From: "Vinayakumar B" <[email protected]>
Sent: 9/11/2014 4:49 PM
To: "[email protected]" <[email protected]>
Subject: Re: HADOOP-1 Regarding dfs.block.size vs mapred.max.split.size
Hi Sandeep,

AFAIK:

1. "dfs.block.size" and "mapred.max.split.size" are logically related: tuning them together gives the best performance when reading big files and preserves data locality.
2. There is no strict rule in the framework for the max split size; you can specify a value larger than the block size.
3. If the split size is larger than the block size, then a single map needs to read multiple blocks. Some of those blocks may be on other nodes, which will increase the I/O duration.
4. As I said before, you will lose the data-locality gain when a map reads multiple blocks located on different nodes.

Regards,
Vinay

On Thu, Sep 11, 2014 at 2:45 PM, sandeep paul <[email protected]> wrote:
> Hi,
>
> I need confirmation regarding these two parameters and how they affect
> performance.
>
> I have read that *mapred.max.split.size* should always be less than
> *dfs.block.size*, but we always have the option of specifying
> *mapred.max.split.size* greater than *dfs.block.size*.
> What will happen in that case: does FileInputFormat allow it when
> calculating splits, or does it take *dfs.block.size* as the split size?
>
> If the framework allows it, then one map task will end up processing
> more than one block (which will not always be on the local machine).
> What is the performance impact in that case?
>
> It would be a great help if anyone could help me get rid of this
> confusion.
>
> Thanks
> sandeep
>
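For reference, the split-size calculation in Hadoop's FileInputFormat (mapreduce API) is commonly described as `max(minSplitSize, min(maxSplitSize, blockSize))`; the real implementation is the Java method `computeSplitSize()` in `org.apache.hadoop.mapreduce.lib.input.FileInputFormat`. A minimal Python sketch of that formula (the constants below are illustrative, not taken from any real cluster):

```python
def compute_split_size(block_size, min_split_size, max_split_size):
    # Sketch of the commonly documented formula:
    #   splitSize = max(minSize, min(maxSize, blockSize))
    return max(min_split_size, min(max_split_size, block_size))

MB = 1024 * 1024

# Case 1: max split size larger than the block size. Under this formula
# the split is still capped at the block size, so each map reads one
# (usually local) block.
print(compute_split_size(128 * MB, 1, 256 * MB))          # 134217728 (128 MB)

# Case 2: min split size larger than the block size. The split then spans
# multiple blocks, so a map may have to read remote blocks and lose the
# data-locality gain described above.
print(compute_split_size(128 * MB, 256 * MB, 2**63 - 1))  # 268435456 (256 MB)
```

Under this reading, simply raising the max split size above the block size does not by itself produce multi-block splits; it is a larger *minimum* split size that forces a map to read more than one block.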
