Thank you for clarification

-----Original Message-----
From: "Vinayakumar B" <[email protected]>
Sent: 9/11/2014 4:49 PM
To: "[email protected]" <[email protected]>
Subject: Re: HADOOP-1 Regarding dfs.block.size vs mapred.max.split.size
Hi Sandeep,

AFAIK:

1. "dfs.block.size" and "mapred.max.split.size" are logically related: tuning them together gives the best performance when reading big files and preserves data locality.
2. There is no strict rule in the framework for the max split size; you can specify a value larger than the block size.
3. If the split size is larger than the block size, then a single map needs to read multiple blocks. Some of those blocks may be on other nodes, which will increase the I/O duration.
4. As I said before, you will lose the data-locality gain when a map reads multiple blocks located on different nodes.

Regards,
Vinay

On Thu, Sep 11, 2014 at 2:45 PM, sandeep paul <[email protected]> wrote:
> Hi,
>
> I need confirmation regarding these two parameters and how they affect
> performance.
>
> I have read that *mapred.max.split.size* should always be less than
> *dfs.block.size*, but we always have the option of specifying
> *mapred.max.split.size* greater than *dfs.block.size*.
> What will happen in that case: does FileInputFormat allow it when
> calculating splits, or does it take *dfs.block.size* as the split size?
>
> If the framework allows it, then one map task will end up processing
> more than one block (which will not always be on the local machine).
> What is the performance impact in that case?
>
> It would be a great help if anyone could help me get rid of this
> confusion.
>
> Thanks
> sandeep
>
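For reference, the split-size calculation in Hadoop's FileInputFormat (mapreduce API) is commonly described as `max(minSplitSize, min(maxSplitSize, blockSize))`; the real implementation is the Java method `computeSplitSize()` in `org.apache.hadoop.mapreduce.lib.input.FileInputFormat`. A minimal Python sketch of that formula (the constants below are illustrative, not taken from any real cluster):

```python
def compute_split_size(block_size, min_split_size, max_split_size):
    # Sketch of the commonly documented formula:
    #   splitSize = max(minSize, min(maxSize, blockSize))
    return max(min_split_size, min(max_split_size, block_size))

MB = 1024 * 1024

# Case 1: max split size larger than the block size. Under this formula
# the split is still capped at the block size, so each map reads one
# (usually local) block.
print(compute_split_size(128 * MB, 1, 256 * MB))          # 134217728 (128 MB)

# Case 2: min split size larger than the block size. The split then spans
# multiple blocks, so a map may have to read remote blocks and lose the
# data-locality gain described above.
print(compute_split_size(128 * MB, 256 * MB, 2**63 - 1))  # 268435456 (256 MB)
```

Under this reading, simply raising the max split size above the block size does not by itself produce multi-block splits; it is a larger *minimum* split size that forces a map to read more than one block.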
