[ 
https://issues.apache.org/jira/browse/HADOOP-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292644#comment-13292644
 ] 

Yang Yang commented on HADOOP-8503:
-----------------------------------

Harsh:

This is an issue in Pig, which uses the same config for every job in a 
single Pig script (one Pig script normally translates into several MR jobs).


Let's say that for the first Pig stage I have a huge input file (10 GB). By 
default Hadoop launches about 10 GB / 128 MB = 80 mappers.

If I have 400 mapper slots, I want to launch 400 mappers. With the old 
InputFormat code I could set min.split.size=25MB; with the new code I could 
also set max.split.size=25MB. Both would work fine.


But the next stage in the Pig script takes an input of 100 GB. With a 25 MB 
split size it is going to generate 4000 mappers, which is far too many for my 
400 slots.
In the old code I could set "mapred.map.tasks=400" to control the upper limit 
on the number of map tasks, or the lower limit on the split size (which takes 
effect in the Math.max() in computeSplitSize()), so I could still maintain 
400 mappers. The new code leads to 4000 mappers, which no longer makes 
sense.
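
To make the arithmetic above concrete, here is a rough sketch that plugs the 
two formulas into a split counter. It is not the Hadoop source: the class and 
method names are made up for illustration, a single input file and a 128 MB 
block size are assumed, minSize is assumed to be 1 byte, and the 1.1 "slop" 
on the last split is modelled in a simplified way.

```java
public class SplitCountSketch {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    // Old-API style: goalSize is derived from the requested number of map tasks
    // (mapred.map.tasks), then bounded by blockSize and minSize.
    static long oldSplitSize(long totalSize, int numSplitsHint, long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(1, numSplitsHint);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // New-API style: no goal size, only blockSize bounded by minSize and maxSize.
    static long newSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Simplified split counter for one file: cut full splits while more than
    // 1.1 splits' worth of data remains, then put the rest into a final split.
    static long countSplits(long fileSize, long splitSize) {
        long remaining = fileSize;
        long count = 0;
        while ((double) remaining / splitSize > 1.1) {
            remaining -= splitSize;
            count++;
        }
        return remaining > 0 ? count + 1 : count;
    }

    public static void main(String[] args) {
        long block = 128 * MB; // assumed block size
        for (long total : new long[] {10 * GB, 100 * GB}) {
            // Old code with mapred.map.tasks=400 and a 1-byte minSize (assumption).
            long oldCount = countSplits(total, oldSplitSize(total, 400, 1, block));
            // New code with max.split.size=25MB and a 1-byte minSize (assumption).
            long newCount = countSplits(total, newSplitSize(block, 1, 25 * MB));
            System.out.println(total / GB + " GB: old=" + oldCount + " new=" + newCount);
        }
    }
}
```

Under these assumptions the sketch gives 400 (old) and 410 (new) splits for 
the 10 GB stage, and 800 (old) and 4096 (new) for the 100 GB stage. The old 
formula stops at block-sized splits for the larger input because goalSize is 
capped by blockSize, while the fixed 25 MB maximum in the new formula scales 
the split count with the input size.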

> logic difference between old mapred.FileInputFormat and 
> mapreduce.lib.input.FileInputFormat
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-8503
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8503
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>            Reporter: Yang Yang
>            Priority: Minor
>
> In the old mapred.FileInputFormat.getSplits(JobConf, int):
>         long splitSize = computeSplitSize(goalSize, minSize, blockSize);
> so we could control splitSize through goalSize, which is controlled by 
> mapred.map.tasks.
> In the new code, mapreduce.lib.input.FileInputFormat:
>         long splitSize = computeSplitSize(blockSize, minSize, maxSize);
> i.e. we no longer have a goal size. Furthermore, the implementation of 
> computeSplitSize() no longer makes sense:
>     return Math.max(minSize, Math.min(maxSize, blockSize));
> Since we assume that maxSize is always bigger than minSize, the above line is 
> equivalent to just
>     return Math.min(maxSize, blockSize);
> so minSize is useless.
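
For reference, the quoted expression can be tried on a few values. This is a 
small sketch, not the Hadoop source: the class name is hypothetical, and the 
1-byte minimum and Long.MAX_VALUE maximum used as "unset" values are 
assumptions. It shows the expression clamping blockSize into the range 
[minSize, maxSize], so it matches Math.min(maxSize, blockSize) as long as 
minSize does not exceed blockSize.

```java
public class ClampSketch {
    static final long MB = 1024L * 1024L;

    // The expression quoted in the issue description.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 128 * MB; // assumed block size

        // Neither bound is restrictive: the split size is the block size.
        System.out.println(computeSplitSize(block, 1, Long.MAX_VALUE) / MB);

        // maxSize below blockSize: same result as Math.min(maxSize, blockSize).
        System.out.println(computeSplitSize(block, 1, 25 * MB) / MB);

        // minSize above blockSize: the outer Math.max() decides the result.
        System.out.println(computeSplitSize(block, 256 * MB, Long.MAX_VALUE) / MB);
    }
}
```

The three calls print 128, 25 and 256, so in this sketch minSize only changes 
the outcome when it is set larger than the block size.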
