[ https://issues.apache.org/jira/browse/HADOOP-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468747 ]
Doug Cutting commented on HADOOP-960:
-------------------------------------
> it seems like the need to split input evenly would be pretty common
Can you say more about why you think this is important and useful? It's not
obvious to me.
Also, your original complaint was about the *number* of splits not matching
what you expect. Now you're complaining about the *size* of the splits not
being even. Which do you need? Both? Why? If you pass one big file and one
little file and ask for six splits, should it break each file into three, or
break the bigger file into four and the smaller into two? How should file size
be measured: number of records or number of bytes? There are myriad
possibilities. The base class implements something that should work well in
many cases by default, and it has some knobs that make it somewhat flexible,
but it's not well documented.
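[Editorial note: to make the "knobs" concrete, here is a minimal sketch of how
a split size could be chosen from the requested map count, a minimum split
size, and the filesystem block size. This is an illustration under those
assumptions only, not the actual Hadoop source, and all names are made up.]

    // Minimal sketch, not the actual Hadoop code: assumes the split size is
    // chosen from a goal size (total bytes / requested map tasks), a
    // configurable minimum split size, and the filesystem block size.
    public class SplitSizeSketch {
        static long computeSplitSize(long goalSize, long minSize, long blockSize) {
            // never smaller than the configured minimum, never larger than a block
            return Math.max(minSize, Math.min(goalSize, blockSize));
        }

        public static void main(String[] args) {
            long totalBytes = 640L * 1024 * 1024;   // hypothetical total input
            int requestedMaps = 16;                 // mapred.map.tasks hint
            long blockSize = 64L * 1024 * 1024;     // hypothetical block size
            long goalSize = totalBytes / requestedMaps;
            System.out.println("chosen split size: "
                + computeSplitSize(goalSize, 1, blockSize) + " bytes");
        }
    }

Under these assumptions, the requested map count is only a hint: the minimum
split size and the block size can both override it.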
> Incorrect number of map tasks when there are multiple input files
> -----------------------------------------------------------------
>
> Key: HADOOP-960
> URL: https://issues.apache.org/jira/browse/HADOOP-960
> Project: Hadoop
> Issue Type: Improvement
> Components: documentation
> Affects Versions: 0.10.1
> Reporter: Andrew McNabb
> Priority: Minor
>
> This problem happens with hadoop-streaming and possibly elsewhere. If there
> are 5 input files, the job creates 130 map tasks, even when
> mapred.map.tasks=128. The number of map tasks is incorrectly set to a
> multiple of the number of files. (I wrote a much more complete bug report,
> but Jira lost it when it hit an error, so I'm not in the mood to write it
> all again.)
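[Editorial note: for illustration only, and not part of the original report:
the 5-files / 128-requested / 130-actual numbers are consistent with per-file
rounding, assuming each file is split independently into
ceil(fileLength / goalSize) pieces. The file sizes below are hypothetical.]

    // Illustrative sketch only: if each file is split on its own and partial
    // chunks round up, 5 files with 128 requested maps can yield 130 splits.
    public class SplitCountSketch {
        public static void main(String[] args) {
            long[] fileLengths = {1000, 1000, 1000, 1000, 1000}; // hypothetical
            int requestedMaps = 128;                             // mapred.map.tasks
            long totalSize = 0;
            for (long len : fileLengths) totalSize += len;
            long goalSize = Math.max(1, totalSize / requestedMaps);
            int splits = 0;
            for (long len : fileLengths) {
                splits += (int) ((len + goalSize - 1) / goalSize); // ceil per file
            }
            System.out.println("requested=" + requestedMaps + ", actual=" + splits);
        }
    }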
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.