[ https://issues.apache.org/jira/browse/MAPREDUCE-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737347#action_12737347 ]
Doug Cutting commented on MAPREDUCE-801:
----------------------------------------
I just chatted with Arun and he asked me to make a more specific proposal, so
here goes:
- change the limit on tasks per job to a limit on total locations per job
- if the jobtracker buffers locations from multiple jobs:
-- record the total number of locations in the job file header
-- stop buffering jobs when the location limit is exceeded
- list the total number of locations in the web UI
- list the percentage of non-local I/O in the web UI
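As a rough illustration of the proposal above, here is a minimal sketch in Python rather than the jobtracker's actual code. The names Split, Job, total_locations, and buffer_jobs are hypothetical, and it assumes that "stop buffering" means the job that would push the running total past the limit is not buffered.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Split:
    # Hostnames that claim to hold this split's data (hypothetical model).
    locations: List[str]


@dataclass
class Job:
    name: str
    splits: List[Split]

    def total_locations(self) -> int:
        # Total locations per job: the sum over all splits, not a per-split cap.
        return sum(len(s.locations) for s in self.splits)


def buffer_jobs(jobs: List[Job], max_locations: int) -> Tuple[List[Job], int]:
    """Buffer jobs in order until the next one would exceed max_locations.

    Assumption: the job that would cross the limit is left unbuffered.
    """
    buffered: List[Job] = []
    used = 0
    for job in jobs:
        n = job.total_locations()
        if used + n > max_locations:
            break
        buffered.append(job)
        used += n
    return buffered, used


# Usage: two ordinary jobs and one job with a single, highly replicated split.
a = Job("a", [Split(["h1", "h2", "h3"]), Split(["h2", "h3", "h4"])])
b = Job("b", [Split([f"h{i}" for i in range(63)])])
c = Job("c", [Split(["h1", "h2", "h3"]), Split(["h4", "h5", "h6"])])
buffered, used = buffer_jobs([a, b, c], max_locations=70)
print([j.name for j in buffered], used)
```

Counting per job rather than per split means job "b" is accepted even though its one split lists 63 locations.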
A limit on the number of locations per split would disallow reasonable
applications, e.g., those with small, very highly replicated input that
should hence be easy to schedule on a busy cluster. For example,
distcp sets the replication of its input file to sqrt(clustersize) so that no
single datanode is hammered when all of the tasks read the same file. With a
4k-node cluster, that's a replication of 63, i.e., one split, 63 locations.
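The arithmetic behind that figure can be checked with a short sketch. This is not distcp's code: the function name sqrt_replication is hypothetical, "4k" is taken to mean 4000 nodes, and rounding down is an assumption chosen because it matches the 63 quoted above.

```python
import math


def sqrt_replication(cluster_size: int) -> int:
    # Integer square root of the cluster size, with a floor of 1 replica.
    # Rounding down is an assumption; it reproduces the 63 in the comment.
    return max(1, math.isqrt(cluster_size))


replication = sqrt_replication(4000)  # sqrt(4000) is about 63.2
print(replication)
```

With one split for that input file, the job's total location count equals this replication, so a per-job total stays small even though a per-split cap could reject it.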
> MAPREDUCE framework should issue warning with too many locations for a split
> ----------------------------------------------------------------------------
>
> Key: MAPREDUCE-801
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-801
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Reporter: Hong Tang
>
> A customized input format may be buggy and report misleading locations through
> its input splits, an example of which is PIG-878. When an input split returns too
> many locations, it not only artificially inflates the percentage of data-local
> or rack-local maps, but also forces the scheduler to use more memory and
> work harder to conduct task assignment.