[jira] Commented: (MAPREDUCE-801) MAPREDUCE framework should issue warning with too many locations for a split

Doug Cutting (JIRA) Thu, 30 Jul 2009 16:11:41 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737347#action_12737347 ]


Doug Cutting commented on MAPREDUCE-801:
----------------------------------------

I just chatted with Arun and he asked me to make a more specific proposal, so 
here goes:
 - change the limit on tasks per job to a limit on total locations per job
 - if the jobtracker buffers locations from multiple jobs:
 -- record the total number of locations in the job file header
 -- stop buffering jobs when the location limit is exceeded
 - list the total number of locations in the web ui
 - list the percentage of non-local i/o in the web ui
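
A minimal sketch of the per-job accounting proposed above, in Java.  The 
class, method, and parameter names are hypothetical and are not part of 
Hadoop; each split is assumed to report its locations as an array of host 
names, and the limit is assumed to be a plain numeric setting.

```java
import java.util.List;

/** Hypothetical helper; none of these names exist in Hadoop. */
final class LocationBudget {
    private LocationBudget() {}

    /** Sums the number of reported locations over all splits of one job. */
    static long totalLocations(List<String[]> splitLocations) {
        long total = 0;
        for (String[] locations : splitLocations) {
            total += locations.length;
        }
        return total;
    }

    /** True if the job's total location count is above the assumed limit. */
    static boolean exceedsLimit(List<String[]> splitLocations,
                                long maxLocationsPerJob) {
        return totalLocations(splitLocations) > maxLocationsPerJob;
    }
}
```

Under this scheme a job with one split and 63 locations costs the same as a 
job with 21 splits of 3 locations each, so the limit bounds the total amount 
of location data rather than penalizing any single split.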

A limit on the number of locations per split would disallow reasonable 
applications, e.g., those with small, very highly replicated input, which 
should therefore be easy to schedule on a busy cluster.  For example, distcp 
sets the replication of its input file to sqrt(clustersize) so that no single 
datanode is hammered when all of the tasks read the same file.  With a 4k-node 
cluster, that's a replication of 63, i.e., one split with 63 locations.
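
The arithmetic behind that figure can be checked with a short Java sketch.  
The class and method names are hypothetical, and rounding down is an 
assumption made here to match the figure quoted above; this is not distcp's 
actual code.

```java
/** Hypothetical helper illustrating the sqrt(clustersize) rule. */
final class ReplicationMath {
    private ReplicationMath() {}

    /** Square root of the cluster size, rounded down (assumed rounding). */
    static int sqrtReplication(int clusterSize) {
        return (int) Math.floor(Math.sqrt(clusterSize));
    }
}
```

For a cluster of 4000 nodes, sqrt(4000) is about 63.2, so sqrtReplication(4000) 
gives 63.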


> MAPREDUCE framework should issue warning with too many locations for a split
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-801
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Hong Tang
>
> A customized input format may be buggy and report misleading locations 
> through its input splits; PIG-878 is one example. When an input split 
> returns too many locations, it not only artificially inflates the percentage 
> of data-local or rack-local maps, but also forces the scheduler to use more 
> memory and work harder to assign tasks.

