[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Doug Cutting (JIRA) Mon, 11 Jun 2007 12:33:55 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503576
 ]


Doug Cutting commented on HADOOP-1440:
--------------------------------------

> It sounds like the need is to name the reduces, not control their order.

Rather, it's useful to be able to name the maps when reduce is disabled and 
outputs correspond directly to splits.  It also may be useful to be able to 
determine map order, as the application may know things about their relative 
costs that the kernel cannot.  We should separate these two notions.  The 
returned order of splits can only be used to represent one or the other, not 
both.

Some time ago I'd proposed adding a 'float cost()' method to splits.  This 
could be used for sorting for performance.  Owen argued that 'long length()' 
was better, since it permitted space allocation.  Perhaps we need both: we 
should sort by cost() and potentially constrain task allocation by length().  
By default, cost() would be length().
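
To make the cost()/length() idea concrete, here is a minimal sketch. It is not the existing split API: the CostedSplit interface, the two split classes, and byDescendingCost() are hypothetical names used only for illustration. It assumes cost() falls back to length() unless the application overrides it, and that sorting for performance means most expensive first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SplitCostSketch {

    // Hypothetical split interface; NOT the actual Hadoop InputSplit API.
    interface CostedSplit {
        long length();                             // size in bytes, usable for space allocation
        default float cost() { return length(); }  // by default, cost() is length()
    }

    // A split that relies on the default: its cost is its length.
    static final class PlainSplit implements CostedSplit {
        private final long length;
        PlainSplit(long length) { this.length = length; }
        @Override public long length() { return length; }
    }

    // A split whose application-supplied cost differs from its length.
    static final class WeightedSplit implements CostedSplit {
        private final long length;
        private final float cost;
        WeightedSplit(long length, float cost) { this.length = length; this.cost = cost; }
        @Override public long length() { return length; }
        @Override public float cost() { return cost; }
    }

    // Returns a copy ordered most-expensive-first; the input list is left untouched.
    static List<CostedSplit> byDescendingCost(List<? extends CostedSplit> splits) {
        List<CostedSplit> sorted = new ArrayList<>(splits);
        sorted.sort(Comparator.comparingDouble((CostedSplit s) -> s.cost()).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<CostedSplit> splits = Arrays.<CostedSplit>asList(
                new PlainSplit(10), new WeightedSplit(5, 100f), new PlainSplit(50));
        for (CostedSplit s : byDescendingCost(splits)) {
            System.out.println("length=" + s.length() + " cost=" + s.cost());
        }
    }
}
```

In this sketch the small but expensive split (length 5, cost 100) is scheduled first, while length() remains available unchanged for any space-based constraint on task allocation.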

Regardless, if sorting is done by the kernel (as it is now) then we should 
probably use the returned order of splits to determine the output partition 
when reduce is disabled.  Do we agree on that?
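
As a rough illustration of keeping the two notions separate, the sketch below records each split's position in the returned list before any sorting, so the output partition follows the returned order while execution order follows size. IndexedSplit, index(), and executionOrder() are hypothetical names, not existing JobClient code, and the "part-" + index naming is assumed from the issue description.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SplitOrderSketch {

    // Hypothetical pairing of a split's returned position with its length.
    static final class IndexedSplit {
        final int partition;  // index in the order returned by the input format
        final long length;    // split size in bytes

        IndexedSplit(int partition, long length) {
            this.partition = partition;
            this.length = length;
        }

        // Output name comes from the returned order, not the execution order.
        String outputName() { return "part-" + partition; }
    }

    // Tags each split length with its position in the returned list.
    static List<IndexedSplit> index(long[] lengths) {
        List<IndexedSplit> out = new ArrayList<>();
        for (int i = 0; i < lengths.length; i++) {
            out.add(new IndexedSplit(i, lengths[i]));
        }
        return out;
    }

    // Largest splits first, as the kernel sorts today; partition indices are preserved.
    static List<IndexedSplit> executionOrder(List<IndexedSplit> splits) {
        List<IndexedSplit> sorted = new ArrayList<>(splits);
        sorted.sort(Comparator.comparingLong((IndexedSplit s) -> s.length).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        for (IndexedSplit s : executionOrder(index(new long[] {10, 50, 30}))) {
            System.out.println(s.outputName() + " (length " + s.length + ")");
        }
    }
}
```

With lengths 10, 50, 30 the tasks run in the order part-1, part-2, part-0, yet each map still writes the partition that matches its input split.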

> JobClient should not sort input-splits
> --------------------------------------
>
>                 Key: HADOOP-1440
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1440
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.3
>         Environment: All
>            Reporter: Milind Bhandarkar
>            Assignee: Milind Bhandarkar
>             Fix For: 0.14.0
>
>
> Currently, the JobClient sorts the InputSplits returned by InputFormat in 
> descending order, so that the map tasks corresponding to larger input-splits 
> are scheduled for execution before smaller ones. However, this causes 
> problems in applications that produce data-sets partitioned similarly to the 
> input partition with -reducer NONE.
> With -reducer NONE, map task i produces part-i. However, in the typical 
> applications that use -reducer NONE, it should produce a partition that has 
> the same index as the input partition.
> (Of course, this requires that each partition should be fed in its entirety 
> to a map, rather than splitting it into blocks, but that is a separate issue.)
> Thus, sorting input splits should be either controllable via a configuration 
> variable, or the FileInputFormat should sort the splits and JobClient should 
> honor the order of splits.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
