[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Jay Booth (JIRA) Wed, 03 Feb 2010 12:55:53 -0800

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829251#action_12829251 ]


Jay Booth commented on MAPREDUCE-1126:
--------------------------------------

+1 for the general concept of a lower-level API; great idea.

Any thoughts on explicitly setting a Mapper per split?  Joins between
different formats are a primary use case, and it's always awkward using
MultipleInputs to shoehorn the different classes into a single conf.  As I
understand it now, with MultipleInputs the MapTask wakes up, looks at its
input split, compares that to a magic configuration field mapping splits to
mapper classes, and instantiates that mapper class.  This leads to trouble if
you need to mix it with, say, CombineFileInputFormat or anything else that
relies on configuration, since the various static setConfigValue(conf)
methods each set a single value, assuming a single mapper class.
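
To make the concern concrete, here is a minimal plain-Java sketch of the
dispatch as described in this comment.  It is not Hadoop code and does not
claim to match the actual MultipleInputs internals: the class, the methods,
and the "sketch.*" keys are all hypothetical, and a HashMap stands in for the
job configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the behavior described in the comment; not Hadoop code.
// Every class, method, and key name here is hypothetical.
class SplitDispatchSketch {
    static final String MAPPERS_KEY = "sketch.input.dir.mappers";
    static final String MAX_SPLIT_KEY = "sketch.max.split.size";

    // Stand-in for the single job-wide configuration object.
    static final Map<String, String> conf = new HashMap<>();

    // Append "path;mapperClass" to one delimited configuration value.
    static void addInputPath(String path, String mapperClass) {
        conf.merge(MAPPERS_KEY, path + ";" + mapperClass,
                (old, add) -> old + "," + add);
    }

    // A typical static setter: one key, one value, for the whole job.
    static void setMaxSplitSize(long bytes) {
        conf.put(MAX_SPLIT_KEY, Long.toString(bytes));
    }

    // What the task would do on startup: match its split against the mapping.
    static String mapperFor(String splitPath) {
        String value = conf.getOrDefault(MAPPERS_KEY, "");
        for (String entry : value.split(",")) {
            String[] parts = entry.split(";");
            if (parts.length == 2 && splitPath.startsWith(parts[0])) {
                return parts[1];
            }
        }
        throw new IllegalStateException("no mapper configured for " + splitPath);
    }
}
```

The mapper lookup can distinguish inputs, but setMaxSplitSize cannot: a
second call simply overwrites the first, so two mappers that need different
values for the same setting have nowhere to put them.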

If we set a specific mapper class per split, and then a specific config per
mapper class, I think it would be a lot more flexible for framework authors
who need to combine different types of functionality.  If you're just a user,
maybe you don't want to deal with the extra environment setup for simple
jobs, but if this is a lower-level API, maybe it could be useful?  It would
certainly be cleaner if a single-input job were just an N=1 multiple-inputs
job, rather than the current situation where a multiple-inputs job is a
configuration-level hack on top of the single-input framework.
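
As a rough illustration of the shape this could take, the following
plain-Java sketch keeps a mapper class per split and a separate config per
mapper class.  None of these types exist in Hadoop; JobPlan, addInput,
confForSplit, and singleInput are hypothetical names invented for this
example, and mapper classes are represented by plain strings.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the proposal; none of these types exist in Hadoop.
class PerSplitMapperSketch {
    static final class JobPlan {
        // split path -> mapper class name
        final Map<String, String> mapperBySplit = new LinkedHashMap<>();
        // mapper class name -> settings scoped to that mapper only
        final Map<String, Map<String, String>> confByMapper = new HashMap<>();

        JobPlan addInput(String splitPath, String mapperClass,
                         Map<String, String> mapperConf) {
            mapperBySplit.put(splitPath, mapperClass);
            confByMapper.computeIfAbsent(mapperClass, k -> new HashMap<>())
                        .putAll(mapperConf);
            return this;
        }

        // The task resolves its mapper first, then that mapper's own config.
        Map<String, String> confForSplit(String splitPath) {
            String mapper = mapperBySplit.get(splitPath);
            if (mapper == null) {
                throw new IllegalArgumentException("unknown split " + splitPath);
            }
            return confByMapper.get(mapper);
        }
    }

    // A single-input job is just the N=1 case of the same structure.
    static JobPlan singleInput(String splitPath, String mapperClass,
                               Map<String, String> mapperConf) {
        return new JobPlan().addInput(splitPath, mapperClass, mapperConf);
    }
}
```

With this layout, two mappers can each hold their own value for the same
setting name, and the single-input path goes through the same code as the
multi-input one.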

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, 
> MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, 
> MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class.  Instead we should 
> use the Serialization API to create key comparators.  This would permit, 
> e.g., Avro-based comparators to be used, permitting efficient sorting of 
> complex data types without having to write a RawComparator in Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
