[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Scott Carey (JIRA) Fri, 29 Jan 2010 12:03:05 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806472#action_12806472
 ]


Scott Carey commented on MAPREDUCE-1126:
----------------------------------------

b...@scott: the annotations for Input/OutputFormat seem to be misplaced. It 
seems desirable to be able to write a single Map function that does wordcount 
on Strings, regardless of whether those strings are stored in newline-delimited 
text, sequence files, avro data files, or whatever.

Philip, yes, they are not in the right place.  I just wanted to bring into the 
conversation that 'SomeObject.setSomeBinding()' is not the only way to do this 
sort of thing.  Annotations, unlike setter methods, can be moved around and 
adapted to work in various ways without breaking APIs.  For example, the 
Input/OutputFormat annotation could go on either a Map class or some other, 
more specific annotation site, with defaults and a priority order (set on 
configuration > annotated on configuration > annotated on map > default) 
determining which one applies.
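
A minimal sketch of how such a priority order could be resolved, purely for 
illustration.  Nothing here is an existing Hadoop API: the @InputFormatBinding 
annotation, the ExampleConf class, the "example.input.format" key and the 
format names are all hypothetical stand-ins.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.HashMap;
import java.util.Map;

public class BindingResolution {

    // Hypothetical annotation; not part of Hadoop.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface InputFormatBinding {
        String value();
    }

    static final String CONF_KEY = "example.input.format"; // hypothetical key
    static final String DEFAULT_FORMAT = "text";           // hypothetical default

    // Stand-in for a job configuration object with procedural setters.
    static class ExampleConf {
        private final Map<String, String> props = new HashMap<>();
        void set(String key, String value) { props.put(key, value); }
        String get(String key) { return props.get(key); }
    }

    // A configuration subclass that carries the annotation itself.
    @InputFormatBinding("sequencefile")
    static class AnnotatedConf extends ExampleConf {}

    // A map class that declares its preferred format declaratively.
    @InputFormatBinding("avro")
    static class AnnotatedMap {}

    // A map class with no annotation at all.
    static class PlainMap {}

    // Priority: set on configuration > annotated on configuration
    //           > annotated on map > default.
    static String resolveInputFormat(ExampleConf conf, Class<?> mapClass) {
        String explicit = conf.get(CONF_KEY);
        if (explicit != null) {
            return explicit;
        }
        InputFormatBinding onConf =
            conf.getClass().getAnnotation(InputFormatBinding.class);
        if (onConf != null) {
            return onConf.value();
        }
        InputFormatBinding onMap = mapClass.getAnnotation(InputFormatBinding.class);
        if (onMap != null) {
            return onMap.value();
        }
        return DEFAULT_FORMAT;
    }

    public static void main(String[] args) {
        System.out.println(resolveInputFormat(new ExampleConf(), PlainMap.class));
        System.out.println(resolveInputFormat(new ExampleConf(), AnnotatedMap.class));
        System.out.println(resolveInputFormat(new AnnotatedConf(), AnnotatedMap.class));
    }
}
```

The point of the sketch is that the lookup order lives in one place, so a new 
annotation site could be added later without changing any existing setter.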

After thinking about it a bit more, and doing some research into how other APIs 
do some tricky things with annotations, there are a few things to consider:
* It is possible in some situations to infer the generic types of a class at 
runtime by constructing an instance of an object that carries the same type 
arguments (Jackson does this with an anonymous subclass of its TypeReference 
class).  Example: 
http://wiki.fasterxml.com/JacksonInFiveMinutes#Data_Binding_with_Generics
* Annotations declared on class A can be applied to class B ("Mix-In 
Annotations"): 
http://wiki.fasterxml.com/JacksonMixInAnnotations
* Post-compile-time checks via an annotation processor can validate code before 
run time, in cases where the current M/R framework only fails at run time.
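
To make the first point concrete, here is a small sketch of the generic-type 
trick using only java.lang.reflect.  The TypeToken class is a hypothetical 
helper written for this example; it is modeled on the idea behind Jackson's 
TypeReference but is not Jackson's code and not part of Hadoop.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;
import java.util.Map;

public class TypeTokenDemo {

    // Hypothetical helper: an anonymous subclass records its type argument
    // in its class metadata, so the argument survives erasure.
    abstract static class TypeToken<T> {
        private final Type type;

        protected TypeToken() {
            Type superclass = getClass().getGenericSuperclass();
            if (!(superclass instanceof ParameterizedType)) {
                throw new IllegalStateException("TypeToken needs a type argument");
            }
            // The single type argument supplied at the subclassing site.
            this.type = ((ParameterizedType) superclass).getActualTypeArguments()[0];
        }

        Type getType() {
            return type;
        }
    }

    // The trailing {} creates the anonymous subclass that makes this work.
    static Type exampleType() {
        return new TypeToken<Map<String, List<Long>>>() {}.getType();
    }

    public static void main(String[] args) {
        // Prints the full generic type, including the nested List<Long>.
        System.out.println(exampleType().getTypeName());
    }
}
```

A framework could use the same idea to discover a job's key and value types 
from a declaration instead of requiring them to be set by hand, though whether 
that fits M/R is exactly the open question here.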

What I think is most important to this discussion is that some layers of 
configuration complexity can be hidden from users, and some of it deferred to 
the future.
The 'site' of the configuration can be moved around with Annotations, opening 
up ways to simplify the steps required to do declarative configuration.

With this in mind, some additional complexity in the procedural configuration 
methods is more acceptable if there are good defaults and a later (backwards 
compatible) API addition simplifies things.  Likewise, some elements of 
complexity can be skipped for now if it is clear that they could be made 
available through a configuration extension later.  Perhaps the procedural API 
would never allow configuring a key and a value to use different serializers, 
to avoid API complexity, while a future annotation extension could allow that.


> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, 
> MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, 
> MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class.  Instead we should 
> use the Serialization API to create key comparators.  This would permit, 
> e.g., Avro-based comparators to be used, permitting efficient sorting of 
> complex data types without having to write a RawComparator in Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
