[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805760#action_12805760
 ] 

Arun C Murthy commented on MAPREDUCE-1126:
------------------------------------------

bq. Great points, Chris. Yahoo! has stated that a significant majority of their 
MapReduce jobs are written in Pig, and Facebook says the same of Hive. Among 
our many customers at Cloudera, it's far more common to target the MapReduce 
execution engine with a higher level language rather than the Java API. What 
you propose as the common case, then, appears to be uncommon in practice.

Uh, no. That is precisely the point - making it slightly harder on _framework_ 
authors is better than making it harder for the average user of the Map-Reduce 
API. Only the framework authors pay the cost...

---- 

In a similar vein, I'd like to restate:

{quote}
1. We should use the current global serializer factory for all contexts of a 
job.
4. Only the default comparator should come from the serializer. The user has to 
be able to override it in the framework (not change the serializer factory).
{quote}
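
To make point 4 concrete, here is a rough sketch using the existing Job API; 
the descending comparator class is only an illustration. The key's serialization 
supplies the default comparator, and a user who wants a different sort order 
overrides it on the job without swapping out the serializer factory:

{noformat}
// Sketch only: the serialization/registration for the map-output key class
// provides the default comparator; the user overrides the sort order on the
// Job without touching the serializer factory. DescendingTextComparator is
// purely illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;

public class ComparatorOverride {

  public static class DescendingTextComparator extends WritableComparator {
    public DescendingTextComparator() {
      super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return -super.compare(a, b);   // reverse the default Text ordering
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "comparator-override");
    job.setMapOutputKeyClass(Text.class);
    // Without the next line the shuffle uses the comparator that the key's
    // serialization provides by default; with it, the user's comparator wins
    // and the job's serializer factory is left alone.
    job.setSortComparatorClass(DescendingTextComparator.class);
  }
}
{noformat}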

I'm not convinced we need to allow multiple serialization mechanisms for the 
same job. I'm even less convinced that we need to allow a serializer per 
map-in-key, map-in-value, map-out-key, map-out-value, reduce-out-key, 
reduce-out-value etc.

I can see that we might have a transition phase where people move from 
Writables to Avro as the preferred serialization mechanism. For example, they 
might have SequenceFiles with Writables as input-records and produce 
SequenceFiles with Avro output-records. However, even with a single 
serializer-factory for all contexts of a job, it is trivial to write wrappers, 
provide bridges in libraries or other frameworks, etc. to cross the chasm.
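
As a rough illustration of what I mean by a single factory crossing the chasm 
(assuming the stock Hadoop serializer classes; the Avro reflect serialization 
class name may differ in your version):

{noformat}
// Sketch only: one SerializationFactory serves the whole job. It walks the
// io.serializations list and lets each Serialization's accept() claim the
// record class, so Writable input records and Avro output records can coexist
// without per-context factories.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.serializer.SerializationFactory;
import org.apache.hadoop.io.serializer.Serializer;

public class SingleFactoryBridge {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setStrings("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization",
        "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization");

    SerializationFactory factory = new SerializationFactory(conf);

    // Writable record classes are claimed by WritableSerialization...
    Serializer<LongWritable> forWritables =
        factory.getSerializer(LongWritable.class);

    // ...and a record class registered for Avro reflect serialization would
    // be claimed by AvroReflectSerialization through the very same factory.
  }
}
{noformat}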

----

At a later point, *iff* we get to a world where we regularly need to support 
multiple serialization mechanisms for the same job, e.g. a world where we have 
a lot of data in Writables *and* Avro *and* Thrift, I'd like to propose a 
slightly less involved version of Chris's proposal.

The simplification is that we view the job as having 4 separate 'record 
contexts':
# INPUT (map-in-key, map-in-value)
# INTERMEDIATE (map-out-key, map-out-value)
# OUTPUT (reduce-out-key, reduce-out-value)
# JOB_DEFINITION (currently only InputSplit, possibly more in future via 
MAPREDUCE-1183)

Then we have Chris's proposal:

{noformat}
enum Context {
  INPUT,
  INTERMEDIATE,
  OUTPUT,
  JOB_DEFINITION
}

Job::setSerializationFactory(Context context, SerializationFactory...)
{noformat}
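
For illustration, a job that keeps Writables on its input but shuffles and 
writes Avro might then read as follows (hypothetical usage; neither the method 
nor the factory classes below exist today):

{noformat}
// Hypothetical usage of the proposed API -- setSerializationFactory and the
// factory classes shown here are illustrative only.
job.setSerializationFactory(Context.INPUT,          new WritableSerializationFactory());
job.setSerializationFactory(Context.INTERMEDIATE,   new AvroSerializationFactory());
job.setSerializationFactory(Context.OUTPUT,         new AvroSerializationFactory());
job.setSerializationFactory(Context.JOB_DEFINITION, new WritableSerializationFactory());
{noformat}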

Thus we allow serializers to be specified for the 'records' flowing through the 
Map-Reduce framework. Allowing map-in-key and map-in-value to have different 
serialization mechanisms seems like overkill; do we have use-cases for such a 
requirement?

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, 
> MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, 
> MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class.  Instead we should 
> use the Serialization API to create key comparators.  This would permit, 
> e.g., Avro-based comparators to be used, permitting efficient sorting of 
> complex data types without having to write a RawComparator in Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
