[ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805760#action_12805760 ]
Arun C Murthy commented on MAPREDUCE-1126:
------------------------------------------

bq. Great points, Chris. Yahoo! has stated that a significant majority of their MapReduce jobs are written in Pig, and Facebook says the same of Hive. Among our many customers at Cloudera, it's far more common to target the MapReduce execution engine with a higher level language rather than the Java API. What you propose as the common case, then, appears to be uncommon in practice.

Uh, no. That is precisely the point - making it slightly harder on _framework_ authors is better than making it harder for the average users of the Map-Reduce API. Only the framework authors pay the cost...

----

Along similar sentiments I'd like to restate:

{quote}
1. We should use the current global serializer factory for all contexts of a job.
4. Only the default comparator should come from the serializer. The user has to be able to override it in the framework (not change the serializer factory).
{quote}

I'm not convinced we need to allow multiple serialization mechanisms for the same job. I'm even less convinced that we need to allow a serializer per map-in-key, map-in-value, map-out-key, map-out-value, reduce-out-key, reduce-out-value, etc.

I can see that we might have some transition phase where people move from Writables to Avro as the preferred serialization mechanism. For example, they might read SequenceFiles with Writable input records and produce SequenceFiles with Avro output records. However, even with a single serializer factory for all contexts of a job, it is trivial to write wrappers, provide bridges in libraries or other frameworks, etc. to cross the chasm.

----

At a later point, *iff* we get to a world where we regularly need to reconcile multiple serialization mechanisms within the same job - e.g. a world where we have a lot of data in Writables *and* Avro *and* Thrift - I'd like to propose a slightly less involved version of Chris's proposal. The simplification is that we view four separate 'record contexts':

# INPUT (map-in-key, map-in-value)
# INTERMEDIATE (map-out-key, map-out-value)
# OUTPUT (reduce-out-key, reduce-out-value)
# JOB_DEFINITION (currently only InputSplit, possibly more in the future via MAPREDUCE-1183)

Then we have Chris's proposal:

{noformat}
enum Context { INPUT, INTERMEDIATE, OUTPUT, JOB_DEFINITION }

Job::setSerializationFactory(Context context, SerializationFactory...)
{noformat}

Thus we allow serializers to be specified per 'record context' flowing through the Map-Reduce framework... allowing map-in-key and map-in-value to have different serialization mechanisms seems like overkill. Do we have use-cases for such requirements?

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class. Instead we should use the Serialization API to create key comparators. This would permit, e.g., Avro-based comparators to be used, permitting efficient sorting of complex data types without having to write a RawComparator in Java.
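To make the 'wrappers/bridges' point above concrete, here is a minimal sketch of such a bridge: a Writable that carries an Avro GenericRecord through a job whose serializer factory only knows Writables. The class name AvroRecordWritable is hypothetical (it is not part of Hadoop, Avro, or the attached patches), and it assumes the current Avro GenericDatumWriter/GenericDatumReader API.

{noformat}
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.io.Writable;

/**
 * Sketch of a bridge Writable wrapping an Avro GenericRecord. The schema has
 * to be known on both sides, e.g. set from the job configuration before the
 * framework calls readFields().
 */
public class AvroRecordWritable implements Writable {

  private Schema schema;        // schema of the wrapped record
  private GenericRecord datum;  // the wrapped Avro record

  public AvroRecordWritable() {}  // no-arg constructor required for Writables

  public AvroRecordWritable(Schema schema, GenericRecord datum) {
    this.schema = schema;
    this.datum = datum;
  }

  public void setSchema(Schema schema) { this.schema = schema; }

  public GenericRecord get() { return datum; }

  @Override
  public void write(DataOutput out) throws IOException {
    // Encode the record with Avro binary encoding and length-prefix the bytes
    // so readFields() knows how much to read back.
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(baos, null);
    new GenericDatumWriter<GenericRecord>(schema).write(datum, encoder);
    encoder.flush();
    byte[] bytes = baos.toByteArray();
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Read the length-prefixed Avro bytes and decode them with the schema.
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
    datum = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
  }
}
{noformat}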
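And a minimal sketch of how the per-context proposal above might look from user code during such a transition, if it were ever adopted. The Context enum and setSerializationFactory come from the proposal itself; WritableSerializationFactory and AvroSerializationFactory are placeholder names, not existing classes.

{noformat}
// Hypothetical usage of the per-context API sketched above; neither the
// method nor the factory classes exist in Hadoop today.
Job job = new Job(new Configuration());

// Input records are still Writables in SequenceFiles...
job.setSerializationFactory(Context.INPUT, new WritableSerializationFactory());

// ...while intermediate and output records move to Avro during the transition.
job.setSerializationFactory(Context.INTERMEDIATE, new AvroSerializationFactory());
job.setSerializationFactory(Context.OUTPUT, new AvroSerializationFactory());
{noformat}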
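For reference, one possible shape of the change the issue itself asks for (the shuffle getting its comparator from the serialization) is sketched below. The interface and method names are illustrative only and are not taken from the attached patches.

{noformat}
// Hypothetical extension of org.apache.hadoop.io.serializer.Serialization that
// can hand out a RawComparator derived from the serialized form (e.g. Avro
// binary comparison), so no hand-written Java comparator is needed.
public interface ComparableSerialization<T> extends Serialization<T> {
  RawComparator<T> getRawComparator(Class<T> c);
}

// The sort/shuffle would then prefer the serialization's comparator and fall
// back to the job's configured comparator class otherwise:
RawComparator<?> comparator;
if (serialization instanceof ComparableSerialization) {
  comparator = ((ComparableSerialization) serialization).getRawComparator(mapOutputKeyClass);
} else {
  comparator = job.getOutputKeyComparator();  // existing behaviour
}
{noformat}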