[ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804118#action_12804118 ]
Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

Owen, I would like to figure out what we need to do to address your concerns without losing functionality. Some notes on functionality:

We need to support non-record top-level types. A Python or Ruby program might generate data whose top-level type is a union, array, map, long, string, etc. A serialization-specific class is not a natural Java representation for such data. We should not create AvroSpecificLong, AvroReflectLong, etc. wrappers as we have been forced to do with Writable. Rather, applications should be able to use built-in Java types for such data.

A class alone is not sufficient to specify a serialization. One must be able to specify both:
- the binary format of the data
- the mapping from binary to in-memory

These are distinct. In Avro a schema defines the binary format and a DatumReader usually defines the mapping from binary to in-memory. Avro includes three different DatumReader implementations: generic, specific and reflect. These map built-in types differently. For example, ReflectDatumReader maps Avro strings to java.lang.String while SpecificDatumReader maps Avro strings to org.apache.avro.util.Utf8 (a short sketch below illustrates this).

Moreover, an application can use a lower-level, event-based API to map its in-memory Java data structures to Avro. For example, Pig data is a union of Bag, Tuple and some built-in types. It's straightforward to define an Avro schema for this, but none of Avro's provided DatumReader implementations is likely to be optimal for it. Reflection is slow, and specific and generic would require copying data to and from generated classes. So Pig might best declare its schema and then read and write its own data structures directly as Avro data. Avro includes tools to make this type-safe and efficient. So Pig might define its own serialization for Long, null, etc.

In the current patch one would use AvroGenericData.setMapOutputKeySchema(conf, schema) to specify that Avro's generic in-memory representation should be used for data whose binary format corresponds to a given schema. One would use WritableData.setMapOutputKeyClass(conf, class) to specify that a Writable class should be used to define both the binary format and its in-memory representation. Both of these set configuration parameters used by the serialization factory to produce an appropriate serializer. A third party can define new serializers for, e.g., Pig data, and configure jobs to use them with something like PigData.setMapOutputFormat(conf).

We cannot easily get this from the InputFormat. We could instead configure the input format with such information and have it use the serialization factory, but applications would still need to set the same number of parameters, just in a different place: the binary format still needs to be declared, as does the mapping from binary to in-memory; these cannot be inferred automatically. That's a more substantial change that we did not wish to make in this patch. Moreover, the InputFormat is irrelevant to the current patch, since the serialization used during the shuffle can be different from that used for input and output. We don't force input and output serializations to be identical, nor should we force the intermediate serialization to match one or the other. The map input might be legacy data, and the map function might convert it to a different representation that requires a different serialization.

This patch concerns intermediate data. MAPREDUCE-815 makes the corresponding changes for input and output data.
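As a concrete illustration of the schema-versus-DatumReader distinction above, here is a small sketch, not part of the attached patch, assuming a reasonably recent Avro release where EncoderFactory and DecoderFactory are the entry points. It writes one value under one schema and reads the same bytes back with two readers: the generic reader (which, like the specific reader, maps Avro strings to Utf8 by default) and the reflect reader (which maps them to java.lang.String). The class name is just for illustration.

{code:java}
// Sketch only: one binary format (the schema), two in-memory mappings (the DatumReaders).
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectDatumReader;
import org.apache.avro.util.Utf8;

public class DatumReaderMappingSketch {
  public static void main(String[] args) throws Exception {
    // A non-record top-level type: a bare Avro string.
    Schema schema = Schema.create(Schema.Type.STRING);

    // The schema alone determines the binary format of the data.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<Object>(schema).write(new Utf8("hello"), encoder);
    encoder.flush();
    byte[] bytes = out.toByteArray();

    // The DatumReader determines the mapping from binary to in-memory.
    Object generic = new GenericDatumReader<Object>(schema)
        .read(null, DecoderFactory.get().binaryDecoder(bytes, null));
    Object reflected = new ReflectDatumReader<Object>(schema)
        .read(null, DecoderFactory.get().binaryDecoder(bytes, null));

    // Expected, per the mappings described above:
    System.out.println(generic.getClass().getName());   // org.apache.avro.util.Utf8
    System.out.println(reflected.getClass().getName()); // java.lang.String
  }
}
{code}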
The tests for MAPREDUCE-815 include an end-to-end Avro job whose key uses Avro's generic data representation and whose value is simply null. Have a look at the code that creates such a job: it looks much like our job creation today; the API doesn't appear fundamentally different to me.

It's not currently possible to specify null as a MapReduce value, since null has no class. NullWritable is a crude, serialization-specific way of doing this that, long-term, I hope we can deprecate in favor of simply declaring and passing null when that's appropriate.

I hope these observations can help us reach consensus without extensive delay.

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
> Key: MAPREDUCE-1126
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Reporter: Doug Cutting
> Assignee: Aaron Kimball
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch
>
> Currently the key comparator is defined as a Java class. Instead we should use the Serialization API to create key comparators. This would permit, e.g., Avro-based comparators to be used, permitting efficient sorting of complex data types without having to write a RawComparator in Java.
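To make the issue description above concrete, here is a minimal sketch, not the attached patch and with a hypothetical class name, of the kind of comparator a serialization could hand to the shuffle. It relies on Avro's BinaryData.compare, which orders serialized data directly from the bytes given only a schema, so no per-type Java RawComparator logic has to be written by hand.

{code:java}
// Hypothetical sketch of a schema-driven comparator; not taken from the attached patch.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.io.BinaryData;
import org.apache.hadoop.io.RawComparator;

public class AvroSchemaRawComparator implements RawComparator<Object> {

  private final Schema schema;

  public AvroSchemaRawComparator(Schema schema) {
    this.schema = schema;
  }

  /** Compare two serialized keys directly from their bytes, in the schema's sort order. */
  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return BinaryData.compare(b1, s1, b2, s2, schema);
  }

  /** Compare two in-memory keys (generic representation) with the same ordering. */
  @Override
  public int compare(Object o1, Object o2) {
    return GenericData.get().compare(o1, o2, schema);
  }
}
{code}

A serialization that knows the job's key schema could return such a comparator through the Serialization API, which is essentially what this issue asks the shuffle to consult instead of a configured comparator class.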