[ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804118#action_12804118 ]

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

Owen, I would like to figure out what we need to do to address your concerns 
without losing functionality.

Some notes on functionality:

We need to support non-record top-level types.  A Python or Ruby program might 
generate data whose top-level type is a union, array, map, long, string, etc.  
A serialization-specific class is not a natural Java representation for such 
data.  We should not create AvroSpecificLong, AvroReflectLong, etc. wrappers as 
we have been forced to do with Writable.  Rather, applications should be able to 
use built-in Java types for such data.
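
For illustration only (this is not from the patch; the class name and values 
are hypothetical), here is what such data looks like with plain Java types:

    import org.apache.avro.Schema;

    public class TopLevelUnion {
      public static void main(String[] args) {
        // A schema whose top-level type is a union of long and string.
        Schema schema = Schema.parse("[\"long\", \"string\"]");
        System.out.println(schema);  // ["long","string"]

        // Natural Java values for data of this schema: no wrapper classes.
        Object[] values = { Long.valueOf(42L), "forty-two" };
        for (Object v : values) {
          System.out.println(v.getClass().getName() + ": " + v);
        }
      }
    }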

A class alone is not sufficient to specify a serialization.  One must be able 
to specify both:
 - the binary format of the data
 - the mapping from binary to in-memory

These are distinct.  In Avro, a schema defines the binary format, and a 
DatumReader usually defines the mapping from binary to in-memory.  Avro 
includes three different DatumReader implementations: generic, specific and 
reflect.  These map built-in types differently.  For example, 
ReflectDatumReader maps Avro strings to java.lang.String while 
SpecificDatumReader maps Avro strings to org.apache.avro.util.Utf8.
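
As a rough sketch (assuming a reasonably current Avro release; this is not 
part of the patch), the same bytes read against the same schema yield 
different in-memory types depending on which DatumReader is used:

    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.reflect.ReflectDatumReader;
    import org.apache.avro.util.Utf8;

    public class DatumReaderMappings {
      public static void main(String[] args) throws Exception {
        Schema schema = Schema.create(Schema.Type.STRING);

        // Write a single Avro string.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<Object>(schema).write(new Utf8("hello"), enc);
        enc.flush();
        byte[] bytes = out.toByteArray();

        // Same bytes, same schema, two different in-memory representations.
        Object generic = new GenericDatumReader<Object>(schema)
            .read(null, DecoderFactory.get().binaryDecoder(bytes, null));
        Object reflected = new ReflectDatumReader<Object>(schema)
            .read(null, DecoderFactory.get().binaryDecoder(bytes, null));

        System.out.println(generic.getClass().getName());    // org.apache.avro.util.Utf8
        System.out.println(reflected.getClass().getName());  // java.lang.String
      }
    }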

Moreover, an application can use a lower-level, event-based API to map its 
in-memory Java data structures to Avro.  For example, Pig data is a union of 
Bag, Tuple and some built-in types.  It's straightforward to define an Avro 
schema for this, but none of Avro's provided DatumReader implementations is 
likely to be optimal: reflection is slow, and specific and generic would 
require copying data to and from generated classes.  So Pig might best declare 
its schema and then read and write its own data structures directly as Avro 
data; Avro includes tools to make this efficient and type-safe.  Pig could thus 
define its own serialization for Long, null, etc., as sketched below.
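
To make that concrete, here is a minimal, hypothetical sketch of such a 
serializer: a DatumWriter that writes plain Long values (or null) directly 
against a ["null","long"] union schema, without copying into generic or 
generated objects.  DirectLongWriter is an invented name; a real Pig 
serialization would also cover Bag, Tuple, etc.

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.io.DatumWriter;
    import org.apache.avro.io.Encoder;

    /** Writes a plain Long (or null) directly against a ["null","long"] union. */
    public class DirectLongWriter implements DatumWriter<Long> {
      private Schema schema;  // expected to be the union ["null","long"]

      @Override
      public void setSchema(Schema schema) {
        this.schema = schema;
      }

      @Override
      public void write(Long datum, Encoder out) throws IOException {
        if (datum == null) {
          out.writeIndex(0);  // select the "null" branch of the union
          out.writeNull();
        } else {
          out.writeIndex(1);  // select the "long" branch
          out.writeLong(datum.longValue());
        }
      }
    }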

In the current patch, one would use AvroGenericData.setMapOutputKeySchema(conf, 
schema) to specify that Avro's generic in-memory representation should be used 
for data whose binary format corresponds to a given schema.  One would use 
WritableData.setMapOutputKeyClass(conf, class) to specify that a Writable class 
should be used to define both the binary format and its in-memory 
representation.  Both of these set configuration parameters used by the 
serialization factory to produce an appropriate serializer.  A third party can 
define a new serializer for, e.g., Pig data, and configure jobs to use it with 
something like PigData.setMapOutputFormat(conf).
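
A rough sketch of the resulting job setup, using the helper methods named 
above (these come from the patch under review, so treat the exact signatures 
as approximate; MyJob and the schema are placeholders):

    // Sketch only: AvroGenericData and WritableData are the patch's helper
    // classes described above.
    JobConf conf = new JobConf(MyJob.class);

    // Avro's generic in-memory representation; the schema declares the binary
    // format of the map output keys.
    Schema keySchema = Schema.parse("{\"type\": \"map\", \"values\": \"long\"}");
    AvroGenericData.setMapOutputKeySchema(conf, keySchema);

    // Alternatively, a Writable class fixes both the binary format and the
    // in-memory representation:
    // WritableData.setMapOutputKeyClass(conf, Text.class);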

We cannot easily get this from the InputFormat.  We could instead configure the 
input format with such information and have it use the serialization factory.  
Applications would still need to set the same number of parameters, just in a 
different place: the binary format still needs to be declared, as does the 
mapping from binary to in-memory; neither can be inferred automatically.  
That's a more substantial change than we wished to make in this patch.

Moreover, the InputFormat is irrelevant to the current patch, since the 
serialization used during the shuffle can be different from that used for input 
and output.  We don't force input and output serializations to be identical, 
nor should we force the intermediate serialization to match one or the other.  
The map input might be legacy data, and the map function might convert it to a 
different representation that requires a different serialization.  This patch 
concerns intermediate data; MAPREDUCE-815 makes the corresponding changes for 
input and output data.

The tests for MAPREDUCE-815 include an end-to-end Avro job whose key uses 
Avro's generic data representation and whose value is simply null.  Have a look 
at the code that creates such a job: it looks much like our job creation today; 
the API doesn't appear fundamentally different to me.  It's not currently 
possible to specify null as a MapReduce value, since null has no class.  
NullWritable is a crude, serialization-specific way of doing this that, 
long-term, I hope we can deprecate in favor of simply declaring and passing 
null when that's appropriate.
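
For contrast, the existing workaround looks roughly like this (standard 
Hadoop API; KeyOnlyMapper is an invented name, shown only to illustrate the 
wrapper pattern):

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /** Emits each input value as a key with a "null" value. */
    public class KeyOnlyMapper extends Mapper<Object, Text, Text, NullWritable> {
      @Override
      protected void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        // The value is conceptually null, but must be wrapped in a
        // serialization-specific class because the framework requires a class.
        context.write(value, NullWritable.get());
      }
    }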

I hope these observations can help us reach consensus without extensive delay.

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, 
> MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, 
> MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class.  Instead we should 
> use the Serialization API to create key comparators.  This would permit, 
> e.g., Avro-based comparators to be used, permitting efficient sorting of 
> complex data types without having to write a RawComparator in Java.

