[ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805709#action_12805709 ]
Chris Douglas commented on MAPREDUCE-1126:
------------------------------------------

Replacing the type-driven serialization with an explicitly specified, context-sensitive factory is 1) throwing away all Java type hierarchies, 2) asserting that the serialization defines the user types, and 3) implying that these types- and the relationships between them- should remain opaque to the MapReduce framework.

It's making the tradeoff discussed in HADOOP-6323: all the type checks are removed from the framework and enforced by the serializer instead. So {{WritableSerialization}}- appropriately- requires an exact match for the configured class, but other serializers may not. The MapReduce framework can't do any checks of its own- nor, notably, can Java- to verify properties of the types users supply; their semantics are _defined by_ the serialization. For example, a job using related {{Writable}} types may pass a compile-time type check and work with explicit Avro serialization in the intermediate data, but fail if run with implicit Writable serialization.

This is a *huge* shift. It means the generic Java types for the {{Mapper}}, {{Reducer}}, collector, etc. literally don't matter; they're effectively all {{Object}} (relying on autoboxing to collect primitive types). It means every serialization has its own type semantics, which need not look anything like what Java can enforce, inspect, or interpret. Given this, it's not entirely surprising that the patch makes the serialization the most prominent interface to MapReduce.

It's also powerful functionality. By allowing any user type to be serialized/deserialized per context, the long-term elimination of the key/value distinction doesn't change {{collect(K,V)}} to {{collect(Object)}} as proposed, but rather to {{collect(Object...)}}: the serializer transforms the record into bytes, and the comparator works on that byte range, determining which bytes are relevant per the serialization contract. Especially for frameworks written on top of MapReduce, less restrictive interfaces here would surely be fertile ground for performance improvements.

That said: I hate this API for users. Someone writing a MapReduce job is writing a transform of data; how these data are encoded in different contexts is usually irrelevant to their task. Forcing the user to pick a serialization to declare their types to- rather than offering their types to MapReduce- is backwards for the vast majority of cases. Consider the {{Writable}} subtype example above: one is tying the correctness of the {{Mapper}} to the intermediate serialization declared in the submitter code, whose semantics are inscrutable. That's just odd. If one's map is going to emit data without a common type, then doesn't it make sense to declare that instead of leaving the signature as {{Object}}? That is, particularly given MAPREDUCE-1411, wouldn't the equivalent of {{Mapper<Text,Text,Text,AvroRecord>}} be a more apt signature than {{Mapper<Text,Text,Text,Object>}} for an implementation emitting {{int}} and {{String}} as value types?

I much prefer the semantics of the global serializer, but wouldn't object to adding an inconspicuous knob in support of context-sensitive serialization. Would a {{Job::setSerializationFactory(CTXT, SerializationFactory...)}} method, where {{CTXT}} is an enumerated type of framework hooks (e.g. {{DEFAULT}}, {{MAP_OUTPUT_KEY}}, {{MAP_OUTPUT_VALUE}}, etc.), be satisfactory?
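To make that concrete, here's a rough sketch of the knob I have in mind; the enum, the setter, and {{AvroSerializationFactory}} are illustrative names only, not anything in the current patch:

{code:java}
// Sketch only: neither this enum nor the Job method below exists yet;
// all names are illustrative.
public enum SerializationContext {
  DEFAULT, MAP_OUTPUT_KEY, MAP_OUTPUT_VALUE /* etc. */
}

// Proposed addition to Job:
//   void setSerializationFactory(SerializationContext ctxt,
//                                SerializationFactory... factories);

// Submitter code: the common, type-driven case is untouched; an advanced
// user overrides a single hook.
Job job = new Job(conf, "avro intermediate values");
job.setMapOutputKeyClass(Text.class);
job.setSerializationFactory(SerializationContext.MAP_OUTPUT_VALUE,
    new AvroSerializationFactory());   // hypothetical factory
{code}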
This way, one can instruct the framework to use/prefer a particular serialization in one context without requiring most users to change their jobs. It also permits continued use of largely type-based serialization, which- as Tom notes- is a very common case. Writing wrappers can be irritating, but for the MR API, I'd rather make it easier on common cases and users than on advanced uses and framework authors.

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class. Instead we should use the Serialization API to create key comparators. This would permit, e.g., Avro-based comparators to be used, permitting efficient sorting of complex data types without having to write a RawComparator in Java.
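To illustrate the kind of comparator the description above has in mind- one supplied by the serialization rather than written by the user- a rough sketch against Avro's {{BinaryData}} might look like the following; the class name is illustrative, not the one in the attached patches:

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryData;
import org.apache.hadoop.io.RawComparator;

/**
 * Illustrative sketch: a raw comparator backed by an Avro schema, the kind
 * of comparator a Serialization could hand to the shuffle so users never
 * write a RawComparator themselves.
 */
public class SchemaRawComparator implements RawComparator<Object> {
  private final Schema schema;

  public SchemaRawComparator(Schema schema) {
    this.schema = schema;
  }

  /** Compare two serialized keys directly in the shuffle's byte buffers. */
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return BinaryData.compare(b1, s1, b2, s2, schema);
  }

  /** Object comparison is not needed on the raw sort path. */
  public int compare(Object o1, Object o2) {
    throw new UnsupportedOperationException("raw comparison only");
  }
}
{code}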