[ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805709#action_12805709 ]
Chris Douglas commented on MAPREDUCE-1126:
------------------------------------------

Replacing the type-driven serialization with an explicitly specified, context-sensitive factory is 1) throwing away all Java type hierarchies, 2) asserting that the serialization defines the user types, and 3) implying that these types- and the relationships between them- should remain opaque to the MapReduce framework.

It's making the tradeoff discussed in HADOOP-6323: all the type checks are removed from the framework and enforced by the serializer instead. So {{WritableSerialization}}- appropriately- requires an exact match for the configured class, but other serializers may not. The MapReduce framework can't do any checks of its own- nor, notably, can Java- to verify properties of the types users supply; their semantics are _defined by_ the serialization. For example, a job using related {{Writable}} types may pass a compile-time type check and work with explicit Avro serialization in the intermediate data, but fail if run with implicit Writable serialization.

This is a *huge* shift. It means the generic Java types for the {{Mapper}}, {{Reducer}}, collector, etc. literally don't matter; they're effectively all {{Object}} (relying on autoboxing to collect primitive types). It means every serialization has its own type semantics, which need not look anything like what Java can enforce, inspect, or interpret. Given this, it's not entirely surprising that the patch makes the serialization the most prominent interface to MapReduce.

It's also powerful functionality. By allowing any user type to be serialized/deserialized per context, the long-term elimination of the key/value distinction doesn't change {{collect(K,V)}} to {{collect(Object)}} as proposed, but rather to {{collect(Object...)}}: the serializer transforms the record into bytes, and the comparator works on that byte range, determining which bytes are relevant per the serialization contract. Especially for frameworks written on top of MapReduce, less restrictive interfaces here would surely be fertile ground for performance improvements.

That said: I hate this API for users. Someone writing a MapReduce job is writing a transform of data; how these data are encoded in different contexts is usually irrelevant to their task. Forcing the user to pick a serialization to declare their types to- rather than offering their types to MapReduce- is backwards for the vast majority of cases. Consider the {{Writable}} subtype example above: one is tying the correctness of the {{Mapper}} to the intermediate serialization declared in the submitter code, whose semantics are inscrutable. That's just odd. If one's map is going to emit data without a common type, then doesn't it make sense to declare that instead of leaving the signature as {{Object}}? That is, particularly given MAPREDUCE-1411, wouldn't the equivalent of {{Mapper<Text,Text,Text,AvroRecord>}} be a more apt signature than {{Mapper<Text,Text,Text,Object>}} for an implementation emitting {{int}} and {{String}} as value types?

I much prefer the semantics of the global serializer, but wouldn't object to adding an inconspicuous knob in support of context-sensitive serialization. Would a {{Job::setSerializationFactory(CTXT, SerializationFactory...)}} method, where {{CTXT}} is an enumerated type of framework hooks (e.g. {{DEFAULT}}, {{MAP_OUTPUT_KEY}}, {{MAP_OUTPUT_VALUE}}, etc.), be satisfactory?
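To make that concrete, here's a rough sketch of the knob I have in mind; the enum, the setter, and {{AvroSerializationFactory}} are illustrative names only, not anything in the current patch:

{code:java}
// Sketch only: neither this enum nor the Job method below exists yet;
// all names are illustrative.
public enum SerializationContext {
  DEFAULT, MAP_OUTPUT_KEY, MAP_OUTPUT_VALUE /* etc. */
}

// Proposed addition to Job:
//   void setSerializationFactory(SerializationContext ctxt,
//                                SerializationFactory... factories);

// Submitter code: the common, type-driven case is untouched; an advanced
// user overrides a single hook.
Job job = new Job(conf, "avro intermediate values");
job.setMapOutputKeyClass(Text.class);
job.setSerializationFactory(SerializationContext.MAP_OUTPUT_VALUE,
    new AvroSerializationFactory());   // hypothetical factory
{code}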
This way, one can instruct the framework to use/prefer a particular serialization in one context without requiring most users to change their jobs. It also permits continued use of largely type-based serialization, which- as Tom notes- is a very common case. Writing wrappers can be irritating, but for the MR API, I'd rather make it easier on common cases and users than on advanced uses and framework authors.

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class. Instead we should use the Serialization API to create key comparators. This would permit, e.g., Avro-based comparators to be used, permitting efficient sorting of complex data types without having to write a RawComparator in Java.
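To illustrate the kind of comparator the description above has in mind- one supplied by the serialization rather than written by the user- a rough sketch against Avro's {{BinaryData}} might look like the following; the class name is illustrative, not the one in the attached patches:

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryData;
import org.apache.hadoop.io.RawComparator;

/**
 * Illustrative sketch: a raw comparator backed by an Avro schema, the kind
 * of comparator a Serialization could hand to the shuffle so users never
 * write a RawComparator themselves.
 */
public class SchemaRawComparator implements RawComparator<Object> {
  private final Schema schema;

  public SchemaRawComparator(Schema schema) {
    this.schema = schema;
  }

  /** Compare two serialized keys directly in the shuffle's byte buffers. */
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return BinaryData.compare(b1, s1, b2, s2, schema);
  }

  /** Object comparison is not needed on the raw sort path. */
  public int compare(Object o1, Object o2) {
    throw new UnsupportedOperationException("raw comparison only");
  }
}
{code}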