[
https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931566#action_12931566
]
Doug Cutting commented on HADOOP-6685:
--------------------------------------
Owen, thanks for the slides. I don't see a direct relation between this issue
and the issue of simplifying the implementation of efficient map-side joins
(MAPREDUCE-1183, more or less). Am I missing the connection, or is this a
distinct issue?
File formats are forever. More variations add significant, long-term
compatibility burdens to the project. We badly need to add support for a
higher-level object serialization system than Writable. But I'm not convinced
it's wise to add such support to the existing Java-only container file formats.
So I'm all for a more generic serialization API that can be used by MapReduce
applications. I don't, however, see that it follows that we should provide
implementations of file formats with a large number of different serialization
systems, as that invites multiplicative long-term support issues. I'd prefer
that we instead direct users towards a single preferred high-level
serialization system and a single preferred container. Historically that's
been Writable and SequenceFile. We now need to migrate from these to a more
expressive, language-independent serialization system and container file. Our
APIs should of course be general enough that it's possible to incorporate
different serialization systems and different file formats. We needn't provide
implementations of all combinations of these, however, and should rather direct
folks towards a primary implementation.
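
To make the "general API, one primary implementation" idea concrete, here is a
minimal sketch in Java. Every name in it (Serialization, SerializationRegistry,
Utf8StringSerialization, BigEndianIntSerialization) is hypothetical and chosen
for illustration only; it is not Hadoop's actual serializer interface, and the
real design may well differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plug-in point: one implementation per serialization system. */
interface Serialization<T> {
    /** Returns true if this serialization can handle the given class. */
    boolean accepts(Class<?> type);
    byte[] serialize(T value);
    T deserialize(byte[] bytes);
}

/** Illustrative implementation: strings encoded as UTF-8 bytes. */
final class Utf8StringSerialization implements Serialization<String> {
    public boolean accepts(Class<?> type) { return type == String.class; }
    public byte[] serialize(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }
    public String deserialize(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}

/** Illustrative implementation: ints as 4 big-endian bytes. */
final class BigEndianIntSerialization implements Serialization<Integer> {
    public boolean accepts(Class<?> type) { return type == Integer.class; }
    public byte[] serialize(Integer value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }
    public Integer deserialize(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }
}

/** The framework only sees the interface; implementations are registered. */
final class SerializationRegistry {
    private final List<Serialization<?>> registered = new ArrayList<>();

    void register(Serialization<?> serialization) {
        registered.add(serialization);
    }

    /** Returns the first registered serialization that accepts the type. */
    @SuppressWarnings("unchecked")
    <T> Serialization<T> lookup(Class<T> type) {
        for (Serialization<?> candidate : registered) {
            if (candidate.accepts(type)) {
                return (Serialization<T>) candidate;
            }
        }
        throw new IllegalArgumentException(
            "No serialization registered for " + type.getName());
    }
}
```

In a sketch like this the API stays open to additional systems, while a project
could still ship and document only one registered implementation as the
preferred default.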
Google benefits tremendously by having a single standard serialization system
and container file format. The Dremel paper
(http://sergey.melnix.com/pub/melnik_VLDB10.pdf) argues that this is an
essential enabler of their wide variety of interoperable systems. The further
we depart from this, the harder we make it to build systems like Dremel that
multiply the utility of stored data.
Changing serialization systems or file formats is a major imposition for many
applications. They cannot afford to do it frequently. We should provide a
clear path forward from Writable+SequenceFile to a new system that's easier to
use, less fragile, and language-independent to better facilitate a rich
ecosystem of tools.
> Change the generic serialization framework API to use serialization-specific
> bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-6685
> URL: https://issues.apache.org/jira/browse/HADOOP-6685
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Attachments: serial.patch, SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for
> the serialization specific configuration. Since this data is really internal
> to the specific serialization, I think we should change it to be an opaque
> binary blob. This will simplify the interface for defining specific
> serializations for different contexts (MAPREDUCE-1462). It will also move us
> toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).