[ 
https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931566#action_12931566
 ] 

Doug Cutting commented on HADOOP-6685:
--------------------------------------

Owen, thanks for the slides.  I don't see a direct relation between this issue 
and the issue of simplifying the implementation of efficient map-side joins 
(MAPREDUCE-1183, more or less).  Am I missing the connection, or is this a 
distinct issue?

File formats are forever.  More variations add significant, long-term 
compatibility burdens to the project.  We badly need to add support for a 
higher-level object serialization system than Writable.  But I'm not convinced 
it's wise to add such support to the existing Java-only container file formats. 
 So I'm all for a more generic serialization API that can be used by MapReduce 
applications.  I don't however see that it follows that we should provide 
implementations of file formats with a large number of different serialization 
systems, as that invites multiplicative long-term support issues.  I'd prefer 
that we instead direct users towards a single preferred high-level 
serialization system and a single preferred container.  Historically that's 
been Writable and SequenceFile.  We now need to migrate from these to a more 
expressive, language-independent serialization system and container file.  Our 
APIs should of course be general enough that it's possible to incorporate 
different serialization systems and different file formats, but we needn't 
provide implementations of all combinations of these; we should rather direct 
folks towards a primary implementation.

Google benefits tremendously by having a single standard serialization system 
and container file format.  The Dremel paper 
(http://sergey.melnix.com/pub/melnik_VLDB10.pdf) argues that this is an 
essential enabler of their wide variety of interoperable systems.  The further 
we depart from this the harder we make it to build systems like Dremel that 
multiply the utility of stored data.

Changing serialization systems or file formats is a major imposition for many 
applications.  They cannot afford to do it frequently.  We should provide a 
clear path forward from Writable+SequenceFile to a new system that's easier to 
use, less fragile, and language-independent to better facilitate a rich 
ecosystem of tools.

> Change the generic serialization framework API to use serialization-specific 
> bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: serial.patch, SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for 
> the serialization specific configuration. Since this data is really internal 
> to the specific serialization, I think we should change it to be an opaque 
> binary blob. This will simplify the interface for defining specific 
> serializations for different contexts (MAPREDUCE-1462). It will also move us 
> toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).
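
To make the proposed change in the issue description concrete, here is a
minimal sketch contrasting the two configuration styles. All names in it
(SerializationConfigSketch, asMap, asBlob, fromBlob, the "example.schema"
key) are hypothetical and chosen for illustration; this is not the Hadoop
API and not the contents of serial.patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: every name here is hypothetical, not Hadoop's API.
class SerializationConfigSketch {

    // Map style: the framework carries string pairs, so each
    // serialization must flatten its settings into keys and values.
    static Map<String, String> asMap(String schema) {
        Map<String, String> config = new HashMap<>();
        config.put("example.schema", schema); // hypothetical key
        return config;
    }

    // Blob style: the serialization encodes its own settings and the
    // framework only stores and forwards the bytes without reading them.
    static byte[] asBlob(String schema) {
        return schema.getBytes(StandardCharsets.UTF_8);
    }

    // Only the owning serialization knows how to decode its blob.
    static String fromBlob(byte[] blob) {
        return new String(blob, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String schema = "{\"type\":\"string\"}";
        byte[] blob = asBlob(schema);
        // The settings survive a round trip through the opaque form.
        System.out.println(fromBlob(blob).equals(schema));
    }
}
```

In the blob style the framework never inspects the configuration, which is
the sense in which the issue calls the data internal to the serialization.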

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
