[jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map for configuration

Owen O'Malley (JIRA) Thu, 18 Nov 2010 21:44:46 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933686#action_12933686
 ]


Owen O'Malley commented on HADOOP-6685:
---------------------------------------

There were a few driving goals:
* Getting ProtocolBuffers, Thrift, and Avro types through MapReduce end to end. 
Obviously this includes supporting SequenceFiles, which are where the bulk of 
Hadoop data is currently stored.
* Supporting context-specific serializations (input key, input value, shuffle 
key, shuffle value, output key, output value, etc) so that different 
serialization options can chosen depending on the application's requirements. 
(MAPREDUCE-1452)
* Using serialization of the "active" objects themselves (input format, mapper, 
reducer, output format, etc.) to simplify making compound objects. This will 
allow us to get rid of the static methods to define properties like input and 
output directory without pushing them into the framework. (MAPREDUCE--1183)
* Clean up the serialization interface to make it  clear that each object has 
to be serialized independently. The current API gives the impression that the 
Serializer and Deserializer can hold state, which is incorrect, and led to a 
bug in the first implementation of the Java serialization plugin.

The first attempt to generalize the serialization metadata was done via string 
to string maps. Since MapReduce already has a configuration, which is a string 
to string map, they used that. However, they needed to nest the serialization 
maps into the configuration. So for each context there was a prefix string and 
the values under that prefix were the metadata for that serialization. This 
worked, but was very ugly. It lead to "stringly-typed" interfaces where you 
needed to read all of the code to figure out what the legal values for the 
configuration were. The code was full of  static methods for each serialization 
in each context that updated the configuration. Further, since it was never 
clear what was intended to be "visible" versus "opaque" the user ended up being 
responsible for all of it.

Therefore, I decided to use another approach. Instead of use string to string 
maps, we use bytes to capture the metadata. The bytes are opaque except to the 
serialization itself. This allows the serialization to define what data is 
important to it and handle it in a type and version safe manner.  It is also 
symmetric to the solution of MR-1183 where you use component specific metadata 
to save their parameters. That is the framework that has been laid out in this 
patch. It includes the work on the container files to show that it can be used 
to write and read the different serializations. It includes the serializations 
to show that they work correctly when used together. By making the framework 
use typed metadata instead of the very generic, but type-less, string to string 
map many user errors will be avoided.

Part of the lesion learned from the train wreck of MR-1126 was that 
implementing sweeping changes to the API and framework by writing and 
committing little patches spread out over 6 months is not a healthy way of 
working. The reviewer needs to understand why they care and how the parts are 
going to work together. I should have done this jira in a public development 
branch, but that wouldn't have lessened this debate. Doug and I just disagree 
about the design of the interface. The indication that he gave when I gave the 
presentation on my plan 5 months ago was that he didn't like it, but wouldn't 
block it. He reiterated that position on this jira 6 days ago. Have you changed 
your mind, Doug?

To Doug's specific points:

{quote}
inheritance is used in serialization implementations, and inheritance is harder 
to implement with binary objects
{quote}

Actually handling extensions is quite easy using protocol buffers, which is 
part of why I chose to use them for storing the metadata. Inheritance in string 
to string maps is quite tricky and must be managed completely by the plugin 
writer.

{quote
binary encodings are less transparent and create binary serialization bootstrap 
problems
{quote}

I will grant you they are less transparent and require a tool to dump their 
contents.  Bootstrapping wasn't a problem at all. (Granted, it would have been 
a problem with Avro, as I discussed here http://bit.ly/cJ1tVp 

{quote}
serialization metadata is not large nor read/written in inner loops, so binary 
is not required
{quote}

It isn't required, but it isn't a problem either.

{quote}
using a binary encoding for serialization metadata will require substantial 
changes to serialization clients.
{quote}

The change to the clients is the same size, regardless of whether the metadata 
is encoded in binary or string to string maps. It is extra information that 
needs to be available. The data is smaller and type-safe if it is done in 
binary compared to string to string maps.


> Change the generic serialization framework API to use serialization-specific 
> bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>
>         Attachments: libthrift.jar, serial.patch, serial4.patch, 
> serial6.patch, serial7.patch, SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for 
> the serialization specific configuration. Since this data is really internal 
> to the specific serialization, I think we should change it to be an opaque 
> binary blob. This will simplify the interface for defining specific 
> serializations for different contexts (MAPREDUCE-1462). It will also move us 
> toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map for configuration

Reply via email to