[
https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933686#action_12933686
]
Owen O'Malley commented on HADOOP-6685:
---------------------------------------
There were a few driving goals:
* Getting ProtocolBuffers, Thrift, and Avro types through MapReduce end to end.
Obviously this includes supporting SequenceFiles, which are where the bulk of
Hadoop data is currently stored.
* Supporting context-specific serializations (input key, input value, shuffle
key, shuffle value, output key, output value, etc) so that different
serialization options can chosen depending on the application's requirements.
(MAPREDUCE-1452)
* Using serialization of the "active" objects themselves (input format, mapper,
reducer, output format, etc.) to simplify making compound objects. This will
allow us to get rid of the static methods to define properties like input and
output directory without pushing them into the framework. (MAPREDUCE--1183)
* Clean up the serialization interface to make it clear that each object has
to be serialized independently. The current API gives the impression that the
Serializer and Deserializer can hold state, which is incorrect, and led to a
bug in the first implementation of the Java serialization plugin.
The first attempt to generalize the serialization metadata was done via string
to string maps. Since MapReduce already has a configuration, which is a string
to string map, they used that. However, they needed to nest the serialization
maps into the configuration. So for each context there was a prefix string and
the values under that prefix were the metadata for that serialization. This
worked, but was very ugly. It lead to "stringly-typed" interfaces where you
needed to read all of the code to figure out what the legal values for the
configuration were. The code was full of static methods for each serialization
in each context that updated the configuration. Further, since it was never
clear what was intended to be "visible" versus "opaque" the user ended up being
responsible for all of it.
Therefore, I decided to use another approach. Instead of use string to string
maps, we use bytes to capture the metadata. The bytes are opaque except to the
serialization itself. This allows the serialization to define what data is
important to it and handle it in a type and version safe manner. It is also
symmetric to the solution of MR-1183 where you use component specific metadata
to save their parameters. That is the framework that has been laid out in this
patch. It includes the work on the container files to show that it can be used
to write and read the different serializations. It includes the serializations
to show that they work correctly when used together. By making the framework
use typed metadata instead of the very generic, but type-less, string to string
map many user errors will be avoided.
Part of the lesion learned from the train wreck of MR-1126 was that
implementing sweeping changes to the API and framework by writing and
committing little patches spread out over 6 months is not a healthy way of
working. The reviewer needs to understand why they care and how the parts are
going to work together. I should have done this jira in a public development
branch, but that wouldn't have lessened this debate. Doug and I just disagree
about the design of the interface. The indication that he gave when I gave the
presentation on my plan 5 months ago was that he didn't like it, but wouldn't
block it. He reiterated that position on this jira 6 days ago. Have you changed
your mind, Doug?
To Doug's specific points:
{quote}
inheritance is used in serialization implementations, and inheritance is harder
to implement with binary objects
{quote}
Actually handling extensions is quite easy using protocol buffers, which is
part of why I chose to use them for storing the metadata. Inheritance in string
to string maps is quite tricky and must be managed completely by the plugin
writer.
{quote
binary encodings are less transparent and create binary serialization bootstrap
problems
{quote}
I will grant you they are less transparent and require a tool to dump their
contents. Bootstrapping wasn't a problem at all. (Granted, it would have been
a problem with Avro, as I discussed here http://bit.ly/cJ1tVp
{quote}
serialization metadata is not large nor read/written in inner loops, so binary
is not required
{quote}
It isn't required, but it isn't a problem either.
{quote}
using a binary encoding for serialization metadata will require substantial
changes to serialization clients.
{quote}
The change to the clients is the same size, regardless of whether the metadata
is encoded in binary or string to string maps. It is extra information that
needs to be available. The data is smaller and type-safe if it is done in
binary compared to string to string maps.
> Change the generic serialization framework API to use serialization-specific
> bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-6685
> URL: https://issues.apache.org/jira/browse/HADOOP-6685
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Fix For: 0.22.0
>
> Attachments: libthrift.jar, serial.patch, serial4.patch,
> serial6.patch, serial7.patch, SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for
> the serialization specific configuration. Since this data is really internal
> to the specific serialization, I think we should change it to be an opaque
> binary blob. This will simplify the interface for defining specific
> serializations for different contexts (MAPREDUCE-1462). It will also move us
> toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.