[
https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538761
]
Vivek Ratan commented on HADOOP-1986:
-------------------------------------
>> Yes, that's more-or-less assumed, but I yet fail to see it as a problem. All
>> record classes are generated from an IDL and it should be easy to generate a
>> no-arg ctor for those. Ditto for thrift. Things that implement Writable
>> today already must have a no-arg ctor. Can you please provide a more
>> detailed example of something that would prove difficult and why it is
>> important that it be easy?
I guess I'm doing a poor job explaining my point because the larger issue seems
to be missed. Given that there are two kinds of deserializers, those that
create objects and those that take in object references, we're discussing what
the deserialization interface should look like to handle both these kinds of
deserializers, right? We seem to have two choices: have the client figure out
which kind of deserializer it is interacting with and have it call the right
deserialize method, or have a single deserialize method and let the
deserializer create an object where necessary or use one provided by the
client. Right? Doug provided an example of the latter, and I thought there were
some issues. The biggest one is this: for a deserializer to create an object,
it needs to know the type of the object that is being deserialized. Some
deserializers (such as Java's serialization, and Writables, I think) know this,
because the class name is part of the serialized data. Others, such as Thrift
or Record I/O, do not serialize the class name (I'm pretty sure about Record
I/O, and the Thrift code I saw some time back didn't serialize class names, as
far as I can remember), so they do not know which object to create. Doug
suggested that the deserializer store the class name when it is created by the
class factory. I said that wouldn't work because you will likely want a
singleton deserializer object to handle deserializing more than one class,
so you cannot tie a deserializer object to a single class. What this means is
that, IMO, you cannot get Thrift or Record I/O deserializers to create objects,
the way they work today, and for them, the client has to pass in an object.
Similarly, there can be other deserializers that always create an object, and
they cannot use one passed in by the client. That is the crux of my argument.
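To make the fill-in-place kind concrete, here's a hypothetical sketch (the class and method names are illustrative, not from the patch): a Record I/O- or Thrift-style deserializer can only fill in a caller-supplied object, because the byte stream carries field values but no class name.

```java
import java.io.*;

// Hypothetical sketch: a Record I/O- or Thrift-style deserializer.
// The serialized bytes contain only field values, never a class name,
// so the deserializer has no way to create the target object itself.
class IntPair {
    int first, second;
    IntPair(int f, int s) { first = f; second = s; } // note: no no-arg ctor needed
}

class IntPairDeserializer {
    // The client passes the object in; the deserializer just fills it,
    // since it cannot learn the target class from the stream.
    void deserialize(DataInput in, IntPair reuse) throws IOException {
        reuse.first = in.readInt();
        reuse.second = in.readInt();
    }
}
```

Nothing in this sketch forces IntPair to have a no-arg constructor, which is exactly the flexibility the client-supplies-the-object style preserves.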

I also mentioned that for a deserializer to create its own objects, it requires
all deserialized objects to support no-arg constructors. Yes, Thrift,
Record I/O, and Writables do so, but if we want to support various other kinds
of serialization platforms, we're forcing every supported platform to use
no-arg constructors. This seems like an unnecessary restriction to me. I don't
have an example of a deserializer where this would be an issue, but I can
easily imagine situations where you have objects without no-arg constructors
(there are lots of objects that we design where we don't want no-arg
constructors) that you want to deserialize. Anyway, this is a minor point,
and mostly theoretical (though valid, IMO). But it adds to my argument that you
want to have separate deserialize methods and let the client call the right
one. (There is also my argument that it's good design to have separate methods
to make memory management explicit, especially for languages like C++, but I
admit it's not a strong argument if we're only looking at Java).
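The separate-methods design I'm arguing for might look something like this (a hypothetical sketch, not the proposed API): the client explicitly calls whichever variant its serialization framework supports, instead of one method guessing.

```java
import java.io.*;

// Hypothetical sketch of the two-method design (illustrative names):
// the client picks the variant its framework supports.
interface Deserializer<T> {
    // For frameworks whose serialized data carries the class name
    // (e.g. Java serialization): the deserializer creates the instance.
    T deserialize(DataInput in) throws IOException;

    // For frameworks like Thrift or Record I/O that don't serialize the
    // class name: the client supplies (and owns) the instance to fill.
    T deserialize(DataInput in, T reuse) throws IOException;
}

// A toy implementation over two-int arrays, just to show both call styles.
class PairDeserializer implements Deserializer<int[]> {
    public int[] deserialize(DataInput in) throws IOException {
        return deserialize(in, new int[2]);   // creating variant
    }
    public int[] deserialize(DataInput in, int[] reuse) throws IOException {
        reuse[0] = in.readInt();
        reuse[1] = in.readInt();
        return reuse;                          // filling variant
    }
}
```

A deserializer that cannot create objects would simply not implement (or would reject) the first method, and the memory-management contract is explicit at each call site.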

Again, my point is that deserializers for Thrift and Record I/O cannot create
objects themselves and will always require the client to pass in the object (or
invoke the deserialize method on a known object), so they, or a layer around
them, cannot support a single deserialize method that can optionally take in
an object from a client or create one of its own, at least not without a lot of
pain.
> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
> Key: HADOOP-1986
> URL: https://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java, serializer-v1.patch
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable
> key-value pairs. While it's possible to write Writable wrappers for other
> serialization frameworks (such as Thrift), this is not very convenient: it
> would be nicer to be able to use arbitrary types directly, without explicit
> wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.