[
https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539065
]
Vivek Ratan commented on HADOOP-1986:
-------------------------------------
>> Why must you have a singleton serializer instance that handles more than one
>> class?
For many reasons. An easy one I can think of is that serializer instances can
have state (an input or output stream that they keep open across each
serialization, for example). We've been talking about stateful serializers
earlier in this discussion, and it seems quite possible that we'll associate
state with serializers for performance. Let's say you have a key class and a
value class, both of which are generated by the Record I/O compiler, so that
both inherit from the Record base class. And let's say you
want to serialize a number of keys and values into a file: a key, followed by a
value, followed by another key, and so on. If you have a separate serializer
instance for each of the key and value classes, they need to share the same
OutputStream object for the file you serialize them to. Having one serializer
instance that handles both keys and values (since they're both Records) is
cleaner and easier. It's also quite possible that we'll have serialization
platforms that carry other state (maybe they use libraries that need to be
initialized once, for example). So forbidding people from creating serializers
that handle more than one class seems restrictive. The choice of whether the
serialization platform shares an instance across multiple classes should be
left to the platform.
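To make the key/value example concrete, here is a minimal sketch of a single stateful serializer shared across Record subtypes. All names here (Record, RecordSerializer, SharedSerializerSketch) are illustrative stand-ins, not the actual Record I/O classes: the point is only that one instance owns the stream, so keys and values interleave into the same output.

{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for the Record I/O base class.
interface Record {
    void write(DataOutputStream out) throws IOException;
}

// One stateful serializer instance shared by every Record subtype:
// it owns the stream, so keys and values interleave into one file.
class RecordSerializer {
    private final DataOutputStream out;

    RecordSerializer(ByteArrayOutputStream sink) {
        this.out = new DataOutputStream(sink);
    }

    void serialize(Record r) throws IOException {
        r.write(out); // each Record knows how to write itself
    }
}

public class SharedSerializerSketch {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        RecordSerializer serializer = new RecordSerializer(sink);

        Record key = out -> out.writeUTF("key1"); // 2-byte length + 4 bytes
        Record value = out -> out.writeInt(42);   // 4 bytes

        // The same instance handles both the key and the value class.
        serializer.serialize(key);
        serializer.serialize(value);

        System.out.println("bytes written: " + sink.size());
    }
}
{code}

With separate per-class serializer instances, the two would instead have to be handed the same OutputStream explicitly and coordinate on it.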
>> So would clients like SequenceFile and the mapreduce shuffle require
>> different code to deserialize different classes? We need to have generic
>> client code.
Yes, and that is the fundamental tradeoff. The flip side of what I'm suggesting
is that the client has to write separate code for the two kinds of serializers.
That's not great, but I'm arguing that it's still better than restricting the
kinds of serialization platforms we can use, or how we use them. The client
will have to write something like:
{code}
if (serializer.acceptObjectReference()) {
  // Caller creates the instance; the serializer fills it in.
  <some Class> o = new <some Class>();
  serializer.deserialize(o);
  ...
} else {
  // The serializer creates and returns the instance itself.
  <some Class> o = serializer.deserialize();
  ...
}
{code}
Yeah, it's not great, but it's not so bad either, compared to forbidding
serialization platforms from sharing serializer instances. But it is a
tradeoff. If folks think we're OK forcing serialization platforms not to share
serializer instances across classes, resulting in cleaner client code, then
that's fine. I personally would choose the opposite. But I hope the tradeoff
and the pros and cons are clear.
>> Again, I don't see why Record I/O, where we control the code generation from
>> an IDL, cannot generate a no-arg ctor. Similarly for Thrift. The ctor does
>> not have to be public. We already bypass protections when we create
>> instances.
Well, yes for Thrift and Record I/O, but maybe not for some other platform we
may want to support in the future (and whose code we cannot control). And
besides, no-arg constructors are not the main reason for supporting a single
deserialize method; singleton serializers are.
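On "we already bypass protections when we create instances": for platforms whose generated classes we do control, a non-public no-arg constructor is no obstacle, since reflection can invoke it anyway. A minimal sketch (the class names here are hypothetical, not actual Hadoop code):

{code}
import java.lang.reflect.Constructor;

// Stand-in for a class an IDL compiler might generate,
// with a deliberately non-public no-arg constructor.
class GeneratedRecord {
    private GeneratedRecord() {}
}

public class CtorBypassSketch {
    public static void main(String[] args) throws Exception {
        Constructor<GeneratedRecord> ctor =
            GeneratedRecord.class.getDeclaredConstructor();
        ctor.setAccessible(true); // bypass the access modifier
        GeneratedRecord r = ctor.newInstance();
        System.out.println(r != null);
    }
}
{code}

So the constructor's visibility is not the blocker; the question is whether we can assume a no-arg constructor exists at all for platforms we don't control.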
> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
> Key: HADOOP-1986
> URL: https://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java, serializer-v1.patch
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable
> key-value pairs. While it's possible to write Writable wrappers for other
> serialization frameworks (such as Thrift), this is not very convenient: it
> would be nicer to be able to use arbitrary types directly, without explicit
> wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.