[
https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533604
]
Vivek Ratan commented on HADOOP-1986:
-------------------------------------
Owen, your details helped me understand your proposal a little bit more, but
I'm still unsure why we need serializers for various base types. Let me
describe in some detail what I was thinking about, and then let's see how
different it is from what you're proposing. It's a relatively long writeup
(again), but I couldn't think of a better way.
I think we all agree with Tom's proposal that we want the Mapper to be
instantiated for any kind of key and value classes, so let's start with that
requirement. As I mentioned in my earlier comment, there are two approaches.
Approach 1
--------------------
Let's say we have a Serializer interface as follows:
{code}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public interface Serializer {
  public void serialize(Object o, OutputStream out) throws IOException;
  public void deserialize(Object o, InputStream in) throws IOException;
}
{code}
A key difference in my proposal is that, as far as a user is concerned, a
Serializer can serialize any kind of object, not just an object of a
particular base type.
Next, we implement Serializer for each serialization platform we want to
support in Hadoop. Let's take Record I/O and Thrift, for example. We
would have:
{code}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.lang.reflect.Field;
import org.apache.hadoop.record.CsvRecordOutput;
import org.apache.hadoop.record.RecordOutput;

public class RecordIOSerializer implements Serializer {
  public enum Format { CSV, BINARY, XML }  // stands in for the original "output_type"

  // this is our link to the Record I/O platform
  private RecordOutput out;

  public void init(Format format, OutputStream stream) {
    switch (format) {
      case CSV: out = new CsvRecordOutput(stream); break;
      // ... the other Record I/O formats would be handled here
    }
  }

  public void serialize(Object o, OutputStream stream) throws IOException {
    // use reflection to walk through the class members, writing each
    // base-typed field via the RecordOutput set up in init()
    for (Field field : o.getClass().getDeclaredFields()) {
      field.setAccessible(true);
      try {
        Class<?> type = field.getType();
        if (type == byte.class) {
          out.writeByte(field.getByte(o), field.getName());
        } else if (type == String.class) {
          out.writeString((String) field.get(o), field.getName());
        }
        // ... the remaining base types are handled the same way
      } catch (IllegalAccessException e) {
        throw new IOException(e.toString());
      }
    }
  }

  public void deserialize(Object o, InputStream in) throws IOException {
    // mirror image: walk the fields in the same order, reading from a RecordInput
  }
}
{code}
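Just to make the Approach 1 user experience concrete, here is a rough usage sketch
(_MyPair_, its fields, and the _Format_ enum above are made-up names for illustration;
the point is that the user's class needs no DDL and no generated stub):
{code}
// A plain user-defined key class: no DDL, no generated code.
public class MyPair {
  byte tag = 1;
  String name = "example";
}

// Elsewhere, the framework (or a test) would simply do:
java.io.ByteArrayOutputStream stream = new java.io.ByteArrayOutputStream();
RecordIOSerializer ser = new RecordIOSerializer();
ser.init(RecordIOSerializer.Format.CSV, stream);
ser.serialize(new MyPair(), stream);
{code}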
Similarly, we would have a _ThriftSerializer_ class that implements
_Serializer_. It would probably have a Thrift _TProtocol_ object as a member
variable (this could be a _TBinaryProtocol_ object if we want binary format, or
a corresponding one for another format). _ThriftSerializer.serialize()_ would
be similar to _RecordIOSerializer.serialize()_ in that the code would use
reflection to walk through the class members and call _TProtocol_ methods to
serialize base types.
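To show what I have in mind, here is a rough sketch of such a _ThriftSerializer_;
the Thrift package/class names and method signatures below are from memory and
should be treated as assumptions rather than gospel:
{code}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.lang.reflect.Field;
import com.facebook.thrift.protocol.TProtocol;

public class ThriftSerializer implements Serializer {
  // our link to the Thrift platform: a TBinaryProtocol for the binary wire
  // format, or some other TProtocol implementation for another format
  private TProtocol prot;

  public void init(TProtocol prot) {
    this.prot = prot;
  }

  public void serialize(Object o, OutputStream stream) throws IOException {
    // reflection pass, mirroring RecordIOSerializer.serialize()
    try {
      for (Field field : o.getClass().getDeclaredFields()) {
        field.setAccessible(true);
        Class<?> type = field.getType();
        if (type == byte.class) {
          prot.writeByte(field.getByte(o));
        } else if (type == String.class) {
          prot.writeString((String) field.get(o));
        }
        // ... the remaining base types map onto the other TProtocol write calls
      }
    } catch (Exception e) {   // TException, IllegalAccessException, ...
      throw new IOException(e.toString());
    }
  }

  public void deserialize(Object o, InputStream in) throws IOException {
    // mirror image using the TProtocol read calls, in the same field order
  }
}
{code}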
Whenever we need to serialize/deserialize an object in MapReduce, we obtain the
right serializer object (_RecordIOSerializer_ or _ThriftSerializer_ or whatever)
from a factory or through a configuration file, and then serialize/deserialize
any object with it. In my earlier comment, I'd mentioned two
phases when serializing/deserializing. In this approach, the first phase, that
of walking through the class structure, is done in a generic manner, while we
invoke a specific serialization platform to do the second phase (that of
writing/reading base types).
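As a strawman, the lookup could be as simple as the sketch below; the
_SerializerFactory_ class and the "io.serializer.class" config key are invented
names, purely for illustration:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical factory: picks a Serializer implementation from the job config.
public class SerializerFactory {
  public static Serializer getSerializer(Configuration conf) {
    // Fall back to the reflection-based Record I/O serializer if nothing is configured.
    Class serClass =
        conf.getClass("io.serializer.class", RecordIOSerializer.class, Serializer.class);
    return (Serializer) ReflectionUtils.newInstance(serClass, conf);
  }
}
{code}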
This approach is useful in that clients do not have to write DDLs for any class
they want to serialize and there is no compilation of DDLs either. You also do
not need a generic interface that all serialization platforms have to implement
(so there is no need to change anything in Record I/O or Thrift). The tradeoff
is the potential expense of reflection. A deserializer may be a little harder
to write because you have to ensure that you walk through the class in the same
order as you did when serializing it.
Approach 2
-----------------
Another approach is to have the serialization platform handle both phases
(walking through the class structure and writing/reading basic types). This
approach, I think, is closer to what Owen and Tom are suggesting. For this
approach, we have to assume that any serialization platform generates classes
that can read/write themselves and that all have a common base type. In Record
I/O, this base type would be _org.apache.hadoop.record.Record_, while for
Thrift, it would be the interface that Thrift has just added for all generated
classes (based on their conversations today). So you would have a type-based
_Serializer_ interface, as Tom describes in the first comment in this
discussion, and you would have an implementation for
_org.apache.hadoop.record.Record_ and one for the Thrift interface. Something
like this:
{code}
class RecordIOSerializer2
    implements Serializer<org.apache.hadoop.record.Record> {
  void serialize(org.apache.hadoop.record.Record t, OutputStream o) {
    RecordOutput out = getRecordOutput(o, ...);
    t.serialize(out);
  }
  ...
}
{code}
or
{code}
class ThriftSerializer2 implements Serializer<com.facebook.Thrift> {
  void serialize(com.facebook.Thrift t, OutputStream o) {
    TProtocol prot = getThriftProtocol(o, ...);
    t.write(prot);
  }
  ...
}
{code}
In this approach, any class that needs to be serialized/deserialized would have
to be described in a DDL (if using Thrift or Record I/O), and a stub
generated. So each Key or Value class would have to be generated from its
DDL. The stub contains the code for walking through the class and
reading/writing it. For example, if I have a class _MyClass_ that I want to use
as a Key in a Map/Reduce task, I would write a DDL for it and run the
appropriate compiler. In Record I/O, I would get a Java class _MyClass_, which
would extend _org.apache.hadoop.record.Record_, and this is the class I would
use for my Map/Reduce job. I'd also have to make sure that I associate
_Serializer<org.apache.hadoop.record.Record>_ with this class.
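To make that flow concrete, a rough sketch of what the user ends up with (the
field and setter on _MyClass_ are hypothetical, standing in for whatever its
DDL declares):
{code}
// Generated by the Record I/O compiler from MyClass's DDL:
//   public class MyClass extends org.apache.hadoop.record.Record { ... }

MyClass key = new MyClass();
key.setCount(42);   // hypothetical generated setter

// The framework picks the serializer bound to Record subclasses:
Serializer<org.apache.hadoop.record.Record> ser = new RecordIOSerializer2();
ser.serialize(key, new java.io.ByteArrayOutputStream());
{code}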
The benefit here is that the classes generated by the DDL compiler are
optimized for performance, but on the flip side, you need to define a DDL for
each class you want serialized, and generate its stub through a DDL compiler.
This may be OK for classes in core Hadoop, but it can be a cumbersome step for
user-defined classes.
Summary
---------------
These are the two approaches I can think of, and I suspect your approach maps
closely with Approach 2. I personally think that Approach 1 is significantly
easier to use and should be preferred unless reflection comes with an
unacceptable run-time cost. No DDLs, no stubs, nothing for the user to do. We
could also offer both, or use Approach 2 for core Hadoop classes and Approach 1
for classes the user defines. Approach 1 is also cleaner, IMO. You want to give
the user a serializer/deserializer that reads/writes any Object. The
_serialize()_ and _deserialize()_ methods should accept any object, or should
be methods available off of any class that can be serialized/deserialized.
Owen/Tom, I'm not sure I fully understand your approach (though I do think it's
similar to Approach 2), so perhaps you can provide some more details. If a user
wants to use their own class for a Key or Value, and they want to use Thrift or
Record I/O or their own custom code for serialization/deserialization, do they
have to define a DDL for that class and run it through a compiler? Would you
also instantiate _Serializer_ for any type other than the base types for Record
I/O or Thrift? At a higher level, I still don't like the idea of binding
serializers to types. You may want to do that for implementation's sake, but a
serializer's interface to a user should accept any object.
I haven't talked about implementation details regarding how you would configure
serializers for various Map/Reduce steps (deserializing Map input, serializing
Reduce output, storing intermediate data...). Those seem relatively easy to
resolve. We can use config files, or class factories, or whatever.
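For instance, per-step wiring could be as mundane as a couple of job properties;
the property names below are invented just to show the shape of it:
{code}
JobConf job = new JobConf();
// Hypothetical property names, one per Map/Reduce step that needs a serializer:
job.setClass("mapred.map.output.key.serializer", RecordIOSerializer.class, Serializer.class);
job.setClass("mapred.reduce.output.value.serializer", ThriftSerializer.class, Serializer.class);
{code}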
> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
> Key: HADOOP-1986
> URL: https://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable
> key-value pairs. While it's possible to write Writable wrappers for other
> serialization frameworks (such as Thrift), this is not very convenient: it
> would be nicer to be able to use arbitrary types directly, without explicit
> wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.