[
https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533604
]
Vivek Ratan commented on HADOOP-1986:
-------------------------------------
Owen, your details helped me understand your proposal a little bit more, but
I'm still unsure why we need serializers for various base types. Let me
describe in some detail what I was thinking about, and then let's see how
different it is from what you're proposing. It's a relatively long writeup
(again), but I couldn't think of a better way.
I think we all agree with Tom's proposal that we want the Mapper to be
instantiated for any kind of key and value classes, so let's start with that
requirement. As I mentioned in my earlier comment, there are two approaches.
Approach 1
--------------------
Let's say we have a Serializer interface as follows:
{code}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public interface Serializer {
  public void serialize(Object o, OutputStream out) throws IOException;
  public void deserialize(Object o, InputStream in) throws IOException;
}
{code}
A key difference in my proposal is that, as far as a user is concerned, a
Serializer can serialize any kind of object, not just an object of a
particular base type.
Next, we implement Serializer for each serialization platform we want to
support in Hadoop. Let's take Record I/O and Thrift, for example. We
would have:
{code}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.lang.reflect.Field;
import org.apache.hadoop.record.CsvRecordOutput;
import org.apache.hadoop.record.RecordOutput;

public class RecordIOSerializer implements Serializer {
  public enum Format { CSV, BINARY, XML }  // stands in for the original "output_type"

  // this is our link to the Record I/O platform
  private RecordOutput out;

  public void init(Format format, OutputStream stream) {
    switch (format) {
      case CSV: out = new CsvRecordOutput(stream); break;
      // ... the other Record I/O formats would be handled here
    }
  }

  public void serialize(Object o, OutputStream stream) throws IOException {
    // use reflection to walk through the class members, writing each
    // base-typed field via the RecordOutput set up in init()
    for (Field field : o.getClass().getDeclaredFields()) {
      field.setAccessible(true);
      try {
        Class<?> type = field.getType();
        if (type == byte.class) {
          out.writeByte(field.getByte(o), field.getName());
        } else if (type == String.class) {
          out.writeString((String) field.get(o), field.getName());
        }
        // ... the remaining base types are handled the same way
      } catch (IllegalAccessException e) {
        throw new IOException(e.toString());
      }
    }
  }

  public void deserialize(Object o, InputStream in) throws IOException {
    // mirror image: walk the fields in the same order, reading from a RecordInput
  }
}
{code}
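Just to make the Approach 1 user experience concrete, here is a rough usage sketch
(_MyPair_, its fields, and the _Format_ enum above are made-up names for illustration;
the point is that the user's class needs no DDL and no generated stub):
{code}
// A plain user-defined key class: no DDL, no generated code.
public class MyPair {
  byte tag = 1;
  String name = "example";
}

// Elsewhere, the framework (or a test) would simply do:
java.io.ByteArrayOutputStream stream = new java.io.ByteArrayOutputStream();
RecordIOSerializer ser = new RecordIOSerializer();
ser.init(RecordIOSerializer.Format.CSV, stream);
ser.serialize(new MyPair(), stream);
{code}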
Similarly, we would have a _ThriftSerializer_ class that implements
_Serializer_. It would probably have a Thrift _TProtocol_ object as a member
variable (this could be a _TBinaryProtocol_ object if we want binary format, or
a corresponding one for another format). _ThriftSerializer.serialize()_ would
be similar to _RecordIOSerializer.serialize()_ in that the code would use
reflection to walk through the class members and call _TProtocol_ methods to
serialize base types.
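To show what I have in mind, here is a rough sketch of such a _ThriftSerializer_;
the Thrift package/class names and method signatures below are from memory and
should be treated as assumptions rather than gospel:
{code}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.lang.reflect.Field;
import com.facebook.thrift.protocol.TProtocol;

public class ThriftSerializer implements Serializer {
  // our link to the Thrift platform: a TBinaryProtocol for the binary wire
  // format, or some other TProtocol implementation for another format
  private TProtocol prot;

  public void init(TProtocol prot) {
    this.prot = prot;
  }

  public void serialize(Object o, OutputStream stream) throws IOException {
    // reflection pass, mirroring RecordIOSerializer.serialize()
    try {
      for (Field field : o.getClass().getDeclaredFields()) {
        field.setAccessible(true);
        Class<?> type = field.getType();
        if (type == byte.class) {
          prot.writeByte(field.getByte(o));
        } else if (type == String.class) {
          prot.writeString((String) field.get(o));
        }
        // ... the remaining base types map onto the other TProtocol write calls
      }
    } catch (Exception e) {   // TException, IllegalAccessException, ...
      throw new IOException(e.toString());
    }
  }

  public void deserialize(Object o, InputStream in) throws IOException {
    // mirror image using the TProtocol read calls, in the same field order
  }
}
{code}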
Whenever we need to serialize/deserialize an object in MapReduce, we obtain the
right serializer object (_RecordIOSerializer_ or _ThriftSerializer_ or whatever)
from a factory or through a configuration file, and then serialize/deserialize
any object with it. In my earlier comment, I'd mentioned two
phases when serializing/deserializing. In this approach, the first phase, that
of walking through the class structure, is done in a generic manner, while we
invoke a specific serialization platform to do the second phase (that of
writing/reading base types).
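As a strawman, the lookup could be as simple as the sketch below; the
_SerializerFactory_ class and the "io.serializer.class" config key are invented
names, purely for illustration:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical factory: picks a Serializer implementation from the job config.
public class SerializerFactory {
  public static Serializer getSerializer(Configuration conf) {
    // Fall back to the reflection-based Record I/O serializer if nothing is configured.
    Class serClass =
        conf.getClass("io.serializer.class", RecordIOSerializer.class, Serializer.class);
    return (Serializer) ReflectionUtils.newInstance(serClass, conf);
  }
}
{code}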
This approach is useful in that clients do not have to write DDLs for any class
they want to serialize and there is no compilation of DDLs either. You also do
not need a generic interface that all serialization platforms have to implement
(so there is no need to change anything in Record I/O or Thrift). The tradeoff
is the potential expense of reflection. A deserializer may be a little harder
to write because you have to ensure that you walk through the class in the same
order as you did when serializing it.
Approach 2
-----------------
Another approach is to have the serialization platform handle both phases
(walking through the class structure and writing/reading basic types). This
approach, I think, is closer to what Owen and Tom are suggesting. For this
approach, we have to assume that any serialization platform generates classes
that can read/write themselves and that all have a common base type. In Record
I/O, this base type would be _org.apache.hadoop.record.Record_, while for
Thrift, it would be the interface that Thrift has just added for all generated
classes (based on their conversations today). So you would have a type-based
_Serializer_ interface, as Tom describes in the first comment in this
discussion, and you would have an implementation for
_org.apache.hadoop.record.Record_ and one for the Thrift interface. Something
like this:
{code}
class RecordIOSerializer2
    implements Serializer<org.apache.hadoop.record.Record> {
  void serialize(org.apache.hadoop.record.Record t, OutputStream o) {
    RecordOutput out = getRecordOutput(o, ...);
    t.serialize(out);
  }
  ...
}
{code}
or
{code}
class ThriftSerializer2 implements Serializer<com.facebook.Thrift> {
  void serialize(com.facebook.Thrift t, OutputStream o) {
    TProtocol prot = getThriftProtocol(o, ...);
    t.write(prot);
  }
  ...
}
{code}
In this approach, any class that needs to be serialized/deserialized would have
to be described in a DDL (if using Thrift or Record I/O), and a stub
generated. So each Key or Value class would have to be generated from its
DDL. The stub contains the code for walking through the class and
reading/writing it. For example, if I have a class _MyClass_ that I want to use
as a Key in a Map/Reduce task, I would write a DDL for it and run the
appropriate compiler. In Record I/O, I would get a Java class _MyClass_, which
would extend _org.apache.hadoop.record.Record_, and this is the class I would
use for my Map/Reduce job. I'd also have to make sure that I associate
_Serializer<org.apache.hadoop.record.Record>_ with this class.
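To make that flow concrete, a rough sketch of what the user ends up with (the
field and setter on _MyClass_ are hypothetical, standing in for whatever its
DDL declares):
{code}
// Generated by the Record I/O compiler from MyClass's DDL:
//   public class MyClass extends org.apache.hadoop.record.Record { ... }

MyClass key = new MyClass();
key.setCount(42);   // hypothetical generated setter

// The framework picks the serializer bound to Record subclasses:
Serializer<org.apache.hadoop.record.Record> ser = new RecordIOSerializer2();
ser.serialize(key, new java.io.ByteArrayOutputStream());
{code}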
The benefit here is that the classes generated by the DDL compiler are
optimized for performance, but on the flip side, you need to define a DDL for
each class you want serialized, and generate its stub through a DDL compiler.
This may be OK for classes in core Hadoop, but it can be a cumbersome step for
user-defined classes.
Summary
---------------
These are the two approaches I can think of, and I suspect your approach maps
closely with Approach 2. I personally think that Approach 1 is significantly
easier to use and should be preferred unless reflection comes with an
unacceptable run-time cost. No DDLs, no stubs, nothing for the user to do. We
could also offer both, or use Approach 2 for core Hadoop classes and Approach 1
for classes the user defines. Approach 1 is also cleaner, IMO. You want to give
the user a serializer/deserializer that reads/writes any Object. The
_serialize()_ and _deserialize()_ methods should accept any object, or should
be methods available off of any class that can be serialized/deserialized.
Owen/Tom, I'm not sure I fully understand your approach (though I do think it's
similar to Approach 2), so perhaps you can provide some more details. If a user
wants to use their own class for a Key or Value, and they want to use Thrift or
Record I/O or their own custom code for serialization/deserialization, do they
have to define a DDL for that class and run it through a compiler? Would you
also instantiate _Serializer_ for any type other than the base types for Record
I/O or Thrift? At a higher level, I still don't like the idea of binding
serializers to types. You may want to do that for implementation's sake, but a
serializer's interface to a user should accept any object.
I haven't talked about implementation details regarding how you would configure
serializers for various Map/Reduce steps (deserializing Map input, serializing
Reduce output, storing intermediate data...). Those seem relatively easy to
resolve. We can use config files, or class factories, or whatever.
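For instance, per-step wiring could be as mundane as a couple of job properties;
the property names below are invented just to show the shape of it:
{code}
JobConf job = new JobConf();
// Hypothetical property names, one per Map/Reduce step that needs a serializer:
job.setClass("mapred.map.output.key.serializer", RecordIOSerializer.class, Serializer.class);
job.setClass("mapred.reduce.output.value.serializer", ThriftSerializer.class, Serializer.class);
{code}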
> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
> Key: HADOOP-1986
> URL: https://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable
> key-value pairs. While it's possible to write Writable wrappers for other
> serialization frameworks (such as Thrift), this is not very convenient: it
> would be nicer to be able to use arbitrary types directly, without explicit
> wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.