[
https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627948#action_12627948
]
Alex Loddengaard commented on HADOOP-3788:
------------------------------------------
After fiddling with Protocol Buffers (PBs) and reading their documentation, I
believe that actually using PBs may not require the introduction of a new
Serialization class.
PBs work in the following way:
First, the developer defines a .proto file, which is essentially a schema that
describes the type of data the user wishes to deal with. Below is an example
of an addressbook.proto file taken from the PB documentation, located here:
http://code.google.com/apis/protocolbuffers/docs/javatutorial.html
{code:title=addressbook.proto|borderStyle=solid}
package tutorial;

option java_package = "com.example.tutorial";
option java_outer_classname = "AddressBookProtos";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

message AddressBook {
  repeated Person person = 1;
}
{code}
Once the user has defined their .proto file, they run PB's compiler, _protoc_,
to generate an outer Java class accompanied by a few nested classes. Amongst
the generated code are methods to serialize to an OutputStream and deserialize
from an InputStream.
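For illustration, a round trip through the generated classes would look
roughly like the following (a sketch only; the class and method names follow
the PB Java tutorial's generated code for the addressbook.proto above, and
_outputStream_/_inputStream_ are assumed to exist):
{code:title=Using the generated classes (sketch)|borderStyle=solid}
// Build a Person message with the generated builder ...
AddressBookProtos.Person person = AddressBookProtos.Person.newBuilder()
    .setName("John Doe")
    .setId(1234)
    .setEmail("jdoe@example.com")
    .build();

// ... serialize it to an OutputStream ...
person.writeTo(outputStream);

// ... and deserialize it back from an InputStream.
AddressBookProtos.Person parsed =
    AddressBookProtos.Person.parseFrom(inputStream);
{code}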
Refactoring .proto files is somewhat tricky given the way PBs work (see their
documentation for details), so Google recommends wrapping PBs inside other
classes and using them only at the point of serialization and deserialization.
This structure fits perfectly with Hadoop's Writable structure. That is, if
users want to utilize PBs, they use _protoc_ to create a Java class, which is
essentially a Bean. They can then define a new Writable implementation that
uses the _protoc_-generated class to serialize and deserialize. This is all
possible without creating a ProtocolBuffersSerialization class, because the
default Serialization, org.apache.hadoop.io.serializer.WritableSerialization,
delegates its read and write methods to the Writable being serialized or
deserialized.
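To make this concrete, such a Writable wrapper might look like the following
(a sketch only: the name PersonWritable is hypothetical, it assumes the
generated Person class above, and it will not compile without the _protoc_
output and Hadoop on the classpath):
{code:title=PersonWritable (sketch)|borderStyle=solid}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import com.example.tutorial.AddressBookProtos.Person;

// Hypothetical Writable that delegates to a protoc-generated class.
public class PersonWritable implements Writable {
  private Person person;

  public void write(DataOutput out) throws IOException {
    // Length-prefix the PB bytes so readFields knows how much to consume.
    byte[] bytes = person.toByteArray();
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    person = Person.parseFrom(bytes);
  }

  public Person get() { return person; }
  public void set(Person p) { person = p; }
}
{code}
The length prefix matters because DataInput gives the Writable no
end-of-record marker, while parseFrom needs exactly the message's bytes.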
A general ProtocolBuffersSerialization class would not use PBs to their
fullest, because it would have to rely on a very primitive, generalized
.proto file (for example, a file with just one field: a large string).
With that said, a few things can be done with regard to this feature:
# I can create an example that extends Text and overrides its serialization
methods to use a _protoc_-generated class
# I can begin extending Hadoop's Writable implementations to use PBs instead
# I can begin replacing Hadoop's Writable implementations with PB-backed ones
# I can try to create a general ProtocolBuffersSerialization and see how it
performs, though this solution seems to go against the premise of using PBs
# You can tell me that my understanding of PBs is completely wrong (please
follow-up with a more accurate description if this is the case :))
Before option 2 or 3 is decided on, profiling should be done to ensure that
PBs are in fact faster than Java's built-in mechanism. If profiling proves
PBs are faster in all cases, then option 3 seems the most desirable. However,
perhaps more discussion is needed to determine whether 2, 3, or some other
solution altogether is better.
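Such profiling could start with a tiny harness along these lines (a
self-contained sketch: the Codec interface and the sample record are my own
stand-ins, and only the Writable-style side is shown, since the PB side needs
a _protoc_-generated class to plug in):
{code:title=SerializationBenchmark (sketch)|borderStyle=solid}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SerializationBenchmark {

  /** One serialization strategy to be timed. */
  public interface Codec {
    byte[] serialize() throws IOException;
  }

  /** Time n serializations, returning elapsed nanoseconds. */
  public static long time(Codec codec, int n) throws IOException {
    long start = System.nanoTime();
    for (int i = 0; i < n; i++) {
      codec.serialize();
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) throws IOException {
    // Writable-style stand-in: length-prefixed UTF-8 through a
    // DataOutputStream, roughly what Text.write(DataOutput) does.
    Codec writableStyle = new Codec() {
      public byte[] serialize() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF("John Doe,1234,jdoe@example.com");
        out.flush();
        return buf.toByteArray();
      }
    };
    // A PB competitor would plug in here once a protoc-generated
    // class exists, e.g. returning person.toByteArray().
    System.out.println("writable-style: "
        + time(writableStyle, 100000) + " ns");
  }
}
{code}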
Again, I'm totally new here, so please argue with me if I have misunderstood
Hadoop's workings, PBs, or anything else. While I'm waiting for responses, I
can begin working on option 1 to prove my understanding of PBs is correct.
> Add serialization for Protocol Buffers
> --------------------------------------
>
> Key: HADOOP-3788
> URL: https://issues.apache.org/jira/browse/HADOOP-3788
> Project: Hadoop Core
> Issue Type: Wish
> Components: examples, mapred
> Reporter: Tom White
> Assignee: Alex Loddengaard
>
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding
> data in a compact binary format. This issue is to write a
> ProtocolBuffersSerialization to support using Protocol Buffers types in
> MapReduce programs, including an example program. This should probably go
> into contrib.