[
https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627948#action_12627948
]
Alex Loddengaard commented on HADOOP-3788:
------------------------------------------
After fiddling with Protocol Buffers (PBs) and reading their documentation, I
believe that actually using PBs may not require the introduction of a new
Serialization class.
PBs work in the following way:
First, the developer defines a .proto file, which is essentially a schema that
describes the type of data the user wishes to deal with. Below is an example
of an addressbook.proto file taken from the PB documentation, located here:
http://code.google.com/apis/protocolbuffers/docs/javatutorial.html
{code:title=addressbook.proto|borderStyle=solid}
package tutorial;

option java_package = "com.example.tutorial";
option java_outer_classname = "AddressBookProtos";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

message AddressBook {
  repeated Person person = 1;
}
{code}
Once the user has defined their .proto file, they run PB's compiler, _protoc_,
to generate an outer Java class accompanied by a few nested classes. Amongst
the generated code are methods to serialize to an OutputStream and deserialize
from an InputStream.
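For illustration, a round trip through the generated classes would look
roughly like the following (a sketch only; the class and method names follow
the PB Java tutorial's generated code for the addressbook.proto above, and
_outputStream_/_inputStream_ are assumed to exist):
{code:title=Using the generated classes (sketch)|borderStyle=solid}
// Build a Person message with the generated builder ...
AddressBookProtos.Person person = AddressBookProtos.Person.newBuilder()
    .setName("John Doe")
    .setId(1234)
    .setEmail("jdoe@example.com")
    .build();

// ... serialize it to an OutputStream ...
person.writeTo(outputStream);

// ... and deserialize it back from an InputStream.
AddressBookProtos.Person parsed =
    AddressBookProtos.Person.parseFrom(inputStream);
{code}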
Refactoring .proto files is somewhat tricky given the way PBs work (see their
documentation for details), so Google recommends wrapping PBs inside other
classes and using them only at the point of serialization and deserialization.
This structure fits perfectly with Hadoop's Writable structure. That is, if
users want to utilize PBs, they use _protoc_ to create a Java class, which is
essentially a Bean. They can then define a new Writable implementation that
uses the _protoc_-generated class to serialize and deserialize. This is all
possible without creating a ProtocolBuffersSerialization class, because the
default Serialization, org.apache.hadoop.io.serializer.WritableSerialization,
delegates its read and write methods to the Writable being serialized or
deserialized.
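To make this concrete, such a Writable wrapper might look like the following
(a sketch only: the name PersonWritable is hypothetical, it assumes the
generated Person class above, and it will not compile without the _protoc_
output and Hadoop on the classpath):
{code:title=PersonWritable (sketch)|borderStyle=solid}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import com.example.tutorial.AddressBookProtos.Person;

// Hypothetical Writable that delegates to a protoc-generated class.
public class PersonWritable implements Writable {
  private Person person;

  public void write(DataOutput out) throws IOException {
    // Length-prefix the PB bytes so readFields knows how much to consume.
    byte[] bytes = person.toByteArray();
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    person = Person.parseFrom(bytes);
  }

  public Person get() { return person; }
  public void set(Person p) { person = p; }
}
{code}
The length prefix matters because DataInput gives the Writable no
end-of-record marker, while parseFrom needs exactly the message's bytes.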
A general ProtocolBuffersSerialization class would not use PBs to their
fullest, because it would have to rely on a very primitive, generalized
.proto file (for example, a file with just one field: a large string).
With that said, a few things can be done with regard to this feature:
# I can create an example that extends Text and overrides its serialization
methods to use a _protoc_-generated class
# I can begin extending Hadoop's Writable implementations to use PBs instead
# I can begin replacing Hadoop's Writable implementations with PB-backed ones
# I can try to create a general ProtocolBuffersSerialization and see how it
performs, though this solution seems to go against the premise of using PBs
# You can tell me that my understanding of PBs is completely wrong (please
follow-up with a more accurate description if this is the case :))
Before option 2 or 3 is decided on, profiling should be done to ensure that
PBs are in fact faster than Java's built-in mechanism. If profiling proves
PBs are faster in all cases, then option 3 seems the most desirable. However,
perhaps more discussion is needed to determine whether 2, 3, or some other
solution altogether is better.
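Such profiling could start with a tiny harness along these lines (a
self-contained sketch: the Codec interface and the sample record are my own
stand-ins, and only the Writable-style side is shown, since the PB side needs
a _protoc_-generated class to plug in):
{code:title=SerializationBenchmark (sketch)|borderStyle=solid}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SerializationBenchmark {

  /** One serialization strategy to be timed. */
  public interface Codec {
    byte[] serialize() throws IOException;
  }

  /** Time n serializations, returning elapsed nanoseconds. */
  public static long time(Codec codec, int n) throws IOException {
    long start = System.nanoTime();
    for (int i = 0; i < n; i++) {
      codec.serialize();
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) throws IOException {
    // Writable-style stand-in: length-prefixed UTF-8 through a
    // DataOutputStream, roughly what Text.write(DataOutput) does.
    Codec writableStyle = new Codec() {
      public byte[] serialize() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF("John Doe,1234,jdoe@example.com");
        out.flush();
        return buf.toByteArray();
      }
    };
    // A PB competitor would plug in here once a protoc-generated
    // class exists, e.g. returning person.toByteArray().
    System.out.println("writable-style: "
        + time(writableStyle, 100000) + " ns");
  }
}
{code}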
Again, I'm totally new here, so please argue with me if I have misunderstood
Hadoop's workings, PBs, or anything else. While I'm waiting for responses, I
can begin working on option 1 to prove my understanding of PBs is correct.
> Add serialization for Protocol Buffers
> --------------------------------------
>
> Key: HADOOP-3788
> URL: https://issues.apache.org/jira/browse/HADOOP-3788
> Project: Hadoop Core
> Issue Type: Wish
> Components: examples, mapred
> Reporter: Tom White
> Assignee: Alex Loddengaard
>
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding
> data in a compact binary format. This issue is to write a
> ProtocolBuffersSerialization to support using Protocol Buffers types in
> MapReduce programs, including an example program. This should probably go
> into contrib.