[ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932817#action_12932817 ]
Owen O'Malley commented on HADOOP-6685:
---------------------------------------
{quote}
Is there a strong reason to use ProtocolBuffers here rather than Avro, which is
already a dependency and provides similar functionality?
{quote}
It isn't clear which context you mean here:
- Giving the user the ability to use ProtocolBuffers
- Using ProtocolBuffers for the metadata
For the first point, ProtocolBuffers is an extremely well-engineered and
well-documented project. The fit and finish are excellent. It has been finely
honed by years of extensive use in production systems. Providing the capability
to natively run ProtocolBuffers through the pipeline without third-party
add-ons is a big win. See Kevin's presentation about using Protocol Buffers at
Twitter:
http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter
For the second point, Avro is completely unsuitable for that context. For the
serializer's metadata, I need to encode a singleton object. With Avro, I would
need to encode the schema and then the metadata information. To add insult to
injury, the schema will be substantially larger than the data. With
ProtocolBuffers, I just encode the data. The metadata is part of the record. In
other contexts where there are a lot of the same objects being serialized, Avro
is more efficient. This context is very different. As a final point, as I've
told you previously, the Avro setup is very expensive. Writing a 2-row sequence
file is 10x slower with Avro than with ProtocolBuffers.
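
To make the size argument concrete, here is a rough sketch in Python that
hand-encodes a one-field metadata record both ways. It does not use the real
ProtocolBuffers or Avro libraries; it only follows their published wire
encodings. The record name SerializationMetadata, the field name class_name,
and the value org.example.MyWritable are hypothetical, and the sketch assumes
the Avro schema has to travel alongside the data, as described above.

```python
import json


def varint(n):
    """Encode a non-negative integer as a base-128 varint.

    Both ProtocolBuffers and Avro use this byte layout for lengths.
    """
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def zigzag(n):
    """Map a signed 64-bit integer to an unsigned one, as Avro does for longs."""
    return (n << 1) ^ (n >> 63)


def protobuf_style(class_name):
    """Encode one string field: field number 1, wire type 2 (length-delimited)."""
    payload = class_name.encode("utf-8")
    key = (1 << 3) | 2  # field number and wire type packed into one varint
    return varint(key) + varint(len(payload)) + payload


def avro_style(class_name):
    """Return (schema_json, body) for a record with a single string field."""
    # Hypothetical schema, written as compact JSON.
    schema = json.dumps(
        {
            "type": "record",
            "name": "SerializationMetadata",
            "fields": [{"name": "class_name", "type": "string"}],
        },
        separators=(",", ":"),
    ).encode("utf-8")
    payload = class_name.encode("utf-8")
    # An Avro string is a zigzag varint length followed by UTF-8 bytes.
    body = varint(zigzag(len(payload))) + payload
    return schema, body


name = "org.example.MyWritable"  # hypothetical metadata value
pb = protobuf_style(name)
schema, body = avro_style(name)
print("protobuf-style bytes:", len(pb))
print("avro-style schema bytes:", len(schema))
print("avro-style body bytes:", len(body))
```

For a single object the Avro-style body is about the same size as the
ProtocolBuffers-style record, but the schema that accompanies it is several
times larger. When many objects share one schema that cost is amortized, which
is the other case mentioned above.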
I understand that you'd like Avro to be the one and only serialization format
that Hadoop supports, especially since that will help you push the development
of Avro forward. Your forcing Avro on the users is unhealthy for Hadoop.
> Change the generic serialization framework API to use serialization-specific
> bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-6685
> URL: https://issues.apache.org/jira/browse/HADOOP-6685
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Fix For: 0.22.0
>
> Attachments: libthrift.jar, serial.patch, serial4.patch,
> serial6.patch, SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for
> the serialization-specific configuration. Since this data is really internal
> to the specific serialization, I think we should change it to be an opaque
> binary blob. This will simplify the interface for defining specific
> serializations for different contexts (MAPREDUCE-1462). It will also move us
> toward having serialized objects for Mappers, Reducers, etc. (MAPREDUCE-1183).