[ 
https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932817#action_12932817
 ] 

Owen O'Malley commented on HADOOP-6685:
---------------------------------------

{quote}
Is there a strong reason to use ProtocolBuffers here rather than Avro, which is 
already a dependency and provides similar functionality?
{quote}

It isn't clear what context you mean here:
- Giving the user the ability to use ProtocolBuffers
- Using ProtocolBuffers for the metadata

For the first point, ProtocolBuffers is an extremely well-engineered and 
documented project. The fit and finish are excellent. It has been finely honed 
by years of extensive use in production systems. Providing the capability to 
natively run ProtocolBuffers through the pipeline without third-party add-ons 
is a big win. See Kevin's presentation about using Protocol Buffers at Twitter: 
http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter

For the second point, Avro is completely unsuitable for that context. For the 
serializer's metadata, I need to encode a singleton object. With Avro, I would 
need to encode the schema and then the metadata information. To add insult to 
injury, the schema will be substantially larger than the data. With 
ProtocolBuffers, I just encode the data. The metadata is part of the record. In 
other contexts, where many objects of the same type are serialized, Avro 
is more efficient. This context is very different. As a final point, as I've 
told you previously, the Avro setup is very expensive. Writing a two-row 
sequence file is 10x slower with Avro than with ProtocolBuffers.
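
A rough sketch of the size argument above, using only the JDK. Everything in 
it is an assumption made for illustration: the one-field "Meta" schema, the 
className field, and the sample class name are hypothetical, and the encoders 
are hand-rolled approximations of the two wire formats rather than calls into 
the real ProtocolBuffers or Avro libraries. It compares a single tagged record 
against the same value shipped together with its schema text.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only: hand-rolled encoders, not the real protobuf or Avro libraries.
class MetadataSizeSketch {

    // Hypothetical one-field metadata schema, written as Avro-style JSON.
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Meta\",\"fields\":"
        + "[{\"name\":\"className\",\"type\":\"string\"}]}";

    // Base-128 varint: 7 bits per byte, high bit set on all but the last byte.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Tagged encoding of one string field (field number 1, wire type 2):
    // key byte, varint length, UTF-8 bytes. No separate schema is shipped.
    static byte[] encodeProtobufStyle(String className) {
        byte[] utf8 = className.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, (1 << 3) | 2);
        writeVarint(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // Untagged encoding of the same value: zigzag varint length, UTF-8 bytes.
    // A reader needs the schema to interpret these bytes.
    static byte[] encodeAvroStyleDatum(String className) {
        byte[] utf8 = className.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long n = utf8.length;
        writeVarint(out, (n << 1) ^ (n >> 63));
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // Total size when the schema text has to accompany the single datum.
    static int schemaPlusDatumSize(String className) {
        return SCHEMA_JSON.getBytes(StandardCharsets.UTF_8).length
            + encodeAvroStyleDatum(className).length;
    }

    public static void main(String[] args) {
        String name = "org.example.MyKey"; // hypothetical class name
        System.out.println("tagged record bytes:  " + encodeProtobufStyle(name).length);
        System.out.println("schema + datum bytes: " + schemaPlusDatumSize(name));
    }
}
```

The untagged datum alone is slightly smaller than the tagged record, which is 
why the schema cost amortizes well over many records. For a singleton, the 
schema text dominates the total.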

I understand that you'd like Avro to be the one and only serialization format 
that Hadoop supports, especially since that would help you push the development 
of Avro forward. Your forcing Avro on the users is unhealthy for Hadoop.

> Change the generic serialization framework API to use serialization-specific 
> bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>
>         Attachments: libthrift.jar, serial.patch, serial4.patch, 
> serial6.patch, SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for 
> the serialization specific configuration. Since this data is really internal 
> to the specific serialization, I think we should change it to be an opaque 
> binary blob. This will simplify the interface for defining specific 
> serializations for different contexts (MAPREDUCE-1462). It will also move us 
> toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
