[jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration

Doug Cutting (JIRA) Fri, 19 Nov 2010 09:30:42 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933877#action_12933877 ]


Doug Cutting commented on HADOOP-6685:
--------------------------------------

> Getting ProtocolBuffers, Thrift, and Avro types through MapReduce end to end. 
> Obviously this includes supporting SequenceFiles, which are where the bulk of 
> Hadoop data is currently stored.

This does not follow.  We cannot currently pass an object that does not 
implement Writable through the shuffle without wrapping it in a Writable.  
However, we can and do currently support input and output of objects that do 
not implement Writable: RecordReader and RecordWriter do not require Writable.  
So no modifications to SequenceFile are required to permit end-to-end passage 
of non-Writables in MapReduce.

> Supporting context-specific serializations (input key, input value, shuffle 
> key, shuffle value, output key, output value, etc) so that different 
> serialization options can be chosen depending on the application's 
> requirements.

This does not require a binary format, only a metadata format that can be 
somehow nested.  HADOOP-6420 made this possible.
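
As a rough illustration of nesting within a string-to-string map, the sketch 
below stores per-context metadata under key prefixes.  NestedMetadata and the 
prefix names are hypothetical and chosen only for this example; this is not 
the HADOOP-6420 implementation.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: nests per-context metadata inside one flat string map. */
final class NestedMetadata {
    private NestedMetadata() {}

    /** Copies every entry of child into parent under the key prefix + "." + key. */
    static void put(Map<String, String> parent, String prefix,
                    Map<String, String> child) {
        for (Map.Entry<String, String> e : child.entrySet()) {
            parent.put(prefix + "." + e.getKey(), e.getValue());
        }
    }

    /** Returns the entries stored under prefix, with the prefix stripped. */
    static Map<String, String> get(Map<String, String> parent, String prefix) {
        String p = prefix + ".";
        Map<String, String> child = new TreeMap<>();
        for (Map.Entry<String, String> e : parent.entrySet()) {
            if (e.getKey().startsWith(p)) {
                child.put(e.getKey().substring(p.length()), e.getValue());
            }
        }
        return child;
    }
}
```

With this layout, a shuffle key and an output value can each carry their own 
serialization metadata (for example under the prefixes "shuffle.key" and 
"output.value") while the enclosing map stays a plain Map<String,String>.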

> This worked, but was very ugly. It led to "stringly-typed" interfaces where 
> you needed to read all of the code to figure out what the legal values for 
> the configuration were.

This sounds like a documentation issue, not a functional deficiency.  This 
style is used consistently throughout Hadoop.  If we seek to replace 
Configuration, that should perhaps be considered wholesale rather than 
piecemeal.

> By making the framework use typed metadata instead of the very generic, but 
> type-less, string to string map many user errors will be avoided.

The current style is to provide methods to access configurations and metadata.  
These methods prevent such type errors.  I have not seen a large number of 
complaints from end users about this aspect of Hadoop.
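
A minimal sketch of that accessor style, assuming a hypothetical 
SerializationMetadata class and made-up key names (neither is part of Hadoop): 
callers use typed setters and getters, and the string encoding stays private 
to the class.

```java
import java.util.Map;

/** Hypothetical typed view over string-to-string serialization metadata. */
final class SerializationMetadata {
    // Illustrative key names; these are assumptions, not real Hadoop properties.
    private static final String CLASS_KEY = "serialization.class";
    private static final String COMPRESS_KEY = "serialization.compress";

    private final Map<String, String> raw;

    SerializationMetadata(Map<String, String> raw) {
        this.raw = raw;
    }

    /** Stores the class by name, so callers never write the string themselves. */
    void setSerializedClass(Class<?> c) {
        raw.put(CLASS_KEY, c.getName());
    }

    /** Loads the stored class, failing loudly if the entry is missing or invalid. */
    Class<?> getSerializedClass() {
        String name = raw.get(CLASS_KEY);
        if (name == null) {
            throw new IllegalStateException(CLASS_KEY + " is not set");
        }
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Unknown class: " + name, e);
        }
    }

    void setCompressed(boolean compressed) {
        raw.put(COMPRESS_KEY, Boolean.toString(compressed));
    }

    /** Returns false when the entry is absent. */
    boolean isCompressed() {
        return Boolean.parseBoolean(raw.get(COMPRESS_KEY));
    }
}
```

The underlying map remains a Map<String,String>, so it can still be carried in 
a job's configuration unchanged; only the accessor class knows the key names 
and value encodings.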

> The indication that he gave when I gave the presentation on my plan 5 months 
> ago was that he didn't like it, but wouldn't block it. He reiterated that 
> position on this jira 6 days ago. Have you changed your mind, Doug?

I had hoped that not threatening a veto but rather providing strong criticism 
would elicit compromise and collaboration.  Unfortunately, it seems to have 
achieved the opposite.  I am sorry to learn that this strategy has failed and, 
yes, I am now leaning towards a veto of this issue.

> Bootstrapping wasn't a problem at all.

Bootstrapping a generic serialization system by requiring a particular 
serialization system is a bootstrapping problem.

> The change to the clients is the same size, regardless of whether the 
> metadata is encoded in binary or string to string maps.

That's not true.  If clients already use a Map<String,String> like 
Configuration (as jobs do), then no change is required.


> Change the generic serialization framework API to use serialization-specific 
> bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>
>         Attachments: libthrift.jar, serial.patch, serial4.patch, 
> serial6.patch, serial7.patch, SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for 
> the serialization specific configuration. Since this data is really internal 
> to the specific serialization, I think we should change it to be an opaque 
> binary blob. This will simplify the interface for defining specific 
> serializations for different contexts (MAPREDUCE-1462). It will also move us 
> toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
