[
https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933877#action_12933877
]
Doug Cutting commented on HADOOP-6685:
--------------------------------------
> Getting ProtocolBuffers, Thrift, and Avro types through MapReduce end to end.
> Obviously this includes supporting SequenceFiles, which are where the bulk of
> Hadoop data is currently stored.
This does not follow. We cannot currently pass an object that does not
implement Writable through the shuffle without wrapping it in a Writable.
However, we can and do currently support input and output of objects that do not
implement Writable: RecordReader and RecordWriter do not require Writable. So
no modifications to SequenceFile are required to permit end-to-end passage of
non-Writables in MapReduce.
> Supporting context-specific serializations (input key, input value, shuffle
> key, shuffle value, output key, output value, etc) so that different
> serialization options can be chosen depending on the application's requirements.
This does not require a binary format, only a metadata format that can be
somehow nested. HADOOP-6420 made this possible.
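To illustrate what "nested" can mean for a flat string-to-string map, here is a
minimal sketch that nests per-context metadata by key prefix. The class name,
method name, and key names are hypothetical and are not Hadoop APIs; this is
not necessarily how HADOOP-6420 implements it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class NestedMetadata {
    /**
     * Returns the entries of conf whose keys start with prefix + ".",
     * with that prefix stripped, so each context sees only its own metadata.
     */
    static Map<String, String> subset(Map<String, String> conf, String prefix) {
        String start = prefix + ".";
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(start)) {
                result.put(e.getKey().substring(start.length()), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical keys: one flat map carries metadata for several contexts.
        Map<String, String> conf = new HashMap<>();
        conf.put("example.shuffle.key.serialization", "avro");
        conf.put("example.shuffle.key.schema", "\"string\"");
        conf.put("example.output.value.serialization", "thrift");

        // Only the shuffle-key entries are returned, keyed by their short names.
        System.out.println(subset(conf, "example.shuffle.key"));
    }
}
```

Each serialization would then read only its own sub-map, while the job as a
whole still carries a single Map<String,String>.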
> This worked, but was very ugly. It led to "stringly-typed" interfaces where
> you needed to read all of the code to figure out what the legal values for
> the configuration were.
This sounds like a documentation issue, not a functional deficiency. This
style is used consistently throughout Hadoop. If we seek to replace
Configuration, that should perhaps be considered wholesale rather than piecemeal.
> By making the framework use typed metadata instead of the very generic, but
> type-less, string to string map many user errors will be avoided.
The current style is to provide methods to access configurations and metadata.
These methods prevent such type errors. I have not seen a large number of
complaints from end users about this aspect of Hadoop.
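A minimal sketch of this accessor style, assuming a plain Map<String,String>
as the backing store. The class name, key name, and methods are hypothetical
and only illustrate the pattern; they are not existing Hadoop methods.

```java
import java.util.HashMap;
import java.util.Map;

public class TypedAccessors {
    // Hypothetical key; callers never need to spell it out themselves.
    private static final String BUFFER_SIZE_KEY = "example.io.buffer.size";

    /** Stores the buffer size; the int parameter rules out non-numeric values. */
    static void setBufferSize(Map<String, String> conf, int bytes) {
        if (bytes <= 0) {
            throw new IllegalArgumentException("buffer size must be positive: " + bytes);
        }
        conf.put(BUFFER_SIZE_KEY, Integer.toString(bytes));
    }

    /** Reads the buffer size, falling back to defaultBytes when the key is unset. */
    static int getBufferSize(Map<String, String> conf, int defaultBytes) {
        String raw = conf.get(BUFFER_SIZE_KEY);
        return raw == null ? defaultBytes : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setBufferSize(conf, 4096);
        System.out.println(getBufferSize(conf, 1024));
    }
}
```

The underlying storage stays a string-to-string map, but user code only touches
the typed methods, so the key spelling and the value type are checked in one place.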
> The indication that he gave when I gave the presentation on my plan 5 months
> ago was that he didn't like it, but wouldn't block it. He reiterated that
> position on this jira 6 days ago. Have you changed your mind, Doug?
I had hoped that not threatening a veto but rather providing strong criticism
would elicit compromise and collaboration. It seems to have unfortunately
achieved the opposite. I am sorry to learn that this strategy has failed and,
yes, I am now leaning towards a veto of this issue.
> Bootstrapping wasn't a problem at all.
Bootstrapping a generic serialization system by requiring a particular
serialization system is a bootstrapping problem.
> The change to the clients is the same size, regardless of whether the
> metadata is encoded in binary or string to string maps.
That's not true. If clients already use a Map<String,String> like
Configuration (as jobs do) then no change is required.
> Change the generic serialization framework API to use serialization-specific
> bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-6685
> URL: https://issues.apache.org/jira/browse/HADOOP-6685
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Fix For: 0.22.0
>
> Attachments: libthrift.jar, serial.patch, serial4.patch,
> serial6.patch, serial7.patch, SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for
> the serialization specific configuration. Since this data is really internal
> to the specific serialization, I think we should change it to be an opaque
> binary blob. This will simplify the interface for defining specific
> serializations for different contexts (MAPREDUCE-1462). It will also move us
> toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).