[
https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534055
]
Vivek Ratan commented on HADOOP-1986:
-------------------------------------
No, Approach 1, as I've defined it in my previous comment, is *NOT* what you're
proposing. Approach 1 does not require any DDLs, it does not instantiate
_Serializer_ objects for different types, there is no _Writable_, no
_ThriftRecord_. The confusion/disagreement perhaps stems from the fact that
there are two different but related issues being discussed here (maybe we need
separate Jiras for each, but I think they're related enough to be discussed
together). One issue concerns how we integrate various serialization
platforms into the system, i.e., what the interface looks like to the user; the
other has more to do with implementation/configuration (this is probably
not a very clean demarcation, but it seems pretty valid to me).
A lot of initial comments assumed that we would have a serialization interface
based on type, so that you could have serializers for different types. These
types would usually be base types for each serialization platform, but they
also might be more concrete types. What I'm suggesting in Approach 1 is a
different way to look at this. Approach 1 does not care whether your platform
has a base class or not. It uses reflection to walk through a class structure
and interacts with the serialization platform at the level of
serializing/deserializing basic types (ints, longs, strings, etc), which each
serialization platform provides. Approach 2 is the one that may require you to
create Serializers for base classes for each platform (one for
_ThriftRecord_, one for a Jute record, and so on), and that seems closer to your
examples.
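To make the reflective walk concrete, here's a rough sketch of what an Approach 1 serializer might look like. Everything here is hypothetical (the class names, the example _Point_ record, the set of basic types handled); the point is only that the user's class needs no DDL, no base class, and no hand-written serialization code:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

public class ReflectiveSerializer {
    // Example user-defined value class: no Writable, no DDL, no base class.
    public static class Point {
        int x = 3;
        int y = 4;
        String label = "p";
    }

    // Walk the object's fields via reflection; write basic types directly,
    // recursing into nested record-like members.
    public static void serialize(Object obj, DataOutput out) throws Exception {
        for (Field f : obj.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers())) continue;
            f.setAccessible(true);
            Class<?> t = f.getType();
            if (t == int.class) out.writeInt(f.getInt(obj));
            else if (t == long.class) out.writeLong(f.getLong(obj));
            else if (t == String.class) out.writeUTF((String) f.get(obj));
            else serialize(f.get(obj), out); // recurse into nested records
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        serialize(new Point(), new DataOutputStream(bos));
        // 4 + 4 bytes for the two ints, 2 (length) + 1 bytes for the string
        System.out.println(bos.size());
    }
}
```

A real version would hand the basic-type writes to whichever serialization platform the user configured, rather than a raw DataOutput.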
[I've sorta waved my hand on how you would configure, or have the user choose
between, various serializers, especially in Approach 2. A lot of your comments,
and those of Tom and Doug, seem to me to focus on this issue. ]
The reason I harp on the two approaches (Approach 1 and Approach 2) is that
they are, to me, quite different. There is a clear tradeoff between usability
and performance. Approach 1 favors the former, Approach 2 the latter. Approach
1 is really easy to use. No DDLs and very little for the user to do. However,
as I had mentioned earlier, and as Doug's comments seem to indicate, there is a
real danger of its performance being slow. I don't have an idea of how slow.
Anybody know how expensive introspection can be (I'm sure it also depends on
how deeply nested a class is or how many member variables it has, and so on)?
I think we should support both approaches. It seems quite reasonable to me that
there will be users who want to define their own key or value classes, don't
want to write serialization/deserialization code for them, don't want to define
DDLs or install Thrift or run the Jute compiler, and don't mind paying the
extra penalty for introspection. All they need to do is define their Key or
Value class, and pick a serialization platform (Record I/O or Thrift or
Writable or whatever) through some simple config option. Wherever we
serialize/deserialize in the Map/Reduce code (in SequenceFile, or in the Output
Collector), the code simply calls _Serializer.serialize()_, which accepts any
Object type. Again, no DDL, nothing. But if you need better performance, or you
want to use some fancy DDL feature (such as marking fields as optional or
having default values, or even versioning), then you have to support Approach
2, which requires the key and value classes to be defined in DDLs, compiled,
and integrated. We don't use these extra DDL features for basic serialization
yet, but it's quite reasonable to expect users to want support for them in the
near future.
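The interface the Map/Reduce code would see could be as small as the sketch below. All names here are assumptions for illustration (this is not an existing Hadoop API): a single _Serializer_ entry point accepting any Object, with the concrete implementation picked through a simple config option.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class SerializerDemo {
    // The only thing SequenceFile / OutputCollector code would ever call.
    interface Serializer {
        void serialize(Object o, DataOutput out) throws IOException;
    }

    // One possible implementation; a stand-in for delegating to a real
    // platform such as Record I/O, Thrift, or Writable.
    static class Utf8Serializer implements Serializer {
        public void serialize(Object o, DataOutput out) throws IOException {
            out.writeUTF(o.toString());
        }
    }

    public static void main(String[] args) throws IOException {
        // In practice the concrete class would be looked up from a config
        // option rather than instantiated directly.
        Serializer s = new Utf8Serializer();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        s.serialize("key", new DataOutputStream(bos));
        System.out.println(bos.size()); // 2-byte length + 3 bytes of "key"
    }
}
```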
Maybe what we should do is actually measure the performance implication of
introspection. A generic serializer/deserializer for Approach 1 shouldn't be
hard to write and we can compare its performance to that for a DDL-generated
class. If the difference is acceptable, it's much simpler to provide just
Approach 1. If not, we could either provide both approaches, or just provide
Approach 2, wait till enough people complain that it's hard, and then optionally
provide Approach 1.
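A micro-benchmark for the introspection question wouldn't be hard either. The sketch below only shows the shape of the measurement (direct field reads vs. reflective ones); absolute numbers will vary by JVM and by how deeply nested the class is, so treat it as a starting point, not a result:

```java
import java.lang.reflect.Field;

public class IntrospectionBench {
    public static class Rec { public int a = 42; }

    public static void main(String[] args) throws Exception {
        Rec r = new Rec();
        Field f = Rec.class.getField("a");
        f.setAccessible(true);
        int n = 1_000_000;
        long sum = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += r.a;         // direct access
        long direct = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += f.getInt(r); // reflective access
        long reflective = System.nanoTime() - t0;

        System.out.println("direct(ns)=" + direct
                + " reflective(ns)=" + reflective);
    }
}
```

A fairer comparison for this Jira would pit the full reflective serializer against a DDL-generated class's serialize() over realistic key/value shapes.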
If we do use Approach 2, we will need something that handles the mapping
between a class and its serializer, and I think your (Owen's) suggestion is
fine. I haven't offered any alternate solution.
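For what it's worth, the class-to-serializer mapping could be as simple as the registry sketched below. The names are hypothetical; the only design point is the fallback to the nearest registered superclass, so that one entry for a platform base class (e.g. _ThriftRecord_) covers all generated subclasses:

```java
import java.util.HashMap;
import java.util.Map;

public class SerializerRegistry {
    private final Map<Class<?>, String> byClass = new HashMap<>();

    public void register(Class<?> c, String serializerName) {
        byClass.put(c, serializerName);
    }

    // Fall back to the nearest registered superclass when the exact
    // class isn't registered, so one base-class entry per platform suffices.
    public String lookup(Class<?> c) {
        for (Class<?> k = c; k != null; k = k.getSuperclass()) {
            String s = byClass.get(k);
            if (s != null) return s;
        }
        return null;
    }
}
```

Registering against interfaces (or reading the mapping from config) would be straightforward extensions.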
On to Doug's comments:
> If we discard the DDL and code-generation, then we're stuck with
> introspection, no?
Yes. No DDLs and no code generation implies Approach 1, and hence
introspection. DDLs and code-generation implies Approach 2, and hence no
introspection.
> Finally, if we keep the DDL and generate only the class, not its serializers,
> then there could theoretically be compatibility issues with other languages.
> If, for example, the DDL defines different types that map to the same type in
> Java (short versus character?) then using introspection could cause problems.
Why would you want to do this? The only benefit of DDL is serializers. I don't
understand the use case here.
> Do I worry too much?
:) Introspection performance is a real worry, but we should be able to test it
out, and perhaps also get enough anecdotal evidence.
> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
> Key: HADOOP-1986
> URL: https://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable
> key-value pairs. While it's possible to write Writable wrappers for other
> serialization frameworks (such as Thrift), this is not very convenient: it
> would be nicer to be able to use arbitrary types directly, without explicit
> wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.