[jira] Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce

Vivek Ratan (JIRA) Thu, 11 Oct 2007 07:43:42 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534055 ]


Vivek Ratan commented on HADOOP-1986:
-------------------------------------

No, Approach 1, as I've defined it in my previous comment, is *NOT* what you're 
proposing. Approach 1 does not require any DDLs, it does not instantiate 
_Serializer_ objects for different types, there is no _Writable_, no 
_ThriftRecord_. The confusion/disagreement perhaps stems from the fact that 
there are two different but related issues being discussed here (maybe we need 
separate Jiras for each, but I think they're related enough to be discussed 
together). One issue is how we integrate various serialization platforms 
into the system, i.e., what the interface looks like to the user, and 
the other has more to do with implementation/configuration (this is probably 
not a very clean demarcation, but it seems pretty valid to me). 

A lot of initial comments assumed that we would have a serialization interface 
based on type, so that you could have serializers for different types. These 
types would usually be base types for each serialization platform, but they 
also might be more concrete types. What I'm suggesting in Approach 1 is a 
different way to look at this. Approach 1 does not care whether your platform 
has a base class or not. It uses reflection to walk through a class structure 
and interacts with the serialization platform at the level of 
serializing/deserializing basic types (ints, longs, strings, etc), which each 
serialization platform provides. Approach 2 is the one that needs you to 
perhaps create Serializers for base classes for each platform (one for 
_ThriftRecord_, one for Jute record, and so on), and that seems closer to your 
examples. 
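
To make Approach 1 concrete, here is a minimal sketch of a reflection-based 
serializer. It is only an illustration under stated assumptions, not code from 
any patch on this issue: _PrimitiveWriter_, _DataOutputWriter_, 
_ReflectiveSerializer_ and _PageVisit_ are all hypothetical names, and the 
"platform" here is simply java.io.DataOutputStream standing in for whatever 
basic-type encoding Record I/O or Thrift would provide. It handles only int, 
long, String and nested user classes, and does not handle nulls. 

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical: the basic-type operations a serialization platform exposes.
interface PrimitiveWriter {
    void writeInt(int v) throws IOException;
    void writeLong(long v) throws IOException;
    void writeString(String v) throws IOException;
}

// Hypothetical platform binding backed by java.io.DataOutputStream.
class DataOutputWriter implements PrimitiveWriter {
    private final DataOutputStream out;
    DataOutputWriter(DataOutputStream out) { this.out = out; }
    public void writeInt(int v) throws IOException { out.writeInt(v); }
    public void writeLong(long v) throws IOException { out.writeLong(v); }
    public void writeString(String v) throws IOException { out.writeUTF(v); }
}

class ReflectiveSerializer {
    // Walks the fields of obj and hands each basic value to the writer.
    // No DDL and no generated code: the class structure is the schema.
    static void serialize(Object obj, PrimitiveWriter w)
            throws IOException, IllegalAccessException {
        Field[] fields = obj.getClass().getDeclaredFields();
        // getDeclaredFields() promises no particular order, so sort by name
        // to get a stable wire layout.
        Arrays.sort(fields, Comparator.comparing(Field::getName));
        for (Field f : fields) {
            int mods = f.getModifiers();
            if (Modifier.isStatic(mods) || Modifier.isTransient(mods)) {
                continue; // not part of the object's serialized state
            }
            f.setAccessible(true);
            Class<?> t = f.getType();
            if (t == int.class) {
                w.writeInt(f.getInt(obj));
            } else if (t == long.class) {
                w.writeLong(f.getLong(obj));
            } else if (t == String.class) {
                w.writeString((String) f.get(obj));
            } else {
                serialize(f.get(obj), w); // nested user class: recurse
            }
        }
    }
}

// Hypothetical user-defined key class: a plain class, nothing to implement.
class PageVisit {
    String url;
    int count;
    long ts;
    PageVisit(String url, int count, long ts) {
        this.url = url;
        this.count = count;
        this.ts = ts;
    }
}
```

The user writes only _PageVisit_; switching platforms would mean supplying a 
different _PrimitiveWriter_. The per-field reflective calls inside the loop are 
where the introspection cost would come from. 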

[I've sorta waved my hand on how you would configure, or have the user choose 
between, various serializers, especially in Approach 2. A lot of your comments, 
and those of Tom and Doug, seem to me to focus on this issue.]

The reason I harp on the two approaches (Approach 1 and Approach 2) is that 
they are, to me, quite different. There is a clear tradeoff between usability 
and performance. Approach 1 favors the former, Approach 2 the latter. Approach 
1 is really easy to use. No DDLs and very little for the user to do. However, 
as I had mentioned earlier, and as Doug's comments seem to indicate, there is a 
real danger of it being slow. I don't have a good idea of how slow. 
Anybody know how expensive introspection can be (I'm sure it also depends on 
how deeply nested a class is or how many member variables it has, and so on)? 

I think we should support both approaches. It seems quite reasonable to me that 
there will be users who want to define their own key or value classes, don't 
want to write serialization/deserialization code for them, don't want to define 
DDLs or install Thrift or run the Jute compiler, and don't mind paying the 
extra penalty for introspection. All they need to do is define their Key or 
Value class, and pick a serialization platform (Record I/O or Thrift or 
Writable or whatever) through some simple config option. Wherever we 
serialize/deserialize in the Map/Reduce code (in SequenceFile, or in the Output 
Collector), the code simply calls _Serializer.serialize()_, which accepts any 
Object type. Again, no DDL, nothing. But if you need better performance, or you 
want to use some fancy DDL feature (such as marking fields as optional or 
having default values, or even versioning), then you have to support Approach 
2, which requires the key and value classes to be defined in DDLs, compiled, 
and integrated. We don't use these extra DDL features for basic serialization 
yet, but it's quite reasonable to expect users to want support for them in the 
near future.  

Maybe what we should do is actually measure the performance implication of 
introspection. A generic serializer/deserializer for Approach 1 shouldn't be 
hard to write and we can compare its performance to that for a DDL-generated 
class. If the difference is acceptable, it's much simpler to provide just 
Approach 1. If not, we could either provide both approaches or just provide 
Approach 2, wait till enough people complain that it's hard, and optionally 
provide Approach 1. 
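
A first, very crude measurement could look something like the sketch below. 
It compares reading two int fields directly with reading them through 
java.lang.reflect.Field. The class names (_Point_, _IntrospectionTiming_) are 
made up for illustration, and a System.nanoTime() loop like this is easily 
distorted by JIT compilation, so the numbers would only be a rough hint; a 
proper harness such as JMH would be needed for anything we want to rely on. 

```java
import java.lang.reflect.Field;

// Hypothetical record-like class with two basic fields.
class Point {
    int x = 1;
    int y = 2;
}

class IntrospectionTiming {
    // Baseline: plain field access, as generated code would do.
    static long sumDirect(Point p, int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += p.x + p.y;
        }
        return sum;
    }

    // Same work through reflection. Field lookup is done once, outside the
    // loop, so the loop measures only the per-access reflective cost.
    static long sumReflective(Point p, int iterations)
            throws ReflectiveOperationException {
        Field fx = Point.class.getDeclaredField("x");
        Field fy = Point.class.getDeclaredField("y");
        fx.setAccessible(true);
        fy.setAccessible(true);
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += fx.getInt(p) + fy.getInt(p);
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        Point p = new Point();
        int n = 10_000_000;
        // Warm-up pass so both paths are compiled before timing.
        sumDirect(p, n);
        sumReflective(p, n);

        long t0 = System.nanoTime();
        long a = sumDirect(p, n);
        long t1 = System.nanoTime();
        long b = sumReflective(p, n);
        long t2 = System.nanoTime();

        System.out.println("direct:     " + (t1 - t0) / 1_000_000 + " ms (sum " + a + ")");
        System.out.println("reflective: " + (t2 - t1) / 1_000_000 + " ms (sum " + b + ")");
    }
}
```

A real comparison would of course serialize actual records (nested classes, 
strings, many fields) against a DDL-generated class, not just sum two ints. 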

If we do use Approach 2, we will need something that handles the mapping 
between a class and its serializer, and I think your (Owen's) suggestion is 
fine. I haven't offered any alternate solution. 
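
Just to illustrate what "mapping between a class and its serializer" could 
mean in Approach 2, here is one possible shape. This is my own sketch and not 
necessarily what Owen proposed: _Serializer_ here is a hypothetical 
one-method interface, _SerializerRegistry_, _DemoRecord_ and _UserKey_ are 
invented names, and _DemoRecord_ stands in for a platform base class such as 
the Thrift or Jute record base. 

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical serializer interface, one instance per platform base type.
interface Serializer<T> {
    byte[] serialize(T value);
}

class SerializerRegistry {
    // Insertion-ordered so that earlier registrations win on lookup.
    private final Map<Class<?>, Serializer<?>> byBase = new LinkedHashMap<>();

    <T> void register(Class<T> base, Serializer<T> serializer) {
        byBase.put(base, serializer);
    }

    // Returns the first registered serializer whose base type is a
    // supertype of (or the same as) the requested class.
    @SuppressWarnings("unchecked")
    <T> Serializer<T> lookup(Class<T> c) {
        for (Map.Entry<Class<?>, Serializer<?>> e : byBase.entrySet()) {
            if (e.getKey().isAssignableFrom(c)) {
                return (Serializer<T>) e.getValue();
            }
        }
        throw new IllegalArgumentException(
                "No serializer registered for " + c.getName());
    }
}

// Hypothetical platform base class; generated classes would extend it and
// carry their own encoding logic.
abstract class DemoRecord {
    abstract String encode();
}

// Hypothetical key class, as a DDL compiler might generate it.
class UserKey extends DemoRecord {
    private final String id;
    UserKey(String id) { this.id = id; }
    String encode() { return id; }
}

class RegistryDemo {
    public static void main(String[] args) {
        SerializerRegistry registry = new SerializerRegistry();
        registry.register(DemoRecord.class,
                r -> r.encode().getBytes(StandardCharsets.UTF_8));
        byte[] bytes = registry.lookup(UserKey.class).serialize(new UserKey("k1"));
        System.out.println(bytes.length + " bytes");
    }
}
```

The framework would do the lookup once per key or value class, so no 
introspection of fields is needed on the serialization path. 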

On to Doug's comments: 

> If we discard the DDL and code-generation, then we're stuck with 
> introspection, no?

Yes. No DDLs and no code generation implies Approach 1, and hence 
introspection. DDLs and code-generation implies Approach 2, and hence no 
introspection. 

> Finally, if we keep the DDL and generate only the class, not its serializers, 
> then there could theoretically be compatibility issues with other languages.  
> If, for example, the DDL defines different types that map to the same type in 
> Java (short versus character?) then using introspection could cause problems.

Why would you want to do this? The only benefit of a DDL is the generated 
serializers. I don't understand the use case here. 

> Do I worry too much?

:) Introspection performance is a real worry, but we should be able to test it 
out, and perhaps also get enough anecdotal evidence. 

> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
>                 Key: HADOOP-1986
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1986
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.16.0
>
>         Attachments: SerializableWritable.java
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable 
> key-value pairs. While it's possible to write Writable wrappers for other 
> serialization frameworks (such as Thrift), this is not very convenient: it 
> would be nicer to be able to use arbitrary types directly, without explicit 
> wrapping and unwrapping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
