Serialization framework should use SequenceFile/TFile/other metadata to 
instantiate deserializer
-----------------------------------------------------------------------------------------

                 Key: HADOOP-4243
                 URL: https://issues.apache.org/jira/browse/HADOOP-4243
             Project: Hadoop Core
          Issue Type: Improvement
          Components: contrib/serialization
            Reporter: Pete Wyckoff


SequenceFile metadata is useful for storing additional information about the 
serialized data: for RecordIO, for example, whether the data is CSV or Binary.  
The same applies to Thrift (Binary, JSON, ...).

For Hive, this may be especially important, because it has a dynamic, generic 
serializer/deserializer that takes its DDL at runtime (as opposed to RecordIO 
and Thrift, which require pre-compilation into a specific class whose name can 
be stored as the sequence file key or value class).  In this case, the class 
name is like Record.java in RecordIO: it tells you nothing without the DDL.

One way to address this would be to pass the sequence file metadata to the 
getDeserializer call in the Serialization interface.  The API would then be 
something like getDeserializer(Class<?>, Map<Text, Text> metadata), or the 
metadata could be passed as a Properties object instead.
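
As a rough sketch of what that API shape could look like (this is not Hadoop 
code: MetaDeserializer, MetaSerialization, RowSerialization and the "format" 
key are hypothetical names, Map<String, String> stands in for Map<Text, Text> 
so the example compiles on its own, and the NUL-separated "binary" layout is 
only a toy):

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical stand-in for a deserializer; not Hadoop's Deserializer interface.
interface MetaDeserializer<T> {
    T deserialize(byte[] bytes);
}

// Hypothetical variant of Serialization whose getDeserializer also
// receives the metadata read from the file header.
interface MetaSerialization<T> {
    MetaDeserializer<T> getDeserializer(Class<T> c, Map<String, String> metadata);
}

// Toy serialization for rows of strings. The wire format is chosen from
// the metadata, not from the class name alone.
final class RowSerialization implements MetaSerialization<String[]> {
    static final String FORMAT_KEY = "format"; // assumed metadata key

    @Override
    public MetaDeserializer<String[]> getDeserializer(
            Class<String[]> c, Map<String, String> metadata) {
        // Fall back to "binary" when the file carries no format entry.
        String format = metadata.getOrDefault(FORMAT_KEY, "binary");
        switch (format) {
            case "csv":
                // Fields separated by commas.
                return bytes ->
                    new String(bytes, StandardCharsets.UTF_8).split(",", -1);
            case "binary":
                // Toy layout: fields separated by a NUL byte.
                return bytes ->
                    new String(bytes, StandardCharsets.UTF_8).split("\0", -1);
            default:
                throw new IllegalArgumentException("unknown format: " + format);
        }
    }
}
```

A reader would then call something like 
new RowSerialization().getDeserializer(String[].class, fileMetadata), where 
fileMetadata is whatever it found in the file header.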

But I am open to other proposals.

This also means that knowing a class implements Writable is not necessarily 
enough to deserialize it, since the deserializer may behave differently 
depending on the metadata - e.g., RecordIO might use the metadata to choose 
CSV rather than the default Binary deserialization.

There is also the issue of getSerializer returning the metadata to be written 
to the SequenceFile/TFile.
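
For the write side, one possible shape (again only a sketch: MetaSerializer, 
CsvRowSerializer, getMetadata and the "format" key are hypothetical and do 
not exist in Hadoop) is for the serializer itself to expose the metadata that 
the file writer should store in the header:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical serializer that can also describe its own output format.
interface MetaSerializer<T> {
    byte[] serialize(T value);

    // Metadata the file writer would copy into the file header.
    Map<String, String> getMetadata();
}

// Toy CSV serializer for rows of strings (no quoting or escaping).
final class CsvRowSerializer implements MetaSerializer<String[]> {
    @Override
    public byte[] serialize(String[] fields) {
        return String.join(",", fields).getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public Map<String, String> getMetadata() {
        // "format" is an assumed key; a reader would look it up on open.
        return Map.of("format", "csv");
    }
}
```

The writer would call getMetadata() once when creating the file, so that a 
later getDeserializer call receives the same entries back.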



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
