[ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-6165:
------------------------------

    Attachment: HADOOP-6165.patch

Here's a patch with some ideas about how to go about this. It is very 
preliminary.



1. One of the problems is that Serialization/Serializer/Deserializer are all 
interfaces, which makes it difficult to evolve them. One way to manage this is 
to introduce Base{Serialization,Serializer,Deserializer} abstract classes that 
implement the corresponding interfaces. SerializationFactory will read the 
io.serializations configuration property; if a serialization extends 
BaseSerialization it will be used directly, while if it is a (legacy) 
Serialization it will be wrapped in a BaseSerialization. The trick here is to 
put legacy Serializations at the end of the list, since they have less 
metadata and are therefore less discriminating.

The Serialization/Serializer/Deserializer interfaces are all deprecated and can 
be removed in a future release, leaving only 
Base{Serialization,Serializer,Deserializer}.
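
To make this concrete, here is a rough sketch of the wrapping idea. It is not 
the shape of the attached patch: the adapter name and the two-argument accept 
signature (discussed in 2 below) are illustrative only.

import java.util.Collections;
import java.util.Map;

// Simplified shape of the legacy interface, for illustration only.
interface Serialization<T> {
  boolean accept(Class<?> c);
}

// The new abstract base class implements the legacy interface so that
// existing callers keep working; serializer/deserializer factory
// methods are elided.
abstract class BaseSerialization<T> implements Serialization<T> {

  // The new metadata-aware check (2 below discusses the exact
  // signature).
  public abstract boolean accept(Class<?> c, Map<String, String> metadata);

  // Route the legacy check through the new one, with empty metadata.
  public boolean accept(Class<?> c) {
    return accept(c, Collections.<String, String>emptyMap());
  }
}

// Hypothetical adapter the factory could use to treat a legacy
// Serialization as a BaseSerialization. A legacy implementation only
// understands the class and ignores all other metadata, which is why
// it belongs at the end of the factory's list.
class LegacySerializationAdapter<T> extends BaseSerialization<T> {
  private final Serialization<T> legacy;

  LegacySerializationAdapter(Serialization<T> legacy) {
    this.legacy = legacy;
  }

  public boolean accept(Class<?> c, Map<String, String> metadata) {
    return legacy.accept(c);
  }
}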

2. In addition to the Map<String, String> metadata, do we need to keep the 
class metadata? That is, do we need

public abstract boolean accept(Class<?> c, Map<String, String> metadata);

or is the following sufficient?

public abstract boolean accept(Map<String, String> metadata); 

We could have a "class" entry in the map which stores this information, but 
we'd have to convert it to a Class object to do the isAssignableFrom check that 
some serializations need to do, e.g. Writable.class.isAssignableFrom(c). 
Perhaps this is OK.
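
For illustration, the one-argument form could recover the check from the 
"class" entry like this (the key name is just an assumed convention):

import java.util.Map;

import org.apache.hadoop.io.Writable;

// Sketch of accept(Map) for a Writable-style serialization: the class
// name has to be resolved back to a Class object before the
// isAssignableFrom check can be applied.
class WritableMetadataCheck {
  public boolean accept(Map<String, String> metadata) {
    String className = metadata.get("class");
    if (className == null) {
      return false;
    }
    try {
      return Writable.class.isAssignableFrom(Class.forName(className));
    } catch (ClassNotFoundException e) {
      return false;
    }
  }
}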

3. Should we have a Metadata class to permit evolution beyond Map<String, 
String>? (E.g. to keep a Class property.)
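
For illustration only, such a Metadata class might start out as a thin 
wrapper:

import java.util.HashMap;
import java.util.Map;

// Entirely hypothetical: a thin wrapper over Map<String, String> that
// leaves room for typed properties (such as a Class) without changing
// any method signatures that accept a Metadata.
public class Metadata {
  private final Map<String, String> entries =
      new HashMap<String, String>();
  private Class<?> theClass;  // example of a typed property

  public String get(String key) {
    return entries.get(key);
  }

  public void set(String key, String value) {
    entries.put(key, value);
  }

  public Class<?> getTheClass() {
    return theClass;
  }

  public void setTheClass(Class<?> theClass) {
    this.theClass = theClass;
  }
}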

4. Where does the metadata come from? In the context of MapReduce, the answer 
depends on the stage of MapReduce. (None of these changes have been implemented 
in the patch.)

i. Map input

The metadata comes from the container. For example, in SequenceFiles the 
metadata comes from the key and value class types, and from the SequenceFile 
metadata (a Map<Text, Text>, which is ideally suited for this scheme).

To support this, SequenceFile.Reader would pass its metadata to the 
deserializer. Similarly, SequenceFile.Writer would add the metadata from the 
BaseSerializer to the SequenceFile's own metadata.
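
A sketch of what the reader side might look like, assuming the "class" key 
convention from 2:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

class SequenceFileMetadataSketch {
  // Build the metadata map for the key deserializer from an open
  // reader: copy the file's Map<Text, Text> metadata across and record
  // the key class under the assumed "class" key.
  static Map<String, String> keyMetadata(SequenceFile.Reader reader) {
    Map<String, String> metadata = new HashMap<String, String>();
    for (Map.Entry<Text, Text> e :
        reader.getMetadata().getMetadata().entrySet()) {
      metadata.put(e.getKey().toString(), e.getValue().toString());
    }
    metadata.put("class", reader.getKeyClassName());
    return metadata;
  }
}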

ii. Map output/Reduce input

The metadata would have to be supplied by the MapReduce framework. Just as we 
have mapred.mapoutput.{key,value}.class, we could have properties to specify 
extra metadata. Since metadata is a map, these could take the form 
mapred.mapoutput.{key,value}.metadata.K, where K can be an arbitrary string.

For example, one might define mapred.mapoutput.key.metadata.avroSchema to be 
the Avro schema for map output key types. To get this to work, we would need 
support from Configuration to get a Map from a property prefix. So 
conf.getMap("mapred.mapoutput.key.metadata") would return a Map<String, String> 
of all the properties under the mapred.mapoutput.key.metadata prefix.
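
Configuration has no getMap today; a sketch of what the helper might look 
like, given that Configuration is iterable over its entries:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;

class ConfigurationMapSketch {
  // Hypothetical helper: collect every property under "<prefix>." into
  // a map, with the prefix stripped from the keys, so that
  // mapred.mapoutput.key.metadata.avroSchema becomes an "avroSchema"
  // entry.
  static Map<String, String> getMap(Configuration conf, String prefix) {
    String dottedPrefix = prefix + ".";
    Map<String, String> result = new HashMap<String, String>();
    for (Map.Entry<String, String> entry : conf) {
      if (entry.getKey().startsWith(dottedPrefix)) {
        result.put(entry.getKey().substring(dottedPrefix.length()),
            entry.getValue());
      }
    }
    return result;
  }
}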

iii. Reduce output

The metadata would have to be supplied by the MapReduce framework. Just like 
the map output, we could have mapred.output.{key,value}.metadata.K properties.

5. Resolution process

To take an Avro example: AvroReflectSerialization's accept method would look 
for a "serialization" key with the value 
org.apache.hadoop.io.serializer.avro.AvroReflectSerialization. The nice thing 
about this is that we don't need a list of packages, or even a base type 
(AvroReflectSerializable). This would only work if we had the mechanisms in 
point 4 so that the metadata is passed around correctly.

Writables are an existing Serialization, so the implementation is different: 
there is plenty of existing data with no extra metadata (in SequenceFiles, for 
instance). Its accept method would check whether the "serialization" key is 
set, and if it is, that its value is 
"org.apache.hadoop.io.serializer.WritableSerialization". If the key is not 
set, it would fall back to the existing check: 
Writable.class.isAssignableFrom(c).
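
Putting the two resolution styles side by side, the accept methods might look 
roughly like this (the "serialization" key name and the method shapes are the 
conventions assumed above, not anything in the patch):

import java.util.Map;

import org.apache.hadoop.io.Writable;

class ResolutionSketch {
  static final String SERIALIZATION_KEY = "serialization";

  // Avro-style resolution: a pure metadata lookup, with no package
  // list or marker base type required.
  static boolean avroReflectAccepts(Map<String, String> metadata) {
    return "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization"
        .equals(metadata.get(SERIALIZATION_KEY));
  }

  // Writable-style resolution: honor the "serialization" key when
  // present, but fall back to the class check for existing data
  // written without metadata.
  static boolean writableAccepts(Class<?> c,
      Map<String, String> metadata) {
    String s = metadata.get(SERIALIZATION_KEY);
    if (s != null) {
      return "org.apache.hadoop.io.serializer.WritableSerialization"
          .equals(s);
    }
    return c != null && Writable.class.isAssignableFrom(c);
  }
}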


> Add metadata to Serializations
> ------------------------------
>
>                 Key: HADOOP-6165
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6165
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: contrib/serialization
>            Reporter: Tom White
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-6165.patch
>
>
> The Serialization framework only allows a class to be passed as metadata. 
> This assumes there is a one-to-one mapping between types and Serializations, 
> which is overly restrictive. By permitting applications to pass arbitrary 
> metadata to Serializations, they can get more control over which 
> Serialization is used, and would also allow, for example, one to pass an Avro 
> schema to an Avro Serialization.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
