[
https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tom White updated HADOOP-6165:
------------------------------
Attachment: HADOOP-6165.patch
Here's a patch with some ideas about how to go about this. It is very
preliminary.
1. One of the problems is that Serialization/Serializer/Deserializer are all
interfaces, which makes it difficult to evolve them. One way to manage this is
to introduce Base{Serialization,Serializer,Deserializer} abstract classes that
implement the corresponding interface. SerializationFactory will read the
io.serializations configuration property; if a serialization extends
BaseSerialization it will be used directly, while a (legacy) Serialization
will be wrapped in a BaseSerialization (see the sketch after this point). The
trick here is to put legacy Serializations at the end of the list, since they
have less metadata and are therefore less discriminating.
The Serialization/Serializer/Deserializer interfaces are all deprecated and can
be removed in a future release, leaving only
Base{Serialization,Serializer,Deserializer}.
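As a very rough sketch of how this could look (illustrative names, not the
patch itself; the legacy wrapping assumes the "class" metadata convention
discussed in point 2):
{code:java}
import java.util.Map;

import org.apache.hadoop.io.serializer.Serialization;

// Sketch only: the new abstract base class. Method names are illustrative.
public abstract class BaseSerialization<T> {
  // Accept based on metadata rather than (only) a class.
  public abstract boolean accept(Map<String, String> metadata);
  // getSerializer/getDeserializer would take metadata too (omitted here).
}

// Sketch only: an adapter presenting a legacy Serialization as a
// BaseSerialization, so SerializationFactory can treat both uniformly.
class LegacySerialization<T> extends BaseSerialization<T> {
  private final Serialization<T> legacy;

  LegacySerialization(Serialization<T> legacy) {
    this.legacy = legacy;
  }

  @Override
  public boolean accept(Map<String, String> metadata) {
    // A legacy Serialization only knows about classes, so recover the class
    // from an assumed "class" metadata entry (see point 2).
    String className = metadata.get("class");
    if (className == null) {
      return false;
    }
    try {
      return legacy.accept(Class.forName(className));
    } catch (ClassNotFoundException e) {
      return false;
    }
  }
}
{code}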
2. In addition to the Map<String, String> metadata do we need to keep the class
metadata? That is, do we need
public abstract boolean accept(Class<?> c, Map<String, String> metadata);
or is the following sufficient?
public abstract boolean accept(Map<String, String> metadata);
We could have a "class" entry in the map which stores this information, but
we'd have to convert it to a Class object to do the isAssignableFrom check that
some serializations need to do, e.g. Writable.class.isAssignableFrom(c).
Perhaps this is OK.
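To make the round-trip concrete (the "class" key name is just an assumed
convention here):
{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Writable;

// Illustration only: storing the class as a string means an accept method
// that needs isAssignableFrom must resolve it back to a Class object.
public class ClassMetadataExample {
  public static void main(String[] args) throws ClassNotFoundException {
    Map<String, String> metadata = new HashMap<String, String>();
    metadata.put("class", "org.apache.hadoop.io.Text"); // Text is a Writable

    // The round-trip: string -> Class -> isAssignableFrom check.
    Class<?> c = Class.forName(metadata.get("class"));
    System.out.println(Writable.class.isAssignableFrom(c)); // prints true
  }
}
{code}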
3. Should we have a Metadata class to permit evolution beyond Map<String,
String>? (E.g. to keep a Class property.) One possible shape is sketched
below.
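Purely illustrative, not a settled API:
{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch only: wrap the string map but leave room for typed properties,
// such as the class, so accept(Metadata) can evolve without changing its
// signature.
public class Metadata {
  private final Map<String, String> map = new HashMap<String, String>();
  private Class<?> declaredClass; // typed property kept alongside the map

  public String get(String key) { return map.get(key); }
  public void set(String key, String value) { map.put(key, value); }
  public Class<?> getDeclaredClass() { return declaredClass; }
  public void setDeclaredClass(Class<?> c) { this.declaredClass = c; }
}
{code}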
4. Where does the metadata come from? In the context of MapReduce, the answer
depends on the stage of MapReduce. (None of these changes have been implemented
in the patch.)
i. Map input
The metadata comes from the container. For example, in SequenceFiles the
metadata comes from the key and value class types, and from the SequenceFile
metadata (a Map<Text, Text>, which is ideally suited to this scheme).
To support this, SequenceFile.Reader would pass its metadata to the
deserializer. Similarly, SequenceFile.Writer would add metadata from the
BaseSerializer to the SequenceFile's own metadata, as in the sketch below.
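The writer-side glue might look something like this (hypothetical helper;
SequenceFile.Metadata is the existing Text-to-Text map):
{code:java}
import java.util.Map;

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical glue: copy serializer metadata into a SequenceFile.Metadata
// so it is persisted in the file header and available to readers.
public class SerializationMetadataBridge {
  public static SequenceFile.Metadata toFileMetadata(Map<String, String> m) {
    SequenceFile.Metadata fileMeta = new SequenceFile.Metadata();
    for (Map.Entry<String, String> e : m.entrySet()) {
      fileMeta.set(new Text(e.getKey()), new Text(e.getValue()));
    }
    return fileMeta;
  }
}
{code}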
ii. Map output/Reduce input
The metadata would have to be supplied by the MapReduce framework. Just as we
have mapred.mapoutput.{key,value}.class, we could have properties to specify
extra metadata. Since metadata is a map, we could use properties of the form
mapred.mapoutput.{key,value}.metadata.K, where K can be an arbitrary string.
For example, one might define mapred.mapoutput.key.metadata.avroSchema to be
the Avro schema for map output key types. To get this to work we would need
support from Configuration to get a Map from a property prefix. So
conf.getMap("mapred.mapoutput.key.metadata") would return a Map<String, String>
of all the properties under the mapred.mapoutput.key.metadata prefix.
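A minimal sketch of such a helper (Configuration has no getMap today; this
assumes only its existing Iterable view over its entries):
{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;

// Sketch only: collect every property under a prefix into a Map whose keys
// have the prefix stripped.
public class ConfigurationMaps {
  public static Map<String, String> getMap(Configuration conf, String prefix) {
    String dotted = prefix.endsWith(".") ? prefix : prefix + ".";
    Map<String, String> result = new HashMap<String, String>();
    for (Map.Entry<String, String> entry : conf) { // Configuration is Iterable
      if (entry.getKey().startsWith(dotted)) {
        result.put(entry.getKey().substring(dotted.length()),
                   entry.getValue());
      }
    }
    return result;
  }
}
{code}
With this, setting mapred.mapoutput.key.metadata.avroSchema in the job
configuration would surface as an "avroSchema" entry in the returned map.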
iii. Reduce output
The metadata would have to be supplied by the MapReduce framework. Just like
the map output we could have mapred.output.{key,value}.metadata.K properties.
5. Resolution process
To take an Avro example: AvroReflectSerialization's accept method would look
for a "serialization" key with the value
org.apache.hadoop.io.serializer.avro.AvroReflectSerialization. The nice thing
about this is that we don't need a list of packages, or even a base type
(AvroReflectSerializable). This would only work if we had the mechanisms in
(4) so that the metadata was passed around correctly.
Writables are an existing Serialization, so the implementation is different:
there is plenty of existing data with no extra metadata (in SequenceFiles,
for instance). Its accept method would check whether the "serialization" key
is set and, if it is, that its value is
"org.apache.hadoop.io.serializer.WritableSerialization". If not set, it would
fall back to the existing check: Writable.class.isAssignableFrom(c). Both
accept methods are sketched below.
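Illustrative bodies only; the "serialization" and "class" key names are the
assumed conventions from above:
{code:java}
import java.util.Map;

import org.apache.hadoop.io.Writable;

// Illustrative accept logic for the resolution process in (5).
public class ResolutionSketch {

  // New-style: AvroReflectSerialization matches purely on metadata.
  static boolean avroReflectAccept(Map<String, String> metadata) {
    return "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization"
        .equals(metadata.get("serialization"));
  }

  // Existing: WritableSerialization honours the key when present, then
  // falls back to the legacy class check for data written without metadata.
  static boolean writableAccept(Map<String, String> metadata) {
    String s = metadata.get("serialization");
    if (s != null) {
      return "org.apache.hadoop.io.serializer.WritableSerialization".equals(s);
    }
    String className = metadata.get("class");
    if (className == null) {
      return false;
    }
    try {
      return Writable.class.isAssignableFrom(Class.forName(className));
    } catch (ClassNotFoundException e) {
      return false;
    }
  }
}
{code}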
> Add metadata to Serializations
> ------------------------------
>
> Key: HADOOP-6165
> URL: https://issues.apache.org/jira/browse/HADOOP-6165
> Project: Hadoop Common
> Issue Type: New Feature
> Components: contrib/serialization
> Reporter: Tom White
> Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: HADOOP-6165.patch
>
>
> The Serialization framework only allows a class to be passed as metadata.
> This assumes there is a one-to-one mapping between types and Serializations,
> which is overly restrictive. By permitting applications to pass arbitrary
> metadata to Serializations, they get more control over which Serialization
> is used. It would also allow, for example, passing an Avro schema to an Avro
> Serialization.