What happens when you want to plug in an alternative format?

Thrift allows you to switch from compact binary to JSON formats on the
fly. Having serializers in the objects makes that hard.

Oddly enough, the way that Thrift does this is by putting the serializer in
the code as you suggest, but it isn't code that you have to write, and it can
be pretty sophisticated as a result.
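
For example, with Thrift's Java bindings the wire format is chosen by the
protocol factory handed to TSerializer, so the generated struct code never
changes. A rough sketch ("user" stands in for any Thrift-generated struct
instance; this would live inside a method that handles TException):

  import org.apache.thrift.TSerializer;
  import org.apache.thrift.protocol.TBinaryProtocol;
  import org.apache.thrift.protocol.TJSONProtocol;

  // Same object, two formats; only the protocol factory differs.
  byte[] binaryBytes =
      new TSerializer(new TBinaryProtocol.Factory()).serialize(user);
  byte[] jsonBytes =
      new TSerializer(new TJSONProtocol.Factory()).serialize(user);
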
On Wed, Sep 3, 2008 at 10:11 PM, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> Tom and Jay,
>
> This thread has sparked an interest in me, because I'm in the midst of
> creating a new Protocol Buffer serialization. While the two of you are
> discussing the pros and cons of the current serialization framework, I
> would like to ask a question myself.
>
> I have always let classes define for themselves how they serialize,
> because this seems like a simple approach. Just for the sake of my
> personal advancement, can someone present an argument as to why using a
> serialization framework with a SerializationFactory is a better solution
> than a system where each Writable implementation defines its own
> serialization mechanism?
>
> I have done what research I can to try and get insight into this
> question, but I haven't turned up anything useful. After some thought, I
> may have answered my own question. Let me know if I'm missing any pros
> and cons from the list below:
>
> *Pros for a Writable-defined serialization framework:*
> -Simple (to plug in, all one must do is implement the Writable interface;
> sketched below)
> -Intuitive
> -Efficient (the class knows its data best, so it should decide the best
> approach to storing that data)
>
> *Cons:*
> -Refactoring danger (if a class changes its serialization mechanism, then
> stored legacy instances are lost)
> -Not as flexible (if one wants to switch from Protocol Buffers to Thrift,
> then each Writable class would need to change)
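>
> To make the first approach concrete, here is roughly what I mean by
> Writable-defined serialization (a minimal sketch; the Writable interface
> is the real org.apache.hadoop.io.Writable, but UserWritable is made up):
>
>   import java.io.DataInput;
>   import java.io.DataOutput;
>   import java.io.IOException;
>   import org.apache.hadoop.io.Writable;
>
>   public class UserWritable implements Writable {
>     private String name;
>     private int age;
>
>     public void write(DataOutput out) throws IOException {
>       out.writeUTF(name); // the wire format is decided right here...
>       out.writeInt(age);  // ...so changing it orphans stored data
>     }
>
>     public void readFields(DataInput in) throws IOException {
>       name = in.readUTF();
>       age = in.readInt();
>     }
>   }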
>
> Again, I'm mostly sending this email to expand my knowledge of good O-O
> approaches, as I'm a newcomer to Hadoop.
>
> Thanks,
>
> Alex
>
> On Thu, Sep 4, 2008 at 6:52 AM, Jay Kreps <[EMAIL PROTECTED]> wrote:
>
> > Hi Tom,
> >
> > Your concern about cross-dependencies makes sense. To state the
> > problem more generally, a serializer is a mapping of on-disk
> > structures to in-memory structures; one way to maintain this mapping
> > is to pregenerate a class for each mapping using some IDL like thrift,
> > but another common way is just to store the mapping as data. Both
> > approaches have pluses and minuses (the usual trade-offs of
> > code-generation vs reflection) but I believe the current interface
> > only allows the code generation approach.
> >
> > To remove the dependence on MapReduce from what I had below in (3), you
> > would change the SerializationFactory method from
> >
> >   SerializationFactory.getSerialization(Class<T> c)
> >
> > to
> >
> >   SerializationFactory.getSerialization(Class<T> c, String s)
> >
> > and, to use it in MapTask when you create the serializer for the
> > MapOutputBuffer, instead of
> >
> >   keySerializer =
> >       serializationFactory.getSerializer(job.getMapOutputKeyClass());
> >
> > you would have
> >
> >   keySerializer =
> >       serializationFactory.getSerializer(job.getMapOutputKeyClass(),
> >           job.get("mapred.mapper.serialization.info"));
> >
> > And similarly for reducers and deserializers. The parameter would
> > contain any additional schema information you needed to disambiguate a
> > class such as Object, List, or Map that might have different
> > serializations depending on its contents (e.g. it might get passed
> > List.class and "list(int32)" to instruct the serializer to read a
> > list of integers, or Map.class and "map(int32, list(string))" to
> > indicate a map of integers to lists of strings). Then implementations
> > could choose to use the string parameter, the class parameter, or both
> > when constructing the Serialization.
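> >
> > As a rough sketch of what an implementation might do with that string
> > (ListSerializer, MapSerializer, DefaultSerializer and Serializer here
> > are made-up names, not existing Hadoop classes):
> >
> >   // Hypothetical: dispatch on both the class and the schema string.
> >   public Serializer<?> getSerializer(Class<?> c, String schema) {
> >     if (java.util.List.class.isAssignableFrom(c)
> >         && schema.startsWith("list(")) {
> >       // e.g. "list(int32)" -> element type "int32"
> >       return new ListSerializer(schema.substring(5, schema.length() - 1));
> >     }
> >     if (java.util.Map.class.isAssignableFrom(c)
> >         && schema.startsWith("map(")) {
> >       // e.g. "map(int32, list(string))" -> "int32, list(string)"
> >       return new MapSerializer(schema.substring(4, schema.length() - 1));
> >     }
> >     return new DefaultSerializer(c); // fall back to class-only dispatch
> >   }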
> >
> > Thanks for the pointer to Jaql, that seems very cool, but I believe
> > Jaql would have the same problem if they tried to implement any kind
> > of compact structured storage. Jaql would return a JArray or JRecord
> > which might have a variety of fields, and you would want to store the
> > information about what kinds of fields those are separately.
> >
> > Thanks,
> >
> > -Jay
> >
> > > From: "Tom White" <[EMAIL PROTECTED]>
> > > To: [email protected]
> > > Date: Wed, 3 Sep 2008 14:15:28 +0100
> > > Subject: Re: Serialization with additional schema info
> > > Jay,
> > >
> > > The Serialization and MapReduce APIs are very class-based, so having
> > > fixed types with dynamic serialization capabilities doesn't work as
> > > well in the current design.
> > >
> > > I like 2 better than 1, but both make the Serialization API dependent
> > > on MapReduce, which it currently isn't. And arguably it shouldn't be,
> > > as you could use it simply to serialize data outside a MapReduce
> > > context. Perhaps SerializerType is just a String, which also makes
> > > things more flexible (at the expense of type-safety)?
> > >
> > > What would the API changes look like for 3?
> > >
> > > Also, I believe the Jaql team has been looking at how to write JSON
> > > serializers, so perhaps there is an opportunity for collaboration
> > > here?
> > >
> > > Tom
> > >
> > > On Mon, Sep 1, 2008 at 9:52 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> > > > Hi All,
> > > >
> > > > I am interested in hooking up a custom serialization layer I use to
> > > > the new pluggable Hadoop serialization framework. It appears that
> > > > the framework assumes there is a one-to-one mapping between Java
> > > > classes and serializations. This is exactly what we want to get away
> > > > from: having a common data format allows us to easily write generic
> > > > data aggregation jobs that work with any type. This is exactly how a
> > > > database supports generic operations such as joins, group-bys, etc.:
> > > > because the data format is always a set of tuples, it can be
> > > > manipulated generically without understanding any details of its
> > > > interpretation, unlike the user-defined complex types a db can't
> > > > operate on. To do this I need to store data in a standard way with
> > > > supported types, keep a short string schema description along with
> > > > each file, and pass that description to a generic
> > > > serializer/deserializer to tell it how to read the bytes in the
> > > > file. The problem I have is that there is no way to get this
> > > > additional schema information into the serializer to tell it how to
> > > > serialize and deserialize.
> > > >
> > > > Some details in case the general problem is too vague:
> > > >
> > > > A very nice generic data format that maps well to programming
> > > > languages is JSON. For example a user could be stored like this:
> > > > {"name":"Jay", "date-o-birth":"05-25-1980", "age":28,
> > > > "is_active":true, etc.}. But since we store the same fields with
> > > > each "row", this is highly inefficient. It makes more sense to just
> > > > store the necessary bytes for the values, and store what fields we
> > > > are expecting, and their expected types, separately. This lets us
> > > > store numbers compactly as well.
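> > > >
> > > > With the schema stored once per file, each row can then be written
> > > > as bare values. A minimal sketch (DataOutput is just java.io; the
> > > > method itself is made up):
> > > >
> > > >   // With {"name":"string", "age":"int32", "is_active":"boolean"}
> > > >   // stored once in the file header, a row is only the value bytes.
> > > >   void writeRow(java.io.DataOutput out, String name, int age,
> > > >       boolean active) throws java.io.IOException {
> > > >     out.writeUTF(name);       // no field names repeated per row
> > > >     out.writeInt(age);        // 4 bytes instead of the text "age":28
> > > >     out.writeBoolean(active); // 1 byte instead of "is_active":true
> > > >   }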
> > > >
> > > > JSON supports numbers, strings, lists, and maps, which all have
> > > > natural mappings in Java. The above user example would translate to
> > > > a Java Map containing the given keys and values.
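> > > >
> > > > For instance (plain Java, just to show the mapping):
> > > >
> > > >   java.util.Map<String, Object> user =
> > > >       new java.util.HashMap<String, Object>();
> > > >   user.put("name", "Jay");
> > > >   user.put("date-o-birth", "05-25-1980");
> > > >   user.put("age", 28);          // autoboxed Integer
> > > >   user.put("is_active", true);  // autoboxed Boolean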
> > > >
> > > > Here is where the trouble starts. I can't do this in the existing
> > > > SerializationFactory because the type for the object is just
> > > > Map.class, but that doesn't contain enough info to properly
> > > > deserialize the class. In reality I need a string describing the
> > > > type, such as
> > > > {"name":"string", "date-o-birth":"date", "age":"int32",
> > > > "is_active":"boolean", ...}
> > > > Note that this string contains all the information needed to add in
> > > > the property names and to correctly interpret the bytes as Integer
> > > > or Boolean, or whatever.
> > > >
> > > > The obvious solution is to just add this schema into the JobConf as
> > > > a property such as "map.key.schema.info", and use it to construct
> > > > the right serializer in the Serialization implementation. The
> > > > problem with this is that there is no way for the Serialization
> > > > implementation to know whether it is constructing the map key, map
> > > > value, reduce key, or reduce value.
> > > >
> > > > Some possible solutions:
> > > >
> > > > For now I am just sticking with wrapping up map and reduce to do
> > > > the serialization/deserialization to solve my problem. However,
> > > > this seems like a common case where the serialization needs
> > > > information not present in the class itself, and I would like to
> > > > add support to do it right. Would you guys accept a patch that did
> > > > one of the following:
> > > >
> > > > 1. Make SerializationFactory have a getMapKeySerializer,
> > > > getMapValueSerializer, etc. method and allow the user to specify
> > > > their own SerializationFactory by setting a property with the
> > > > appropriate class name. This is probably the most flexible and
> > > > doesn't break any user serialization implementations. The
> > > > getMapKeySerializer method can then check the map.key.schema.info
> > > > in addition to mapred.mapinput.key.class.
> > > > 2. Change Serialization.getSerializer(Class c) to
> > > > Serialization.getSerializer(Class c, SerializerType k) where
> > > > SerializerType = enum {MapKey, MapValue, ReduceKey, ReduceValue}.
> > > > This allows the serialization implementer to invent their own
> > > > properties (map.key.schema or whatever) and fetch the appropriate
> > > > thing.
> > > > 3. Add mapred.mapinput.serializer.info,
> > > > mapred.reduceinput.serializer.info, etc. and pass the value of this
> > > > into the constructor of the serializer if it has a constructor with
> > > > a single string argument (a sketch of this follows below).
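> > > >
> > > > For (3), the plumbing could be as simple as (a sketch using plain
> > > > reflection; the method name is made up):
> > > >
> > > >   // Prefer a single-String-argument constructor when one exists,
> > > >   // otherwise fall back to the no-arg constructor.
> > > >   static Object newSerializer(Class<?> serializerClass, String info)
> > > >       throws Exception {
> > > >     try {
> > > >       return serializerClass.getConstructor(String.class)
> > > >           .newInstance(info);
> > > >     } catch (NoSuchMethodException e) {
> > > >       return serializerClass.newInstance();
> > > >     }
> > > >   }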
> > > >
> > > > Or maybe there is a better way to accomplish this?
> > > >
> > > > Thanks!
> > > >
> > > > -Jay
> > > >
> >
>
--
ted