I am all for not hijacking Avro's APIs :)

Dirty-bits serialization first came up in Hadoop MapReduce, since we have
to serialize both the data and the mutation state between tasks. I can
think of other cases where you may want to serialize object-mutation state
when passing objects over the wire, but it is not a big use case (compared
to Hadoop).
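For readers following along, per-field dirty tracking might look roughly like this. This is a minimal sketch only; the class and field names are illustrative and not Gora's actual API:

```java
import java.util.BitSet;

// Illustrative sketch of per-field dirty tracking: each setter flips a bit
// so the framework knows which fields were mutated and must be persisted.
class DirtyRecord {
    private String name;
    private int count;
    private final BitSet dirtyBits = new BitSet(2); // one bit per field

    public void setName(String name) { this.name = name; dirtyBits.set(0); }
    public void setCount(int count) { this.count = count; dirtyBits.set(1); }

    public boolean isDirty(int field) { return dirtyBits.get(field); }
    public void clearDirty() { dirtyBits.clear(); }

    public String getName() { return name; }
    public int getCount() { return count; }
}
```

The point of the MapReduce case is that this `dirtyBits` state has to survive the map-to-reduce hop, so it must be part of the serialized form rather than a transient in-memory member.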

Currently, we extend Avro's DatumReader/Writers and define a custom Hadoop
serialization that uses them, which effectively augments the on-wire data
format cleanly. We could accomplish a similar thing without extending
DatumReaders by wrapping around them instead. I believe the
DatumReader/Writer APIs should be visible, but I am not sure. Otherwise we
can use higher-level public APIs to do the serialization.
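The wrapping idea could be sketched roughly as below. Note this uses a simplified stand-in `Writer` interface for illustration (Avro's real `DatumWriter` takes an `Encoder`, not a raw stream), and all names here are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Simplified stand-in for a datum-writer interface (not Avro's actual API).
interface Writer<T> {
    void write(T datum, DataOutputStream out) throws IOException;
}

// An inner writer that serializes only the payload, with no mutation state.
class PayloadWriter implements Writer<String> {
    public void write(String datum, DataOutputStream out) throws IOException {
        out.writeUTF(datum);
    }
}

// The wrapper prepends the dirty-bit mask, then delegates to the inner
// writer, augmenting the wire format without subclassing the inner type.
class DirtyBitsWriter<T> implements Writer<T> {
    private final Writer<T> inner;
    private final byte dirtyBits;

    DirtyBitsWriter(Writer<T> inner, byte dirtyBits) {
        this.inner = inner;
        this.dirtyBits = dirtyBits;
    }

    public void write(T datum, DataOutputStream out) throws IOException {
        out.writeByte(dirtyBits); // mutation state travels with the data
        inner.write(datum, out);  // inner serialization is left untouched
    }
}
```

The design advantage over subclassing is that the wrapper works with any writer the factory hands back, which matters if Avro's factory system prevents extending the concrete classes.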

Cheers,
Enis

On Fri, May 18, 2012 at 2:17 PM, Ed Kohlwey <[email protected]> wrote:

> Enis,
> Thanks for the pointers. Are the dirty bits only used by Map/Reduce or
> for general persistence in terms of application logic? I guess in the
> latter case it's ok for them to be transient, and if the only other use
> case is in Map/Reduce, something could maybe be done in the input and
> output formats to avoid fiddling with the pseudo-official Avro APIs.
>
> On Fri, May 18, 2012 at 2:05 PM, Enis Söztutar <[email protected]>
> wrote:
> > Hi Ed,
> >
> > Good to see some interest in pushing things forward.
> >
> > As the javadoc says, FakeResolvingDecoder is pretty much a big dirty hack
> > to work around Avro's internals, but as you pointed out much has changed
> > in Avro, so we may have to rethink those parts.
> >
> > We need the dirty bits in the serialization for mapreduce, but not for
> > the final serialization at the store (hbase, cassandra, etc). The
> > reasoning is that during the map and reduce phases, we may mutate the
> > objects in map, which are then serialized, deserialized in reduce, and
> > used there.
> >
> > I have not spent any time on the change in avro for some time, so I
> > cannot comment on what would be the cleanest way to go. Either way, we
> > can augment the schema, or hijack DatumReaders/Writers. If you are
> > willing to work on this, I think it is best to find out what is public
> > and stable in avro, and extend those parts. When we first wrote these
> > parts, avro was very young, and it was not clear what the public API
> > was. Maybe consulting the avro folks, and pushing for changes / hooks
> > in avro so that things don't break, is a good option.
> >
> > I don't believe we need anything other than dirty bits to be augmented.
> > If you are planning to work on this, feel free to reach out.
> >
> > Cheers,
> > Enis
> >
> > On Fri, May 18, 2012 at 8:45 AM, Ed Kohlwey <[email protected]> wrote:
> >
> >> Hi,
> >> I'm working on updating Gora to Avro 1.7. I've mostly figured out what
> >> I need to do, except what's happening in FakeResolvingDecoder.java.
> >>
> >> Avro now uses a nice factory system which essentially prevents you
> >> from extending some of these core classes, so a different workaround
> >> will have to do.
> >>
> >> It looks like this is basically a way to work around having dirty bits
> >> added to the Avro protocol. Is that right? Has there been any
> >> historical discussion of doing things differently like augmenting
> >> record schemas to include dirty bits, or making the dirty bits a
> >> transient member of a parent class? Or am I off base here?
> >>
> >> Is there any augmenting done other than dirty bits?
> >>
>
