I am all for not hijacking Avro APIs :) Dirty-bits serialization first came up in Hadoop MapReduce, since we have to serialize the data and the mutation state between tasks. I can think of other cases where you may want to serialize object-mutation state, such as passing the objects over the wire, but that is not a big use case (compared to Hadoop).
Currently, we extend Avro's DatumReader/DatumWriter and define a custom Hadoop serialization that uses them, which effectively augments the on-wire data format cleanly. We can accomplish something similar without extending the DatumReaders, by wrapping around them instead. I believe the DatumReader/DatumWriter APIs should be visible, but I am not sure. Otherwise we can use higher-level public APIs to do the serialization.

Cheers,
Enis

On Fri, May 18, 2012 at 2:17 PM, Ed Kohlwey <[email protected]> wrote:

> Enis,
> Thanks for the pointers. Are the dirty bits only used by Map/Reduce or
> for general persistence in terms of application logic? I guess in the
> latter case it's ok for them to be transient, and if the only other use
> case is in Map/Reduce, something could maybe be done in the input and
> output formats to avoid fiddling with the pseudo-official Avro APIs.
>
> On Fri, May 18, 2012 at 2:05 PM, Enis Söztutar <[email protected]> wrote:
>
> > Hi Ed,
> >
> > Good to see some interest in pushing things forward.
> >
> > As the javadoc says, FakeResolvingDecoder is pretty much a big dirty
> > hack to work around Avro's internals, but as you pointed out much has
> > changed in Avro, so we may have to rethink those parts.
> >
> > We need the dirty bits in the serialization for MapReduce, but not for
> > the final serialization at the store (HBase, Cassandra, etc.). The
> > reasoning is that during the map and reduce phases, we may mutate the
> > objects in map, which are then serialized, deserialized in reduce, and
> > used there.
> >
> > I have not spent any time on the changes in Avro for some time, so I
> > cannot comment on what would be the cleanest way to go. Either way, we
> > can augment the schema, or hijack DatumReaders/Writers. If you are
> > willing to work on this, I think it is best to find out what is
> > public / stable in Avro, and extend those parts. When we first wrote
> > these parts, Avro was very young, and it was not clear what the public
> > API was. Maybe consulting the Avro folks, and pushing for changes /
> > hooks in Avro so that things don't break, is a good option.
> >
> > I don't believe we need anything other than dirty bits to be augmented.
> > If you are planning to work on this, feel free to reach out.
> >
> > Cheers,
> > Enis
> >
> > On Fri, May 18, 2012 at 8:45 AM, Ed Kohlwey <[email protected]> wrote:
> >
> >> Hi,
> >> I'm working on updating Gora to Avro 1.7. I've mostly figured out what
> >> I need to do, except what's happening in FakeResolvingDecoder.java.
> >>
> >> Avro now uses a nice factory system which essentially prevents you
> >> from extending some of these core classes, so a different workaround
> >> will have to do.
> >>
> >> It looks like this is basically a way to work around having dirty bits
> >> added to the Avro protocol. Is that right? Has there been any
> >> historical discussion of doing things differently, like augmenting
> >> record schemas to include dirty bits, or making the dirty bits a
> >> transient member of a parent class? Or am I off base here?
> >>
> >> Is there any augmenting done other than dirty bits?
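
For reference, the "wrap instead of extend" idea from the top of this message could look roughly like the sketch below. It is only an illustration under assumptions: `PayloadWriter` and `PayloadReader` are hypothetical stand-ins for whatever public Avro reader/writer API turns out to be usable, `Tracked` and `DirtyBitsCodec` are made-up names, and the framing (a length-prefixed bit set written ahead of the record) is not Gora's actual wire format. It uses only the JDK so that it runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.BitSet;

// Hypothetical stand-ins for the underlying record writer/reader
// (in Gora these would delegate to Avro; here they are plain interfaces).
interface PayloadWriter<T> {
    void write(T datum, DataOutput out) throws IOException;
}

interface PayloadReader<T> {
    T read(DataInput in) throws IOException;
}

// A record together with its per-field dirty bits.
final class Tracked<T> {
    final T datum;
    final BitSet dirty;

    Tracked(T datum, BitSet dirty) {
        this.datum = datum;
        this.dirty = dirty;
    }
}

// Wraps the underlying writer/reader instead of extending them:
// the dirty bits go first, then the unmodified record payload.
final class DirtyBitsCodec<T> {
    private final PayloadWriter<T> writer;
    private final PayloadReader<T> reader;

    DirtyBitsCodec(PayloadWriter<T> writer, PayloadReader<T> reader) {
        this.writer = writer;
        this.reader = reader;
    }

    byte[] serialize(Tracked<T> tracked) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            byte[] bits = tracked.dirty.toByteArray();
            out.writeInt(bits.length);        // length prefix for the bit set
            out.write(bits);                  // the dirty bits themselves
            writer.write(tracked.datum, out); // delegate the record untouched
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Tracked<T> deserialize(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            byte[] bits = new byte[in.readInt()];
            in.readFully(bits);               // read the dirty bits back
            T datum = reader.read(in);        // delegate reading the record
            return new Tracked<>(datum, BitSet.valueOf(bits));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A Hadoop serialization would then call `serialize` / `deserialize` between map and reduce, while the store-side path (HBase, Cassandra, etc.) would keep using the plain writer, so the dirty bits never reach the final persisted format.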

