Re: HUG talk on PTD/Avro

Ken Krugler Mon, 26 Apr 2010 13:12:52 -0700

Hi Doug,

On Apr 23, 2010, at 1:31pm, Doug Cutting wrote:

Ken Krugler wrote:
3. It would be great to get feedback on both the Avro Cascading scheme (http://github.com/bixolabs/cascading.avro) and the content we're currently saving in the Avro file.
Overall it looks fine to me.
What do you think of https://issues.apache.org/jira/browse/AVRO-513? Would that make your life much easier?

I read through it, but don't understand why "...explicitly detect sequences of matching data" is an issue.

What's the definition of "matching data"? Is there a common use case for Avro where you need to detect duplicates?

It might be more efficient, instead of reading Avro generic data and converting it to your desired representation, to subclass GenericDatumReader and override #readString(), #readBytes(), #readMap(), and #readArray(). Similarly for DatumWriter. But we'd then also need to permit one to configure AvroRecordReader to use a different DatumReader implementation. We might, e.g., add a DataRepresentationFactory interface:
interface DataRepresentation<T> {
 DatumReader<T> createDatumReader();
 DatumWriter<T> createDatumWriter();
}

Then we could replace AvroJob#setInputSpecific() and #setInputGeneric() with #setInputRepresentation(Class<DataRepresentation> rep, Schema s). You could subclass GenericDatumReader & Writer and implement a DataRepresentation that returns these.
Worth it?

I assume the performance win comes because there's only one conversion to/from the serialized & stored data, versus two.

If so, then it would definitely be faster, but I don't know by how much. It seems like the most likely bottleneck would be with strings, as these need conversion and can be long/common.

I'd either need to hook up a profiler to a typical read or write flow, or disable the string conversion and measure the speedup.
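
As a rough illustration of the second option, here is a minimal sketch of a timing harness that compares reading a string field with and without the UTF-8 decode step. It uses only the JDK and does not touch Avro itself. The class name StringDecodeTiming, the 4 KB payload, and the iteration count are all assumptions made for illustration, and a naive loop like this is only a first approximation compared to a real profiler run:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical harness: not part of Avro or cascading.avro.
public class StringDecodeTiming {

    // Build an ASCII payload standing in for one serialized string field.
    static byte[] makePayload(int length) {
        byte[] payload = new byte[length];
        Arrays.fill(payload, (byte) 'a');
        return payload;
    }

    // Path A: decode the UTF-8 bytes into a java.lang.String on every read.
    static long readWithDecode(byte[] payload, int iterations) {
        long totalChars = 0;
        for (int i = 0; i < iterations; i++) {
            String s = new String(payload, StandardCharsets.UTF_8);
            totalChars += s.length();
        }
        return totalChars;
    }

    // Path B: string conversion "disabled"; only the raw bytes are copied.
    static long readWithoutDecode(byte[] payload, int iterations) {
        long totalBytes = 0;
        for (int i = 0; i < iterations; i++) {
            byte[] copy = payload.clone();
            totalBytes += copy.length;
        }
        return totalBytes;
    }

    public static void main(String[] args) {
        byte[] payload = makePayload(4096);   // assumed field size
        int iterations = 200_000;             // assumed record count

        // Warm up both paths so the JIT has compiled them before timing.
        readWithDecode(payload, iterations);
        readWithoutDecode(payload, iterations);

        long t0 = System.nanoTime();
        long chars = readWithDecode(payload, iterations);
        long t1 = System.nanoTime();
        long bytes = readWithoutDecode(payload, iterations);
        long t2 = System.nanoTime();

        System.out.printf("with decode:    %d ms (%d chars)%n", (t1 - t0) / 1_000_000, chars);
        System.out.printf("without decode: %d ms (%d bytes)%n", (t2 - t1) / 1_000_000, bytes);
    }
}
```

The gap between the two timings would give an upper bound on what skipping the string conversion could save for payloads of that size; the real number depends on the actual data and the rest of the read path.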


So no recommendation for now, until I get time to try that out.

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
