Hi Doug,

On Apr 23, 2010, at 1:31pm, Doug Cutting wrote:

Ken Krugler wrote:
3. It would be great to get feedback on both the Avro Cascading scheme (http://github.com/bixolabs/cascading.avro) and the content we're currently saving in the Avro file.

Overall it looks fine to me.

What do you think of https://issues.apache.org/jira/browse/AVRO-513? Would that make your life much easier?

I read through it, but don't understand why "...explicitly detect sequences of matching data" is an issue.

What's the definition of "matching data"? Is there a common use case for Avro where you need to detect duplicates?

It might be more efficient, instead of reading Avro generic data and converting it to your desired representation, to subclass GenericDatumReader and override #readString(), #readBytes(), #readMap(), and #readArray(). Similarly for DatumWriter. But we'd then also need to permit one to configure AvroRecordReader to use a different DatumReader implementation. We might, e.g., add a DataRepresentation interface:

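interface DataRepresentation<T> {
  // Reader that decodes stored data directly into the representation T.
  DatumReader<T> createDatumReader();
  // Writer that encodes the representation T directly into stored data.
  DatumWriter<T> createDatumWriter();
}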

Then we could replace AvroJob#setInputSpecific() and #setInputGeneric() with #setInputRepresentation(Class<DataRepresentation> rep, Schema s). You could subclass GenericDatumReader & Writer and implement a DataRepresentation that returns these.

Worth it?
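
A minimal sketch of how the proposed interface might be wired up, for illustration only. It assumes a job would instantiate the configured representation class reflectively. DatumReader and DatumWriter here are simplified, hypothetical stand-ins that work on byte arrays so the example compiles without Avro (the real Avro interfaces read from a Decoder and write to an Encoder). StringRepresentation, RepresentationDemo and instantiate() are likewise made-up names, not part of Avro.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-ins for Avro's DatumReader/DatumWriter, reduced to
// byte arrays so that this sketch has no dependency on Avro itself.
interface DatumReader<T> { T read(byte[] in); }
interface DatumWriter<T> { byte[] write(T datum); }

// The interface proposed in the message above.
interface DataRepresentation<T> {
  DatumReader<T> createDatumReader();
  DatumWriter<T> createDatumWriter();
}

// Example representation: stored bytes map straight to java.lang.String,
// with no intermediate generic object to convert from.
final class StringRepresentation implements DataRepresentation<String> {
  public DatumReader<String> createDatumReader() {
    return in -> new String(in, StandardCharsets.UTF_8);
  }
  public DatumWriter<String> createDatumWriter() {
    return datum -> datum.getBytes(StandardCharsets.UTF_8);
  }
}

final class RepresentationDemo {
  // Assumption: a record reader would create the configured representation
  // from its class, much as a job configuration would name it.
  static <T> DataRepresentation<T> instantiate(
      Class<? extends DataRepresentation<T>> repClass) {
    try {
      return repClass.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot create " + repClass.getName(), e);
    }
  }

  public static void main(String[] args) {
    DataRepresentation<String> rep = instantiate(StringRepresentation.class);
    byte[] stored = rep.createDatumWriter().write("elastic web mining");
    String restored = rep.createDatumReader().read(stored);
    System.out.println("round trip ok: " + restored.equals("elastic web mining"));
  }
}
```

The point of the factory is that the reader and writer are chosen as a pair, so the code doing the I/O never needs to know which in-memory representation is in use.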

I assume the performance win comes because there's only one conversion to/from the serialized & stored data, versus two.

If so, then it would definitely be faster, but I don't know by how much. It seems like the most likely bottleneck would be with strings, as these need conversion and can be long/common.

I'd either need to hook up a profiler to a typical read or write flow, or disable the string conversion and measure the speedup.
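
As a rough stand-in for the second option, something like the following could give a first-order feel for the cost of string conversion before touching any Avro code. It is only a sketch under stated assumptions: it times plain UTF-8 decoding into java.lang.String against a baseline that leaves the bytes alone, it does not exercise Avro's own string handling, and a hand-rolled timing loop like this is far less reliable than a profiler or a proper benchmark harness. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class StringConversionTiming {
  // Build `count` UTF-8 payloads, each `length` ASCII characters long.
  static byte[][] makePayloads(int count, int length) {
    StringBuilder sb = new StringBuilder(length);
    for (int i = 0; i < length; i++) {
      sb.append((char) ('a' + i % 26));
    }
    byte[] template = sb.toString().getBytes(StandardCharsets.UTF_8);
    byte[][] payloads = new byte[count][];
    for (int i = 0; i < count; i++) {
      payloads[i] = template.clone();
    }
    return payloads;
  }

  // Decode every payload to a String. The total character count is returned
  // so the work has an observable result.
  static long decodeAll(byte[][] payloads) {
    long chars = 0;
    for (byte[] p : payloads) {
      chars += new String(p, StandardCharsets.UTF_8).length();
    }
    return chars;
  }

  // Baseline with the conversion "disabled": only the byte lengths are summed.
  static long skipDecode(byte[][] payloads) {
    long bytes = 0;
    for (byte[] p : payloads) {
      bytes += p.length;
    }
    return bytes;
  }

  public static void main(String[] args) {
    byte[][] payloads = makePayloads(100_000, 200);
    for (int i = 0; i < 5; i++) { // warm-up passes before timing
      decodeAll(payloads);
      skipDecode(payloads);
    }
    long t0 = System.nanoTime();
    long chars = decodeAll(payloads);
    long t1 = System.nanoTime();
    long bytes = skipDecode(payloads);
    long t2 = System.nanoTime();
    System.out.printf("decoded %d chars in %d us%n", chars, (t1 - t0) / 1_000);
    System.out.printf("skipped %d bytes in %d us%n", bytes, (t2 - t1) / 1_000);
  }
}
```

The payload count and length are arbitrary; substituting string sizes typical of the real data would make the comparison more meaningful.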

So no recommendation for now, until I get time to try that out.

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



