Hi Doug,
On Apr 23, 2010, at 1:31pm, Doug Cutting wrote:
Ken Krugler wrote:
3. It would be great to get feedback on both the Avro Cascading
scheme (http://github.com/bixolabs/cascading.avro) and the content
we're currently saving in the Avro file.
Overall it looks fine to me.
What do you think of https://issues.apache.org/jira/browse/AVRO-513?
Would that make your life much easier?
I read through it, but don't understand why "...explicitly detect
sequences of matching data" is an issue.
What's the definition of "matching data"? Is there a common use case
for Avro where you need to detect duplicates?
It might be more efficient, instead of reading Avro generic data and
converting it to your desired representation, to subclass
GenericDatumReader and override #readString(), #readBytes(),
#readMap(), and #readArray(). Similarly for DatumWriter. But we'd
then also need to permit one to configure AvroRecordReader to use a
different DatumReader implementation. We might, e.g., add a
DataRepresentation interface:
  interface DataRepresentation<T> {
    DatumReader<T> createDatumReader();
    DatumWriter<T> createDatumWriter();
  }
Then we could replace AvroJob#setInputSpecific() and
#setInputGeneric() with
#setInputRepresentation(Class<DataRepresentation> rep, Schema s).
You could subclass GenericDatumReader & Writer and implement a
DataRepresentation that returns these.
Worth it?
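To make the proposal above concrete, here is a minimal, self-contained sketch of the factory idea. The DatumReader and DatumWriter types in it are simplified one-method stand-ins, not Avro's real interfaces, and StringRepresentation and roundTrip are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

public class DataRepresentationSketch {
    // Simplified stand-ins for Avro's DatumReader/DatumWriter (assumption:
    // the real interfaces work with a Decoder/Encoder, not raw byte arrays).
    interface DatumReader<T> { T read(byte[] encoded); }
    interface DatumWriter<T> { byte[] write(T datum); }

    // The interface proposed in the thread: one factory per in-memory representation.
    interface DataRepresentation<T> {
        DatumReader<T> createDatumReader();
        DatumWriter<T> createDatumWriter();
    }

    // Hypothetical representation that goes straight between stored UTF-8
    // bytes and java.lang.String, with no intermediate generic object.
    static class StringRepresentation implements DataRepresentation<String> {
        @Override
        public DatumReader<String> createDatumReader() {
            return encoded -> new String(encoded, StandardCharsets.UTF_8);
        }

        @Override
        public DatumWriter<String> createDatumWriter() {
            return datum -> datum.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Writes a datum and reads it back using only what the factory provides.
    static <T> T roundTrip(DataRepresentation<T> rep, T datum) {
        byte[] bytes = rep.createDatumWriter().write(datum);
        return rep.createDatumReader().read(bytes);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new StringRepresentation(), "hello"));
    }
}
```

The point of the shape is that a job would only need to be handed a DataRepresentation, and the reader and writer pair would then be chosen in one place.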
I assume the performance win comes because there's only one conversion
to/from the serialized & stored data, versus two.
If so, then it would definitely be faster, but I don't know by how
much. The most likely bottleneck would be strings, since these need
conversion and can be both long and common.
I'd either need to hook up a profiler to a typical read or write flow,
or disable the string conversion and measure the speedup.
So no recommendation for now, until I get time to try that out.
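For the second option, a crude timing harness along these lines could be a starting point. It is only a sketch: it times decoding UTF-8 bytes into a java.lang.String against simply touching the raw bytes, it does not go through Avro at all, and the class name, method names, payload size and iteration count are all arbitrary assumptions. JIT warm-up can skew numbers like these, so a profiler on a real read flow would be more trustworthy.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StringConversionTiming {
    // Builds an ASCII payload of the given length (hypothetical test data).
    static byte[] makePayload(int length) {
        byte[] bytes = new byte[length];
        Arrays.fill(bytes, (byte) 'a');
        return bytes;
    }

    // Nanoseconds spent decoding the payload into a String, repeated `iterations` times.
    static long timeDecode(byte[] utf8, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            String s = new String(utf8, StandardCharsets.UTF_8);
            sink += s.length(); // use the result so the loop is not optimized away
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) throw new IllegalStateException("unexpected overflow");
        return elapsed;
    }

    // Baseline: the same loop with the conversion "disabled" (bytes left as-is).
    static long timeRaw(byte[] utf8, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += utf8.length;
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) throw new IllegalStateException("unexpected overflow");
        return elapsed;
    }

    public static void main(String[] args) {
        byte[] payload = makePayload(4096);
        int iterations = 200_000;
        timeDecode(payload, iterations); // warm-up pass, result discarded
        timeRaw(payload, iterations);
        System.out.println("decode ns: " + timeDecode(payload, iterations));
        System.out.println("raw ns:    " + timeRaw(payload, iterations));
    }
}
```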
Thanks,
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g