We are also planning on storing Avro-serialized data in k/v stores. There would be no advantage to using Avro if we kept the schema with each value (or key, which may also have a schema). Instead, we will keep the schema in a special meta-store, not unlike the header of an Avro file. Any client can then see what schema the k/v database uses and act appropriately.
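As a rough illustration of the meta-store idea, here is a minimal Java sketch. It is not tied to any real k/v client: the class `SchemaMetaStoreSketch`, the reserved key `__value_schema__`, and the in-memory maps are all hypothetical stand-ins, and the values are treated as opaque, already-encoded Avro bytes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a k/v store plus its meta-store.
// A real store (Voldemort, for example) has its own client API.
final class SchemaMetaStoreSketch {
    // Reserved meta key for the store-wide value schema (the name is an assumption).
    static final String SCHEMA_META_KEY = "__value_schema__";

    private final Map<String, String> meta = new HashMap<>();
    private final Map<String, byte[]> data = new HashMap<>();

    // The schema JSON is written once per store, like the header of an Avro data file.
    SchemaMetaStoreSketch(String valueSchemaJson) {
        meta.put(SCHEMA_META_KEY, valueSchemaJson);
    }

    // Any client can look up the schema before decoding values.
    String valueSchema() {
        return meta.get(SCHEMA_META_KEY);
    }

    // Values are stored as raw Avro binary with no per-value schema.
    void put(String key, byte[] avroBinary) {
        data.put(key, avroBinary.clone());
    }

    // Returns a copy of the stored bytes, or null if the key is absent.
    byte[] get(String key) {
        byte[] value = data.get(key);
        return value == null ? null : value.clone();
    }
}
```

Because `put` never inspects the bytes, a value read from elsewhere in Avro binary form could be passed through unchanged, as long as it was written with the schema recorded in the meta-store.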
One other big advantage of Avro in this use case is that one can stream in the tuple bytes from an Avro file (output from an M/R job, perhaps) and place them into alternate stores without having to deserialize and re-serialize the data. This would be a mere byte copy, assuming all the right APIs were exposed. Having the same low-level format for tuples everywhere is a big win.

On 9/28/09 12:41 PM, "Florian Leibert" wrote:

Hi Doug,

thanks for your response - yeah, I had worked it out. However, I felt there was a need for a SeekableByteArrayInput - I filed a JIRA (http://issues.apache.org/jira/browse/AVRO-126) and submitted a patch. That was really useful when storing things in Voldemort - in the case of a K/V store, it may be overkill to always store the schema along...

Thanks,
Florian

On Mon, Sep 28, 2009 at 12:14 PM, Doug Cutting wrote:

Florian Leibert wrote:
I just figured out that I can just use the GenericDatumWriter instead of the DataFileWriter - the former doesn't store the schema in the file while the latter does.

Florian,

It sounds like you worked this one out for yourself. Different DatumWriter implementations encode equivalent data identically. They differ in how the data is represented in Java, not when serialized.

The best practice with Avro is to store the schema with serialized data, so that later, even if the schema in your application has changed, you can still read that data. Avro's data file stores the schema once per file. Avro RPC clients pass the MD5 hash of their schema with each request, and, when a server has not seen that version of the schema, the client must resubmit the request with the full schema.
If you're, e.g., potentially storing different versions of a record in a database, then you might consider annotating each entry with the hash of its schema and separately maintaining a table mapping hashes to schemas, so that applications can always find the schema that was used to write the data when processing it.

I hope this helps!

Cheers,
Doug
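The hash-annotation scheme Doug describes could be sketched roughly as follows. This is only an illustration under stated assumptions: the class `SchemaHashTableSketch` and its methods are hypothetical, the "table" is an in-memory map, and the MD5 is taken over the raw schema JSON text (so two textually different but equivalent schemas would hash differently; a real system would probably hash a normalized form). The payload is treated as opaque Avro binary.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: each stored entry is [16-byte MD5 of writer schema][Avro binary].
final class SchemaHashTableSketch {
    private static final int HASH_LEN = 16; // MD5 digest length in bytes

    // Separately maintained table: hex-encoded schema hash -> schema JSON.
    private final Map<String, String> schemasByHash = new HashMap<>();

    static byte[] md5(String schemaJson) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(schemaJson.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Registers the writer schema if it is new, then prefixes the payload with its hash.
    byte[] wrap(String writerSchemaJson, byte[] avroBinary) {
        byte[] hash = md5(writerSchemaJson);
        schemasByHash.putIfAbsent(hex(hash), writerSchemaJson);
        byte[] entry = new byte[HASH_LEN + avroBinary.length];
        System.arraycopy(hash, 0, entry, 0, HASH_LEN);
        System.arraycopy(avroBinary, 0, entry, HASH_LEN, avroBinary.length);
        return entry;
    }

    // Finds the schema that was used to write an entry, or null if the hash is unknown.
    String writerSchemaOf(byte[] entry) {
        return schemasByHash.get(hex(Arrays.copyOfRange(entry, 0, HASH_LEN)));
    }

    // Strips the hash prefix and returns the Avro binary payload.
    static byte[] payloadOf(byte[] entry) {
        return Arrays.copyOfRange(entry, HASH_LEN, entry.length);
    }
}
```

A reader would call `writerSchemaOf` on each entry to recover the writer's schema and then decode `payloadOf(entry)` against it, which keeps old entries readable after the application's schema has changed. The cost is 16 bytes per entry rather than a full schema.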
