Re: Writing GenericRecords w/o saving the schema information

Scott Carey Mon, 28 Sep 2009 13:59:00 -0700

We are also planning on storing Avro-serialized data in k/v stores.  There would 
be no advantage to using Avro if we kept the schema with each value (or 
key, which may also have a schema).  Instead, we will keep the schema in a 
special meta-store, not unlike the header of an Avro file.  Any client can see 
what schema the k/v database uses and act appropriately.


One other big advantage of Avro in this use case is that one can stream in the 
tuple bytes from an Avro file (output from a M/R job, perhaps) and place them 
into alternate stores without having to deserialize and re-serialize the data.  
This would be a mere byte copy, assuming all the right APIs were exposed.  
Having the same low-level format for tuples everywhere is a big win.


On 9/28/09 12:41 PM, "Florian Leibert" wrote:

Hi Doug,
thanks for your response - yeah, I had worked it out. However, I felt there was a 
need for a SeekableByteArrayInput - I filed a JIRA 
(http://issues.apache.org/jira/browse/AVRO-126) and submitted a patch. That was 
really useful when storing things in Voldemort - in the case of a K/V store, it 
may be overkill to always store the schema along...

Thanks,
Florian

On Mon, Sep 28, 2009 at 12:14 PM, Doug Cutting wrote:
Florian Leibert wrote:
I just figured out that I can just use the GenericDatumWriter instead of the 
DataFileWriter - the former doesn't store the schema in the file while the 
latter does.

Florian,

It sounds like you worked this one out for yourself.  Different DatumWriter 
implementations encode equivalent data identically.  They differ in how the 
data is represented in Java, not when serialized.

The best practice with Avro is to store the schema with serialized data, so 
that later, even if the schema in your application has changed, you can still 
read that data.  Avro's data file stores the schema once per file.  Avro RPC 
clients pass the MD5 hash of their schema with each request, and, when a server 
has not seen that version of the schema, the client must resubmit the request 
with the full schema.  If you're, e.g., potentially storing different versions 
of a record in a database, then you might consider annotating each entry with 
the hash of its schema and separately maintaining a table mapping hashes to 
schemas, so that applications can always find the schema that was used to write 
the data when processing it.
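
The hash-annotation scheme described above could be sketched roughly as 
follows, in plain Java with no Avro dependency.  SchemaTable and its method 
names are hypothetical, and hashing the raw JSON text of the schema is an 
assumption made to keep the sketch short: two schemas that differ only in 
whitespace would get different hashes here, so a real implementation would 
want to normalize the schema text first.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: a table mapping MD5 schema hashes to schema JSON. */
final class SchemaTable {
    private static final int HASH_LEN = 16; // an MD5 digest is 16 bytes

    private final Map<String, String> schemasByHash = new HashMap<>();

    /** MD5 of the schema's JSON text (assumption: text is not normalized). */
    static byte[] md5(String schemaJson) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(schemaJson.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Lower-case hex form of the hash, used as the table key. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Records the schema and returns the entry to store: hash, then datum bytes. */
    byte[] annotate(String schemaJson, byte[] datum) {
        byte[] hash = md5(schemaJson);
        schemasByHash.put(hex(hash), schemaJson);
        byte[] entry = new byte[HASH_LEN + datum.length];
        System.arraycopy(hash, 0, entry, 0, HASH_LEN);
        System.arraycopy(datum, 0, entry, HASH_LEN, datum.length);
        return entry;
    }

    /** Looks up the schema an entry was written with, or null if unknown. */
    String writerSchemaOf(byte[] entry) {
        return schemasByHash.get(hex(Arrays.copyOfRange(entry, 0, HASH_LEN)));
    }

    /** Strips the hash prefix, leaving the encoded datum bytes. */
    static byte[] datumOf(byte[] entry) {
        return Arrays.copyOfRange(entry, HASH_LEN, entry.length);
    }
}
```

Each stored entry costs 16 extra bytes regardless of schema size, and a reader 
that finds an unfamiliar hash can fetch the matching schema from the table 
before decoding the datum.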

I hope this helps!

Cheers,

Doug
