Re: Schema evolution in Gora

Alparslan Avcı Thu, 10 Apr 2014 04:49:36 -0700

Hi folks,

I also think that "schema evolution over time" is an important problem that we should handle. Because of this, it is really hard to extend the data schema on any application which uses Gora. We've experienced this in Nutch.


About the proposed solutions:

- "Should we store the Schema along with the data?" -> IMHO, we should store the schema, but we should also discuss the way that we store it. Talat's 'recipe' can be a good option for this. Moreover, I am thinking of storing all field schemas separately instead of storing the persistent schema in one piece. Although storing every field schema is more complex than storing only one big persistent schema, it will give us more extensibility and easier backward compatibility. And again for field schemas, we should discuss the way of storing them (serialized or not serialized? stored where? etc.).


- "Should we store a Hash of the Schema along with the data? Should we support Schema
versioning? Should we support Schema fingerprinting?" -> We may need to support
schema versioning, since it may help to compare evolved schemas. But if we store the
schema, we won't need to store the hash or support fingerprinting, I think.
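
To make the per-field idea a bit more concrete, here is a minimal sketch in Python. It is not Gora code: the dict-based store, the write_record/read_record helpers and the cell layout ("schema" next to "value") are all assumptions made up for illustration. It only shows that rows written before and after a schema change can each be read back using nothing but what is stored in the row itself.

```python
import json

def write_record(store, key, record, field_schemas):
    """Store every field value next to the JSON schema of that single field."""
    row = store.setdefault(key, {})
    for name, value in record.items():
        row[name] = {"schema": json.dumps(field_schemas[name]), "value": value}

def read_record(store, key):
    """Return (values, schemas) using only what is stored in the row itself."""
    row = store[key]
    values = {name: cell["value"] for name, cell in row.items()}
    schemas = {name: json.loads(cell["schema"]) for name, cell in row.items()}
    return values, schemas

store = {}  # hypothetical stand-in for a real datastore

# Field schemas before and after adding a 'batchId' field (illustrative only).
schemas_v1 = {"url": {"name": "url", "type": "string"}}
schemas_v2 = dict(
    schemas_v1,
    batchId={"name": "batchId", "type": ["null", "string"], "default": None},
)

write_record(store, "row1", {"url": "http://example.org/"}, schemas_v1)
write_record(store, "row2", {"url": "http://example.org/a", "batchId": "b-42"}, schemas_v2)

old_values, old_schemas = read_record(store, "row1")
new_values, new_schemas = read_record(store, "row2")
```

The cost is the one mentioned above: every cell carries its own schema, so storage and bookkeeping grow with the number of fields.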


Alparslan


On 08-04-2014 14:57, Talat Uyarer wrote:

Hi all,

IMHO we can store a NEW field called "recipe of persistent" with each
written record. The recipe field stores information about which field has
been serialized with which serializer. It is itself serialized with the
string serializer. Every time data is read from the store, the recipe is
deserialized, and the data object is generated from this recipe's
schema. The recipe field is similar to the persistent's schema, but it
has some different definitions and extra information about fields. For
example, say the persistent's schema has a union field like the one below:

{"name": "name", "type": ["null","string"],"default":null}

If it is serialized by the string serializer, it is written in the recipe field as:

{"name": "name", "type": "string","default":null}

Thus the name field can be deserialized without the persistent's schema.
Another benefit: if the persistent's schema is changed, we can still
deserialize without any extra information.
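
The recipe idea above could be sketched roughly as follows. This is only an illustration in Python, not Gora's serializer API: build_recipe, concrete_type and the type table are hypothetical names, and only simple unions of primitive types are handled. For each field, the union in the persistent's schema is narrowed to the single branch that matches the value actually written.

```python
import json

# Assumed mapping from Avro primitive type names to Python types.
_PY_TYPES = {
    "null": type(None), "boolean": bool, "int": int, "long": int,
    "float": float, "double": float, "string": str, "bytes": bytes,
}

def concrete_type(avro_type, value):
    """Narrow a simple union such as ["null", "string"] to the branch used by value."""
    if not isinstance(avro_type, list):
        return avro_type  # not a union, nothing to narrow
    for branch in avro_type:
        py_type = _PY_TYPES.get(branch) if isinstance(branch, str) else None
        if py_type is None:
            continue  # complex branches are out of scope for this sketch
        if isinstance(value, bool) and py_type is not bool:
            continue  # bool is a subclass of int in Python, keep them apart
        if isinstance(value, py_type):
            return branch
    raise ValueError("no union branch matches value %r" % (value,))

def build_recipe(schema, record):
    """Return the recipe as a JSON string, one narrowed entry per schema field."""
    fields = []
    for field in schema["fields"]:
        entry = dict(field)
        entry["type"] = concrete_type(field["type"], record.get(field["name"]))
        fields.append(entry)
    return json.dumps({"fields": fields})

persistent_schema = {
    "fields": [{"name": "name", "type": ["null", "string"], "default": None}]
}
recipe = build_recipe(persistent_schema, {"name": "Gora"})
```

With a string value, the recipe entry for "name" carries "type": "string", matching the example above; with a missing or null value it would carry "type": "null".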

I hope this is understandable. :)

Talat

2014-04-08 12:11 GMT+03:00 Henry Saputra <[email protected]>:

Technically it was named after a dog, hence the logo, which just happens to
match that abbreviation :)

On Tuesday, April 1, 2014, Renato Marroquín Mogrovejo <
[email protected]> wrote:

Hi Lewis,

This is for sure very interesting, and something that GORA should deal
with.
It is funny that only now I found out that GORA actually means "Generic
Object Representation using Avro". Does this mean that we will always have to
use Avro for everything? Never mind, we can all discuss this when the
time comes.
From the little reading I did about data evolution:
- Schema along with data -> This could be done in a similar way as we are
approaching the union fields, i.e. append an extra field to the data with
its schema, deserialize the schema, and then check if the data can actually
satisfy the query or not. Of course this would be part of 0.5 :)
- Hash of the Schema along with the data, Schema versioning, Schema
fingerprinting ->
This needs some way of looking up saved schemas (versions, hashes, or
schema fingerprints).
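
A minimal sketch of such a lookup, in Python. Everything here is an assumption for illustration: the SchemaRegistry class is hypothetical, and the fingerprint is a plain SHA-256 over key-sorted JSON. Avro specifies its own Parsing Canonical Form and fingerprint algorithms, which this sketch does not implement.

```python
import hashlib
import json

def fingerprint(schema):
    """SHA-256 over a key-sorted, whitespace-free JSON rendering of the schema."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class SchemaRegistry:
    """Hypothetical in-memory lookup from fingerprint to saved schema."""

    def __init__(self):
        self._by_fingerprint = {}

    def register(self, schema):
        fp = fingerprint(schema)
        self._by_fingerprint[fp] = schema
        return fp

    def lookup(self, fp):
        return self._by_fingerprint[fp]

registry = SchemaRegistry()
schema_v1 = {"type": "record", "name": "WebPage",
             "fields": [{"name": "url", "type": "string"}]}
fp_v1 = registry.register(schema_v1)

# A record would then carry only fp_v1; the reader resolves it to the schema.
stored = {"schema_fp": fp_v1, "data": {"url": "http://example.org/"}}
writer_schema = registry.lookup(stored["schema_fp"])
```

The data stays small because only the fingerprint travels with each record, but the registry itself then has to be stored somewhere that every reader can reach.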


Renato M.


2014-04-01 16:47 GMT+02:00 Lewis John Mcgibbney <[email protected]>:
Hi Folks,
I've ended up in a conversation [0] over on user@avro regarding Schema
evolution.
Right now our workflow is as follows

  * write .avsc schema and use GoraCompiler to generate Persistent data
beans.
  * use the Persistent class whenever we wish to read from or write to the
data.

AFAICT, as explained in [0], this presents us with a problem. Namely that
we have very sketchy support for Schema evolution over time.

We narrowly avoided a minor situation over in Nutch when we added a
'batchId' Field to our WebPage Schema, as some Tools were attempting to read
Fields which were simply not present for some records.
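
To illustrate the kind of failure described above, here is a small Python sketch. It is not Nutch or Gora code: the resolve helper and the cut-down schema are hypothetical. It follows the general idea of Avro's schema resolution, where a field that is missing from the written data is filled from the reader schema's default, and reading fails only when no default is declared.

```python
def resolve(stored, reader_schema):
    """Build a record for reader_schema from stored data, filling defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in stored:
            resolved[name] = stored[name]
        elif "default" in field:
            resolved[name] = field["default"]  # field absent in old records
        else:
            raise KeyError("field %r is missing and has no default" % name)
    return resolved

# Illustrative reader schema: 'batchId' was added later, with a default.
webpage_schema = {
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "batchId", "type": ["null", "string"], "default": None},
    ]
}

old_record = {"url": "http://example.org/"}  # written before 'batchId' existed
page = resolve(old_record, webpage_schema)
```

Here the old record reads back with batchId set to None instead of failing, which is the behaviour a tool reading mixed old and new records would need.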

So this thread is opened to discussion surrounding what we can/must do to
improve this.
Should we store the Schema along with the data?
Should we store a Hash of the Schema along with the data?
Should we support Schema versioning?
Should we support Schema fingerprinting?

Of course this is something for the 0.5-SNAPSHOT development drive but it
is something which we need to sort out as time goes on.

Ta
Lewis

[0] http://www.mail-archive.com/user%40avro.apache.org/msg02748.html

--
*Lewis*
