Re: Schema evolution in Gora

Talat Uyarer Tue, 22 Jul 2014 09:44:06 -0700

Hi Folks,

WDYT? We should solve this problem to get stable serialization and
deserialization. If we decide on a solution, I can work on it. I have
time.


Talat

2014-04-10 14:48 GMT+03:00 Alparslan Avcı <[email protected]>:
> Hi folks,
>
> I also think that "schema evolution over time" is an important problem that
> we should handle. Because of this, it is really hard to extend the data
> schema on any application which uses Gora. We've experienced this in Nutch.
>
> About the proposed solutions:
>
> - "Should we store the Schema along with the data?" -> IMHO, we should store
> the schema, but we should also discuss how we store it. Talat's 'recipe'
> can be a good option for this. Moreover, I am thinking of storing each
> field schema separately instead of storing the persistent schema in one
> piece. Although storing every field schema is more complex than storing
> one big persistent schema, it will give us more extensibility and easier
> backward compatibility. And again for field schemas, we should discuss how
> to store them (serialized or not, where to store them, etc.).
>
> - "Should we store a Hash of the Schema along with the data? Should we
> support Schema versioning? Should we support Schema fingerprinting?" -> We
> may need to support schema versioning, since it may help to compare
> evolved schemas. But if we store the schema, we won't need to store the
> hash or support fingerprinting, I think.
>
>
> Alparslan
>
>
>
> On 08-04-2014 14:57, Talat Uyarer wrote:
>>
>> Hi all,
>>
>> IMHO we can store a NEW field called "recipe of persistent" with each
>> written record. The recipe field stores which field has been serialized
>> with which serializer. It is itself serialized with the string
>> serializer. Every time data is read from the store, the recipe is
>> deserialized, and the data object is generated from the recipe's
>> schema. The recipe is similar to the persistent's schema, but it has
>> some different definitions and extra information about fields. For
>> example, suppose the persistent's schema has a union field like this:
>>
>> {"name": "name", "type": ["null","string"],"default":null}
>>
>> If it is serialized by the string serializer, it is written in the
>> recipe field as:
>>
>> {"name": "name", "type": "string","default":null}
>>
>> Thus the name field can be deserialized without the persistent's schema.
>> Another benefit: if the persistent's schema is changed, we can still
>> deserialize without any extra information.
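
A minimal sketch of the recipe idea above, written in Python rather than Gora's Java only to keep it short. The names AVRO_NAME, recipe_field and build_recipe are hypothetical and are not part of Gora or Avro. The sketch assumes the recipe simply narrows each union to the branch that was actually written, as in the "name" example, and stores the result as a string:

```python
import json

# Hypothetical mapping from Python value types to Avro primitive type names.
AVRO_NAME = {
    type(None): "null",
    str: "string",
    bool: "boolean",
    int: "long",
    float: "double",
    bytes: "bytes",
}


def recipe_field(field, value):
    """Copy a field definition, narrowing a union to the branch used for value."""
    entry = dict(field)
    if isinstance(field["type"], list):
        branch = AVRO_NAME[type(value)]
        if branch not in field["type"]:
            raise ValueError(f"{branch} is not a branch of field {field['name']}")
        entry["type"] = branch
    return entry


def build_recipe(schema_fields, record):
    """Serialize the per-record recipe as a JSON string."""
    entries = [recipe_field(f, record.get(f["name"])) for f in schema_fields]
    return json.dumps(entries)


fields = [{"name": "name", "type": ["null", "string"], "default": None}]
recipe = build_recipe(fields, {"name": "gora"})
```

Reading the record back would then only need json.loads(recipe), not the persistent's current schema.
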
>>
>> I hope that is understandable. :)
>>
>> Talat
>>
>> 2014-04-08 12:11 GMT+03:00 Henry Saputra <[email protected]>:
>>>
>>> Technically it was named after a dog, hence the logo, which just happens
>>> to match that abbreviation :)
>>>
>>> On Tuesday, April 1, 2014, Renato Marroquín Mogrovejo <
>>> [email protected]> wrote:
>>>
>>>> Hi Lewis,
>>>>
>>>> This is for sure a very interesting topic and something that GORA should
>>>> deal with.
>>>> It is funny that only now I found out that GORA actually means "Generic
>>>> Object Representation using Avro". Does this mean that we will always have
>>>> to use Avro for everything? Never mind, we can all discuss this when the
>>>> time comes.
>>>> From the little reading I did about data evolution:
>>>> - Schema along with data -> This could be done in a similar way to how we
>>>> are approaching the union fields, i.e. append an extra field to the data
>>>> with its schema, deserialize the schema, and then check whether the data
>>>> can actually satisfy the query or not. Of course this would be part of
>>>> 0.5 :)
>>>> - Hash of the Schema along with the data, Schema versioning, Schema
>>>> fingerprinting ->
>>>> This needs some way of looking up saved schemas (versions, hashes, or
>>>> schema fingerprints).
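
One way the lookup of saved schemas could work, sketched in Python with the standard library only. This is an assumption-laden illustration: fingerprint and SchemaRegistry are hypothetical names, the WebPage fields other than batchId are made up, and the hash is taken over a key-sorted JSON dump rather than the canonical form that Avro itself specifies for fingerprints.

```python
import hashlib
import json


def fingerprint(schema):
    """SHA-256 over a key-sorted, whitespace-free JSON dump of the schema."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SchemaRegistry:
    """Hypothetical in-memory lookup of saved schemas by fingerprint."""

    def __init__(self):
        self._by_fingerprint = {}

    def register(self, schema):
        fp = fingerprint(schema)
        self._by_fingerprint[fp] = schema
        return fp

    def lookup(self, fp):
        return self._by_fingerprint[fp]


registry = SchemaRegistry()
v1 = {"type": "record", "name": "WebPage",
      "fields": [{"name": "title", "type": "string"}]}
v2 = {"type": "record", "name": "WebPage",
      "fields": [{"name": "title", "type": "string"},
                 {"name": "batchId", "type": ["null", "string"], "default": None}]}
fp1 = registry.register(v1)
fp2 = registry.register(v2)

# A stored record carries only the fingerprint; the reader resolves it.
stored = {"schema_fp": fp1, "data": {"title": "Apache Gora"}}
writer_schema = registry.lookup(stored["schema_fp"])
```

The trade-off against storing the full schema with every record is size: the record carries a fixed-length hash, but reads then depend on the registry being available.
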
>>>>
>>>>
>>>> Renato M.
>>>>
>>>>
>>>> 2014-04-01 16:47 GMT+02:00 Lewis John Mcgibbney <[email protected]>:
>>>>> Hi Folks,
>>>>> I've ended up in a conversation [0] over on user@avro regarding Schema
>>>>> evolution.
>>>>> Right now our workflow is as follows
>>>>>
>>>>>   * write .avsc schema and use GoraCompiler to generate Persistent data
>>>>> beans.
>>>>>   * use the Persistent class whenever we wish to read from or write to
>>>>> the data.
>>>>>
>>>>> AFAICT, as explained in [0], this presents us with a problem. Namely
>>>>> that we have very sketchy support for Schema evolution over time.
>>>>>
>>>>> We narrowly avoided a minor situation over in Nutch when we added a
>>>>> 'batchId' Field to our WebPage Schema, as some Tools were attempting to
>>>>> read Fields which were simply not present for some records.
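
The 'batchId' case can be sketched as follows, assuming the reader fills in a declared default when an older record lacks the field, which is how Avro schema resolution treats reader fields that the writer did not have. The helper read_with_defaults and the sample records are hypothetical and written in Python for brevity; this is not Gora or Nutch code.

```python
def read_with_defaults(stored, reader_fields):
    """Build a record for the reader schema, using defaults for absent fields."""
    record = {}
    for field in reader_fields:
        name = field["name"]
        if name in stored:
            record[name] = stored[name]
        elif "default" in field:
            # Field was added after this record was written.
            record[name] = field["default"]
        else:
            raise KeyError(f"field {name!r} is missing and has no default")
    return record


reader_fields = [
    {"name": "title", "type": "string"},
    {"name": "batchId", "type": ["null", "string"], "default": None},
]
old_record = {"title": "Apache Gora"}  # written before batchId existed
new_record = {"title": "Apache Gora", "batchId": "1396363620"}

old_view = read_with_defaults(old_record, reader_fields)
new_view = read_with_defaults(new_record, reader_fields)
```

A field added without a default would still fail on old records, so the sketch raises instead of guessing a value.
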
>>>>>
>>>>> So this thread is opened for discussion surrounding what we can/must do
>>>>> to improve this.
>>>>> Should we store the Schema along with the data?
>>>>> Should we store a Hash of the Schema along with the data?
>>>>> Should we support Schema versioning?
>>>>> Should we support Schema fingerprinting?
>>>>>
>>>>> Of course this is something for the 0.5-SNAPSHOT development drive but
>>>>> it is something which we need to sort out as time goes on.
>>>>>
>>>>> Ta
>>>>> Lewis
>>>>>
>>>>> [0] http://www.mail-archive.com/user%40avro.apache.org/msg02748.html
>>>>>
>>>>> --
>>>>> *Lewis*
>>>>>
>>
>>
>



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
