Re: Schema evolution in Gora

Henry Saputra Tue, 22 Jul 2014 13:16:58 -0700

+1 to start, but at this point there is no solution yet, so it looks like
it is open for solution proposals.



- Henry

On Tue, Jul 22, 2014 at 9:43 AM, Talat Uyarer <[email protected]> wrote:
> Hi Folks,
>
> WDYT? We should solve this problem to get stable serialization and
> deserialization. If we decide on a solution, I can work on it. I have
> time.
>
> Talat
>
> 2014-04-10 14:48 GMT+03:00 Alparslan Avcı <[email protected]>:
>> Hi folks,
>>
>> I also think that "schema evolution over time" is an important problem that
>> we should handle. Because of this, it is really hard to extend the data
>> schema on any application which uses Gora. We've experienced this in Nutch.
>>
>> About the proposed solutions:
>>
>> - "Should we store the Schema along with the data?" -> IMHO, we should store
>> the schema, but we should also discuss how we store it. Talat's
>> 'recipe' can be a good option for this. Moreover, I am thinking of storing all
>> field schemas separately instead of storing the persistent schema in one piece.
>> Although storing every field schema is more complex than storing only one
>> big persistent schema, it will give us more extensibility and easier
>> backward compatibility. And again for field schemas, we should discuss how
>> to store them (serialized or not serialized? stored where? etc.).
>>
>> - "Should we store a Hash of the Schema along with the data? Should we
>> support Schema versioning? Should we support Schema fingerprinting?" -> We
>> may need to support schema versioning, since it may help to compare
>> evaluated schemas. But if we store the schema, we won't need to store the
>> hash, or support fingerprinting, I think.
>>
>>
>> Alparslan
>>
>>
>>
>> On 08-04-2014 14:57, Talat Uyarer wrote:
>>>
>>> Hi all,
>>>
>>> IMHO we can store a NEW field called "recipe of persistent" alongside
>>> each written record. The recipe field stores which field has
>>> been serialized with which serializer. The recipe itself is serialized
>>> with the string serializer. Every time data is read from the store, the
>>> recipe is deserialized, and the data object is built from the recipe's
>>> schema. The recipe is similar to the persistent's schema, but it
>>> has some different definitions and extra information about fields. For
>>> example, suppose the persistent's schema has a union field like the one below:
>>>
>>> {"name": "name", "type": ["null","string"],"default":null}
>>>
>>> If it is serialized by the string serializer, it is written in the recipe
>>> field as:
>>>
>>> {"name": "name", "type": "string","default":null}
>>>
>>> Thus the name field can be deserialized without the persistent's schema.
>>> Another benefit: if the persistent's schema is changed, we can still
>>> deserialize without any other information.
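
The recipe idea described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Gora code: the names recipe_entry, write_record and read_record are hypothetical, JSON stands in for the "string serializer", and only the null/string union from the example is handled.

```python
import json

# Field definition copied from the example above (the persistent's schema).
PERSISTENT_FIELD = {"name": "name", "type": ["null", "string"], "default": None}


def recipe_entry(field, value):
    """Narrow a union field to the concrete branch used for this value."""
    types = field["type"]
    if isinstance(types, list):
        # Assumption: only the null/string union from the example is handled.
        concrete = "null" if value is None else "string"
        if concrete not in types:
            raise ValueError(f"{concrete} is not a branch of {types}")
    else:
        concrete = types
    return {"name": field["name"], "type": concrete, "default": field.get("default")}


def write_record(fields, values):
    """Return a stored row: the values plus a string-serialized recipe."""
    recipe = [recipe_entry(f, values.get(f["name"])) for f in fields]
    return {"recipe": json.dumps(recipe), "data": dict(values)}


def read_record(row):
    """Rebuild the record from the recipe alone, without the persistent's schema."""
    recipe = json.loads(row["recipe"])
    return {e["name"]: row["data"].get(e["name"], e["default"]) for e in recipe}


row = write_record([PERSISTENT_FIELD], {"name": "gora"})
record = read_record(row)
```

Because read_record only consults the recipe stored with the row, a later change to the persistent's schema would not affect reading rows that were written earlier.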
>>>
>>> I hope this is understandable. :)
>>>
>>> Talat
>>>
>>> 2014-04-08 12:11 GMT+03:00 Henry Saputra <[email protected]>:
>>>>
>>>> Technically it was named after a dog, hence the logo, which just happens
>>>> to match that abbreviation :)
>>>>
>>>> On Tuesday, April 1, 2014, Renato Marroquín Mogrovejo <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Lewis,
>>>>>
>>>>> This is for sure very interesting and something that GORA should deal
>>>>> with.
>>>>> It is funny that only now I found out that GORA actually means "Generic
>>>>> Object Representation using Avro". Does this mean that we will always
>>>>> have to use Avro for everything? Never mind, we can all discuss this
>>>>> when the time comes.
>>>>> From the little reading I did about data evolution:
>>>>> - Schema along with data -> This could be done in a similar way to how
>>>>> we are approaching the union fields, i.e. append an extra field to the
>>>>> data with its schema, deserialize the schema, and then check whether the
>>>>> data can actually satisfy the query or not. Of course this would be part
>>>>> of 0.5 :)
>>>>> - Hash of the Schema along with the data, Schema versioning, Schema
>>>>> fingerprinting ->
>>>>> This needs some way of looking up saved schemas (versions, hashes, or
>>>>> schema fingerprints).
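
A lookup of saved schemas by fingerprint, as mentioned above, might look something like the sketch below. It is only an illustration under stated assumptions: SchemaRegistry is a hypothetical in-memory class, the field names are examples, and the fingerprint is a SHA-256 over sorted-key JSON rather than the canonical form and fingerprints that the Avro specification defines.

```python
import hashlib
import json


def fingerprint(schema):
    """Hash a canonical JSON form of the schema (sorted keys, no whitespace)."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SchemaRegistry:
    """Hypothetical in-memory table; a real one would live in the datastore."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema):
        """Store the schema under its fingerprint and return the fingerprint."""
        fp = fingerprint(schema)
        self._schemas[fp] = schema
        return fp

    def lookup(self, fp):
        """Return the schema that a stored record was written with."""
        return self._schemas[fp]


# Two illustrative versions of a schema; the second adds a field.
v1 = {"type": "record", "name": "WebPage",
      "fields": [{"name": "baseUrl", "type": "string"}]}
v2 = {"type": "record", "name": "WebPage",
      "fields": [{"name": "baseUrl", "type": "string"},
                 {"name": "batchId", "type": ["null", "string"], "default": None}]}

registry = SchemaRegistry()
fp1 = registry.register(v1)
fp2 = registry.register(v2)
```

Each record would then carry only the short fingerprint, and a reader would call lookup to recover the full writer schema.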
>>>>>
>>>>>
>>>>> Renato M.
>>>>>
>>>>>
>>>>> 2014-04-01 16:47 GMT+02:00 Lewis John Mcgibbney <[email protected]>:
>>>>>>
>>>>>> Hi Folks,
>>>>>> I've ended up in a conversation [0] over on user@avro regarding Schema
>>>>>> evolution.
>>>>>> Right now our workflow is as follows:
>>>>>>
>>>>>>   * write an .avsc schema and use GoraCompiler to generate Persistent
>>>>>>     data beans.
>>>>>>   * use the Persistent class whenever we wish to read from or write to
>>>>>>     the data.
>>>>>>
>>>>>> AFAICT, as explained in [0], this presents us with a problem. Namely
>>>>>> that we have very sketchy support for Schema evolution over time.
>>>>>>
>>>>>> We narrowly avoided a minor situation over in Nutch when we added a
>>>>>> 'batchId' Field to our WebPage Schema, as some Tools were attempting to
>>>>>> read Fields which were simply not present for some records.
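
The 'batchId' situation can be illustrated with a small sketch of default-based resolution: a record written before a field existed is read with the newer schema, and the missing field is filled from that schema's default. The resolve function and the field list are hypothetical and use plain dicts, not Gora or Avro classes; the rule itself follows the idea behind Avro's schema resolution, where a reader field missing from the written data needs a default.

```python
def resolve(record, reader_fields):
    """Fill fields the stored record lacks from the reader schema's defaults."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            # No stored value and no default: the record cannot be read safely.
            raise KeyError(f"field {name!r} is missing and has no default")
    return resolved


# Illustrative reader schema fields: 'batchId' was added later, with a default.
reader_fields = [
    {"name": "baseUrl", "type": "string"},
    {"name": "batchId", "type": ["null", "string"], "default": None},
]

old_record = {"baseUrl": "http://example.org/"}  # written before 'batchId' existed
page = resolve(old_record, reader_fields)
```

A field added without a default raises an error here instead of silently yielding a partially populated object, which is the failure mode the Tools ran into.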
>>>>>>
>>>>>> So this thread is open for discussion surrounding what we can/must do
>>>>>> to improve this.
>>>>>> Should we store the Schema along with the data?
>>>>>> Should we store a Hash of the Schema along with the data?
>>>>>> Should we support Schema versioning?
>>>>>> Should we support Schema fingerprinting?
>>>>>>
>>>>>> Of course this is something for the 0.5-SNAPSHOT development drive but
>>>>>> it is something which we need to sort out as time goes on.
>>>>>>
>>>>>> Ta
>>>>>> Lewis
>>>>>>
>>>>>> [0] http://www.mail-archive.com/user%40avro.apache.org/msg02748.html
>>>>>>
>>>>>> --
>>>>>> *Lewis*
>>>>>>
>>>
>>>
>>
>
>
>
> --
> Talat UYARER
> Website: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
