Re: Proposal: RFCs for Avro 2.x

Zoltan Farkas Wed, 29 Apr 2020 06:33:23 -0700

I am all for expanding the core types. The current logical type shortcuts in 
the IDL language will make this a bit more interesting to implement (I think 
they confuse more than they help).


Regarding ID-based field tracking, I am not sure I understand what problem it 
solves, and there might be better solutions for it.

But these discussions should happen as part of the AEP process.

—Z


> On Apr 28, 2020, at 8:50 PM, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> +1 for removing code that isn't maintained. We can still bring it back if
> anyone is interested, but I like the idea of retiring it so that users get
> a clear idea of its state (unmaintained) and so it doesn't slow down
> development (releases blocked by code rot). I support separate versioning
> and updating to semantic versioning, too!
> 
> For the 2.0 format, I think there may be some other reasons to consider it
> as well.
> 
> First, it would be great to expand the core set of types to include
> timestamps, dates, decimals, and maps with non-string keys. These are
> available through logical types, but logical types are difficult to
> configure and require deserialization and conversion instead of just
> deserialization. We could gain performance and make Avro much easier to use
> by adding to the core set of types.
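As a rough illustration of the "deserialization and conversion" point above, here is a minimal Python sketch. It assumes a field declared as a `long` with the `timestamp-millis` logical type. The helper names `decode_long` and `convert_timestamp_millis` are hypothetical and are not part of any Avro library; they only separate the two steps a reader goes through.

```python
from datetime import datetime, timedelta, timezone

# Avro schema fragment: a long annotated with a logical type.
TIMESTAMP_MILLIS_SCHEMA = {"type": "long", "logicalType": "timestamp-millis"}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_long(buf: bytes) -> int:
    """Step 1, deserialization: read a variable-length zig-zag encoded long."""
    result = 0
    shift = 0
    for byte in buf:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear means this is the last byte
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zig-zag encoding


def convert_timestamp_millis(millis: int) -> datetime:
    """Step 2, conversion: turn the raw long into a timestamp value."""
    return EPOCH + timedelta(milliseconds=millis)


# A reader of the logical type has to run both steps for every value.
raw = decode_long(b"\x80\x01")
value = convert_timestamp_millis(raw)
```

With a timestamp as a core type, the second step could in principle be folded into the decoder instead of being configured separately as a conversion.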
> 
> Second, I would like to see Avro adopt or support id-based field tracking
> in schemas. We've built this in Apache Iceberg so that schema evolution in
> Iceberg tables never has unintended side-effects. For example, dropping a
> column and adding one with the same name never mixes the dropped column's
> data with the new column's data; and it's still possible to un-delete
> columns. Another benefit of id-based schemas is that producers and
> consumers don't need to coordinate schema changes or keep old aliases. The
> name of a column is whatever the id is labelled with in the reader's schema.
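The drop-and-re-add case described above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming stored values are keyed by a numeric field id and the reader's schema maps ids to names. `project_by_id` and the sample data are hypothetical and do not reflect the Iceberg or Avro APIs.

```python
def project_by_id(row_by_id, reader_schema):
    """Resolve stored values against the reader's schema by field id, not name."""
    return {name: row_by_id.get(field_id) for field_id, name in reader_schema.items()}


# Data written when the schema had field id 1 ("user") and field id 2 ("score").
stored_row = {1: "alice", 2: 42}

# Later, field 2 was dropped and a new field, also named "score", was added
# with a fresh id 3. Field 1 was renamed to "user_name" on the reader side.
reader_schema = {1: "user_name", 3: "score"}

result = project_by_id(stored_row, reader_schema)
# The old "score" value (id 2) is not mixed into the new "score" column (id 3),
# and the rename of field 1 needs no alias because the id is unchanged.
```

Name-based resolution would have matched the new "score" column to the old data; restoring id 2 in the reader's schema would make the dropped column readable again.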
> 
> I'm not sure that even these are enough to break compatibility with v1, but
> I think it's worth a discussion.
> 
> On Tue, Apr 28, 2020 at 1:01 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> 
>> Huge +1 to recover the Avro Enhancement Proposals (AEP)
>> 
>> The experimental features Ryan mentioned definitely merit(ed) being
>> part of it, and in particular the procedure to decide when they will
>> become ‘stable’ or default, for example for fastread. Also other
>> proposals/discussions like the split release or semantic versioning
>> should be part of it.
>> 
>> About Avro 2.0.0, I think breaking binary compatibility of the format
>> is going to prove a hard sell (are named unions valuable enough
>> to break backwards compatibility?). If we can extend the binary format
>> in a compatible way, there is no reason to have 2.0.0. So I agree that
>> there is a delicate balance to strike, because strict stability
>> could also leave us ostracized.
>> 
>> What I personally would like is to make Avro as lean and efficient as
>> possible and focus mostly on the binary format and tools, probably
>> removing the less used parts (IPC/RPC/Trevni), so it is good to see
>> that other people are starting to agree on that.
>> 
>> One more radical idea I would like to try is to unify the
>> implementations a bit, probably with one robust low-level implementation
>> in a systems language (C or Rust) and bindings for all the languages that
>> rely on it. This is probably more because of my frustration at seeing
>> projects that take this approach slowly becoming the standard and
>> Apache Avro being relegated (this is already happening on the Python front).
>> 
>> In general, the critical issue with Avro is the downstream
>> consequences of our actions. Of course we will always have
>> incomplete information, but we can investigate and see if changes are
>> worth it.
>> 
>> Regards,
>> Ismaël
>> 
>> On Mon, Apr 27, 2020 at 6:51 PM Ryan Skraba <r...@skraba.com> wrote:
>>> 
>>> Hello!
>>> 
>>> You bring up some good points -- I'm glad Avro is so widely used, but
>>> it does make me nervous to see any changes that might break other
>>> projects, or change any behaviour.
>>> 
>>> Currently, we've talked about managing developer expectations with
>>> semantic versioning (especially with the necessary Jackson API cleanup
>>> that happened in 1.9.x), or versioning artifacts separately.
>>> 
>>> We also have a couple of experimental/feature flags for some behaviour
>>> changes:
>>> https://cwiki.apache.org/confluence/display/AVRO/Experimental+features+in+Avro
>>> 
>>> And there is already a page for Avro Enhancement Proposals that looks
>>> largely out of date:
>>> 
>>> https://cwiki.apache.org/confluence/display/AVRO/Avro+Enhancement+Proposals
>>> 
>>> Moving some of the extras to a separate repo brings many of the same
>>> problems as versioning artifacts separately (nobody wants to deal with
>>> a compatibility matrix).  I'm definitely not against it, but I'm not
>>> sure how it would improve the situation.
>>> 
>>> There's a fine line between being extremely stable and being
>>> paralyzed! I would be enthusiastic about any process changes that
>>> would help us encourage and adopt new features (and fixes) more
>>> quickly.
>>> 
>>> All my best, Ryan
>>> 
>>> 
>>> On Sun, Apr 26, 2020 at 11:18 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:
>>>> 
>>>> Hi Andy,
>>>> 
>>>> Thanks for reaching out. Sorry for not being so active in the community
>>>> lately.
>>>> 
>>>> Since Avro 1.8.2 there has been some activity on the repository again,
>>>> fixing things like security issues and migrating to later versions of
>>>> Java. Avro has been around for 10 years now, and I would like to keep
>>>> (some) backward compatibility to make sure that people are still going
>>>> to use it for another 10 years :) In the past, the idea was to keep the
>>>> format backward compatible; this did not extend to the Java API. So we
>>>> made some changes to the API, such as removing Jackson from the public
>>>> API and aggressively migrating from Joda Time to Java JSR-310. This
>>>> caused a lot of issues because Avro is deeply nested in a lot of
>>>> projects. For example, it is a huge task to update Avro in Hive or
>>>> Hadoop. Therefore we believe that backward compatibility is very
>>>> important.
>>>> 
>>>> And I agree that we should mainly focus on the Avro spec itself, and not
>>>> too much on File I/O, Network, etc. :) However, if we decide to break an
>>>> API, we should do it for a good reason.
>>>> 
>>>> Cheers, Fokko
>>>> 
>>>> Op wo 22 apr. 2020 om 16:09 schreef Andy Le <anhl...@gmail.com>:
>>>> 
>>>>> Hi guys,
>>>>> 
>>>>> I'm new to this vibrant open source community. My story with Avro can
>>>>> be found here [1]
>>>>> 
>>>>> While implementing the feature, I got stuck and had various discussions
>>>>> with Doug Cutting, Fokko Driesprong.... You may see them here [2]
>>>>> 
>>>>> Here are my (biased) observations about our current Avro 1.9.x:
>>>>> 
>>>>> - Some improvements can't be made due to fear of backward
>>>>> incompatibilities. For example: specifications about named Union.
>>>>> 
>>>>> - If `Apache Avro™ is a data serialization system.` then the
>>>>> repository `apache/avro` should solely focus on (de)serialization,
>>>>> right? Currently our repository contains many
>>>>> nice-to-have-but-not-critical things like: File I/O, Network I/O....
>>>>> 
>>>>> IMHO, I think:
>>>>> 
>>>>> - We should publicly gather RFCs for Avro 2.x
>>>>> 
>>>>> - We should move such nice things out of Avro 2.x (maybe to other
>>>>> dedicated repositories)
>>>>> 
>>>>> What do you think about my suggestions? Please let me know.
>>>>> 
>>>>> Thank you & be strong.
>>>>> 
>>>>> [1] My fork: https://github.com/anhldbk/avro-fork#why-this-fork
>>>>> [2] My opened issue:
>>>>> https://issues.apache.org/jira/browse/AVRO-2808?jql=reporter%3Danhldbk%20AND%20resolution%20is%20EMPTY
>>>>> 
>>>>> 
>>>>> 
>> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
