Re: Proposal: RFCs for Avro 2.x

Ryan Blue Tue, 28 Apr 2020 17:52:22 -0700

+1 for removing code that isn't maintained. We can still bring it back if
anyone is interested, but I like the idea of retiring it so that users get
a clear idea of its state (unmaintained) and so it doesn't slow down
development (releases blocked by code rot). I support separate versioning
and updating to semantic versioning, too!


For the 2.0 format, I think there may be some other reasons to consider it
as well.

First, it would be great to expand the core set of types to include
timestamps, dates, decimals, and maps with non-string keys. These are
available through logical types, but logical types are difficult to
configure and require deserialization and conversion instead of just
deserialization. We could gain performance and make Avro much easier to use
by adding to the core set of types.
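
To illustrate the extra step, here is a minimal sketch in plain Python (no Avro
library; the function names are hypothetical). It assumes the `date` logical
type from the Avro spec, which annotates an `int` holding days since the Unix
epoch, and the zigzag variable-length encoding the spec describes for ints.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)


def encode_int(n: int) -> bytes:
    """Zigzag + variable-length encoding, as the Avro spec describes for int/long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_int(buf: bytes) -> int:
    """Step 1: plain deserialization, which only yields a primitive int."""
    z = 0
    for i, byte in enumerate(buf):
        z |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            break
    return (z >> 1) ^ -(z & 1)


def to_date(days_since_epoch: int) -> datetime.date:
    """Step 2: the logical-type conversion layered on top of the primitive."""
    return EPOCH + datetime.timedelta(days=days_since_epoch)


encoded = encode_int((datetime.date(2020, 4, 28) - EPOCH).days)
raw = decode_int(encoded)   # an int, not yet a date
value = to_date(raw)        # only now a date
```

A core date type would presumably let a reader go from bytes to `value` in one
step, without a separately configured conversion.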

Second, I would like to see Avro adopt or support id-based field tracking
in schemas. We've built this in Apache Iceberg so that schema evolution in
Iceberg tables never has unintended side effects. For example, dropping a
column and adding one with the same name never mixes the dropped column's
data with the new column's data; and it's still possible to un-delete
columns. Another benefit of id-based schemas is that producers and
consumers don't need to coordinate schema changes or keep old aliases. The
name of a column is whatever the id is labelled with in the reader's schema.
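
A minimal sketch of the difference, in plain Python rather than the actual
Iceberg or Avro APIs (the `Field` class and both read functions are
hypothetical, for illustration only):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    id: int    # stable identifier, never reused
    name: str  # label, free to change


def read_by_name(record, writer_fields, reader_fields):
    """Name-based resolution: match writer and reader fields by name."""
    by_name = {f.name: record[i] for i, f in enumerate(writer_fields)}
    return {f.name: by_name.get(f.name) for f in reader_fields}


def read_by_id(record, writer_fields, reader_fields):
    """Id-based resolution: match by id, label with the reader's name."""
    by_id = {f.id: record[i] for i, f in enumerate(writer_fields)}
    return {f.name: by_id.get(f.id) for f in reader_fields}


# "score" (id 2) was dropped, then a new "score" column (id 3) was added.
old_schema = [Field(1, "user"), Field(2, "score")]
new_schema = [Field(1, "user"), Field(3, "score")]
old_record = ["alice", 42]

name_result = read_by_name(old_record, old_schema, new_schema)
id_result = read_by_id(old_record, old_schema, new_schema)

# A rename needs no alias: id 1 is labelled "username" by this reader.
renamed = read_by_id(old_record, old_schema, [Field(1, "username")])
```

Here `name_result` picks up the dropped column's value for the new `score`
column, while `id_result` leaves it empty, and `renamed` still finds the
value under its new label.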

I'm not sure that even these are enough to break compatibility with v1, but
I think it's worth a discussion.

On Tue, Apr 28, 2020 at 1:01 AM Ismaël Mejía wrote:

> Huge +1 to recover the Avro Enhancement Proposals (AEP)
>
> The experimental features Ryan mentioned definitely merit(ed) being
> part of it, in particular the procedure to decide when they will
> become ‘stable’ or default, for example for fastread. Other
> proposals/discussions like the split release or semantic versioning
> should also be part of it.
>
> About Avro 2.0.0, I think breaking binary compatibility of the format
> is going to prove to be a hard sell (are named unions valuable enough
> to break backwards compatibility?). If we can extend the binary format
> in a compatible way, there is no reason to have 2.0.0. Still, I agree
> that there is a delicate balance to strike, because strict stability
> could also leave us ostracized.
>
> What I personally would like is to make Avro as lean and efficient as
> possible and focus mostly on the binary format and tools, probably
> removing the less used parts (IPC/RPC/trevni), so it is good to see
> that other people are starting to agree on that.
>
> One more radical idea I would like to try is to unify the
> implementations a bit, probably by having one robust low-level
> implementation in a systems language (C or Rust) and bindings for all
> the languages that rely on it. This is probably more because of my
> frustration at seeing projects that take this approach slowly become
> the standard while Apache Avro is relegated (this is already happening
> on the Python front).
>
> In general, the critical issue with Avro is the downstream
> consequences of our actions, and of course we will always have
> incomplete information, but we can investigate and see if changes are
> worth it.
>
> Regards,
> Ismaël
>
> On Mon, Apr 27, 2020 at 6:51 PM Ryan Skraba wrote:
> >
> > Hello!
> >
> > You bring up some good points -- I'm glad Avro is so widely used, but
> > it does make me nervous to see any changes that might break other
> > projects, or change any behaviour.
> >
> > Currently, we've talked about managing developer expectations with
> > semantic versioning (especially with the necessary Jackson API cleanup
> > that happened in 1.9.x), or versioning artifacts separately.
> >
> > We also have a couple of experimental/feature flags for some behaviour
> > changes:
> > https://cwiki.apache.org/confluence/display/AVRO/Experimental+features+in+Avro
> >
> > And there is already a page for Avro Enhancement Proposals that looks
> > largely out of date:
> >
> > https://cwiki.apache.org/confluence/display/AVRO/Avro+Enhancement+Proposals
> >
> > Moving some of the extras to a separate repo brings many of the same
> > problems as versioning artifacts separately (nobody wants to deal with
> > a compatibility matrix).  I'm definitely not against it, but I'm not
> > sure how it would improve the situation.
> >
> > There's a fine line between being extremely stable and being
> > paralyzed! I would be enthusiastic about any process changes that
> > would help us encourage and adopt new features (and fixes) more
> > quickly.
> >
> > All my best, Ryan
> >
> >
> > > On Sun, Apr 26, 2020 at 11:18 AM Driesprong, Fokko wrote:
> > >
> > > Hi Andy,
> > >
> > > Thanks for reaching out. Sorry for not being so active in the community
> > > lately.
> > >
> > > > Since Avro 1.8.2 there has been some activity on the repository again,
> > > > fixing things like security issues and migrating to later versions of
> > > > Java. Avro has been around for 10 years now, and I would like to keep
> > > > (some) backward compatibility to make sure that people are still going
> > > > to use it for another 10 years :) In the past, the idea was to keep the
> > > > format backward compatible, but this excluded the Java API. So we made
> > > > some changes to the API, such as removing Jackson from the public API
> > > > and aggressively migrating from Joda Time to Java JSR-310. This caused
> > > > a lot of issues because Avro is deeply embedded in a lot of projects.
> > > > For example, it is a huge task to update Avro in Hive or Hadoop.
> > > > Therefore we believe that backward compatibility is very important.
> > >
> > > > And I agree that we should mainly focus on the Avro spec itself, and
> > > > not too much on File I/O and Network etc :) However, if we decide to
> > > > break an API, we should do it for a good reason.
> > >
> > > Cheers, Fokko
> > >
> > > > On Wed, Apr 22, 2020 at 16:09, Andy Le wrote:
> > >
> > > > Hi guys,
> > > >
> > > > > I'm new to this vibrant open source community. My story with Avro
> > > > > can be found here [1]
> > > > >
> > > > > While implementing the feature, I got stuck and had various
> > > > > discussions with Doug Cutting, Fokko Driesprong.... You may see here [2]
> > > >
> > > > > Here are my (biased) observations about our current Avro 1.9.x:
> > > >
> > > > - Some improvements can't be made due to fear of backward
> > > > incompatibilities. For example: specifications about named Union.
> > > >
> > > > > - If `Apache Avro™ is a data serialization system.` then the
> > > > > repository `apache/avro` should solely focus on (de)serialization,
> > > > > right? Currently our repository contains many
> > > > > nice-to-have-but-not-critical things like: File I/O, Network I/O....
> > > >
> > > > IMHO, I think:
> > > >
> > > > - We should publicly gather RFCs for Avro 2.x
> > > >
> > > > > - We should move such nice things out of Avro 2.x (maybe to other
> > > > > dedicated repositories)
> > > > >
> > > > > What do you think about my suggestions? Please let me know.
> > > >
> > > > Thank you & be strong.
> > > >
> > > > [1] My fork: https://github.com/anhldbk/avro-fork#why-this-fork
> > > > > [2] My opened issue:
> > > > > https://issues.apache.org/jira/browse/AVRO-2808?jql=reporter%3Danhldbk%20AND%20resolution%20is%20EMPTY
> > > >
> > > >
> > > >
>


-- 
Ryan Blue
Software Engineer
Netflix
