Re: [DISCUSS] Moving Variant to Parquet Details

Steve Loughran Wed, 11 Sep 2024 06:29:26 -0700

I'm thinking about some implementation issues, especially that well-known
obsession of mine: demonstrating the correctness of specifications
through machine-readable formats such as TLA+, JUnit and ScalaTest (*).



   1. The spec and at least some test suites should be closely linked, so
   that all changes to the spec can trigger some regression testing.
   2. Which lines up with requiring all subsequent work to include new
   tests.
   3. And ideally wiring up CI builds so that PRs can be fed all the way
   through downstream uses (Spark, Iceberg, ...) so it becomes immediately
   clear when something has been broken.
   4. Finally, the question of who becomes a committer on this, or at least
   promises to be an active reviewer. That's organisational, not technical.

steve

(*) Yes, I consider test suites to be non-normative formal specification
languages when these tests are directly derived from the specification.
What do people think the Hadoop FS API spec and tests are, other than Z-lang
specs and a translation of them to JUnit?


On Tue, 10 Sept 2024 at 20:22, rdb...@gmail.com <rdb...@gmail.com> wrote:

> To me, what matters the most is not really the repository, but the release
> process. Since the variant code is going to be fairly rapidly developed and
> may not have a stable API, I'd prefer to have it on a separate release
> cycle and start the versioning at 0.1.0 to avoid a misconception that the
> API is stable.
>
> Coming back to the repository decision, if we agree on separate releases
> then I think it is probably easier to use a separate repository as well.
> That way it is easier to find the code, manage dependencies (including
> thrift and protobuf), and we don't have to worry about unifying the build
> system.
>
> Ryan
>
> On Tue, Sep 10, 2024 at 8:45 AM Daniel Weeks <dwe...@apache.org> wrote:
>
> > I feel like it's reasonable to put the specification in the
> > 'parquet-format' repo and reduce the confusion that would be caused by
> > having specs split across repos.
> >
> > As for the implementations, we already know there will be multiple and
> > some will be in languages where there is no current repo in the parquet
> > project.  I agree with the proposed approach of a 'parquet-variant'
> > project where we keep all of the different language implementations.
> > There are a number of benefits including keeping implementations more
> > consistent and having a single place for reviewers/maintainers to focus
> > their attention while the initial donation/implementation progresses.
> > It's easier to split out an implementation if necessary than combine
> > them and given the relatively small size of this feature, it may never
> > be an issue.
> >
> > Another thing to consider is that a lot of projects have custom
> > implementations of a parquet read/write path and requiring that they add
> > a dependency to parquet-java or arrow-rs to get variant support, for
> > example, feels like it would just cause more fragmentation across
> > implementations as they may choose to build their own.  I feel like the
> > fastest path to general adoption is to keep the implementation separate
> > so that we can rely on reuse as much as possible.
> >
> > -Dan
> >
> >
> >
> > On Tue, Sep 10, 2024 at 2:29 AM Andrew Lamb <andrewlam...@gmail.com>
> > wrote:
> >
> > > From a Rust perspective, I think putting the spec in the parquet-format
> > > repo makes sense as it will become part of the parquet spec.
> > >
> > > In terms of what repository the rust variant implementation would live
> > in:
> > > * if there are parquet committers who plan to help implement and
> > >   maintain it, then putting it in parquet-variant could make sense
> > > * if the idea is that the existing parquet-rs maintainers would help
> > >   maintain it, putting it in the existing `arrow-rs` repo makes more
> > >   sense to me (this would likely also make initial development easier)
> > >
> > > Technically I would expect the rust implementation to be its own
> > > "crate" (equivalent of a library) that is released separately, that the
> > > parquet crate depended on but not the other way around.
> > >
> > > Hope that helps,
> > > Andrew
> > >
> > > On Tue, Sep 10, 2024 at 12:33 AM Gene Pang <gene.p...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > The Spark community has agreed
> > > > <https://lists.apache.org/thread/pkybo148j6qyn2wsjnmyrhqs3crn9b89>
> > > > to move the Variant specification and implementation to the Parquet
> > > > project.
> > > >
> > > > However, there are several details we need to figure out with the
> > > > move to Parquet. I have started a document with some of the topics
> > > > and details we need to finalize.
> > > >
> > > > https://docs.google.com/document/d/1guEzBQjzOEEZvvibeZjNraKmZHWtxQR95O_DvtZU0xw/edit?usp=sharing
> > > >
> > > > Please take a look at the document and leave comments, questions and
> > > > feedback to help reach a conclusion.
> > > >
> > > > Thanks,
> > > > Gene
> > > >
> > >
> >
>
