Re: [DISCUSS] [Rust] Move Rust components to new repos and process

Micah Kornfield Sun, 11 Apr 2021 20:21:12 -0700

> Continuing to grow the community is important, so
> I would recommend that the most active contributors compensate for
> this with higher level discussions (e-mails / google documents) that
> get circulated on the mailing list, or GitHub issues. I would be
> concerned about using GitHub issues exclusively for higher level
> discussions, since you may not reach everyone you want to reach.



+1 to this. I don't think this is only a Rust issue (though it might be
challenging to stay disciplined with so much of the workflow moving to
GitHub). We could probably all be better about making sure we discuss
issues that impact the project on the mailing list (this includes bringing
off-list discussions back to the ML from places like Zulipchat, Slack and
sync calls).

This isn't a prerequisite, but I'd ask whether members of the Rust community
are willing to do a retrospective in a few months (e.g. Is there more
engagement? What went well? What went less well? Were there unexpected
pain points?). This is a very interesting experiment and hopefully
something we can all learn from.

-Micah

On Sun, Apr 11, 2021 at 4:00 PM Wes McKinney <wesmck...@gmail.com> wrote:

> I'm supportive of this if it addresses most of the issues that folks
> in the Rust community have been having.
>
> A handful of thoughts
>
> * Small nit: the DataFusion repository probably needs to be called
> apache/arrow-datafusion, so that all repos related to the Arrow TLP
> have a common prefix. Example:
> https://github.com/apache/calcite-avatica.
>
> * It's too bad that developing multiple interdependent crates out of a
> monorepo is not easier / more favored. In other languages (e.g. C++,
> Python), change management and refactoring are so much easier (I recall
> how painful things were for us when Parquet C++ was a separate
> repository). But I accept the way things are.
>
> * It'll be interesting to see how easy/hard it is to build
> cross-language projects that involve Rust in the future. For example,
> if we wanted to build a pyarrow add-on that depended on Rust-core or
> DataFusion, I suppose we would build against a pinned version of the
> dependencies. If changes are needed in the dependencies, we may need
> to create some tools to help validate changes that require PRs into
> multiple repositories.
>
> * The new absence of a centralized issue tracker and changelog for
> Rust means that it may be more difficult for new people entering the
> community to inspect all of the information about what the Rust
> community is doing. Continuing to grow the community is important, so
> I would recommend that the most active contributors compensate for
> this with higher level discussions (e-mails / google documents) that
> get circulated on the mailing list, or GitHub issues. I would be
> concerned about using GitHub issues exclusively for higher level
> discussions, since you may not reach everyone you want to reach.
>
> * AFAIK, Rust could always have conducted its own independent releases
> if there were a willing volunteer to do the release management. This
> new structure will force someone new to volunteer (nearly all releases
> in recent years have been done by Krisztian and Kou), otherwise you
> won't have releases. If there is a belief that independent releases
> were "forbidden" by the PMC in the past, that is not true to my
> knowledge but I apologize for any contributions to this
> misunderstanding. Keep in mind that I assisted the JavaScript
> developers in doing independent JS releases in the past until
> development slowed down and they didn't want to make independent
> releases anymore.
>
> * Regarding version numbers: as long as we clearly document
> the Format version corresponding to each Library version, I think it
> is OK if Rust's version numbers diverge from apache/arrow. As we
> improve our integration testing, we should try to make it easier to
> probe / assess a library's level of Format compatibility as new
> features are added.
>
> In any case, after voting to accept these changes, please let the rest
> of us (the non-Rust-centric folks) know how we can assist you.
>
> Thanks,
> Wes
>
> On Sat, Apr 10, 2021 at 11:29 PM Jorge Cardoso Leitão
> <jorgecarlei...@gmail.com> wrote:
> >
> > Hi Jacob,
> >
> > My understanding is that our integration tests can be roughly summarized
> > as follows:
> >
> > * Every Implementation compiles 4 binaries:
> >      * a consumer (validate)
> >      * a producer (json_to_file)
> >      * a flight server
> >      * a flight client
> >
> > These binaries are used via a CLI.
> >
> > * The producer reads a json file into memory and writes it in the IPC
> > format.
> > * The consumer receives a tuple (arrow file path, json path), reads the
> > arrow file and verifies that it equals (under a non-spec'd but accepted
> > definition of equality) the json file.
> > * Flight is equivalent, but under the flight protocol
> >
> > Archery <https://github.com/apache/arrow/tree/master/dev/archery>, a
> > Python package encapsulating many of the development workflows we have,
> > has metadata associated with each of these binaries, for each
> > implementation, including where the implementation installs that binary
> > and how to call it. For example, this is the metadata
> > <https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/tester_go.py>
> > for Go.
> >
> > We then have a runner
> > <https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/runner.py>
> > that picks every implementation hard-coded in it and runs everyone
> > against everyone
> > <https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/runner.py#L79>.
> > E.g.
> > {
> >     ((c++, producer), (go, consumer)),
> >     ((go, producer), (c++, consumer)),
> >     ((rust, producer), (c++, consumer)),
> >     ((c++, producer), (rust, consumer)),
> >     ...
> > }
> >
> > The tests consist roughly of "producer reads json and writes arrow file;
> > consumer reads arrow file and compares it against the original json".
> > This proves that whatever a producer writes, the consumer reads what was
> > originally intended to be produced / transmitted. AFAIK, we run this
> > against each of the golden files in the git submodule
> > testing/arrow-testing.
> >
> > More experienced folks on this topic may offer a better overview of this
> > than me. ^_^
> >
> > The line I wrote would correspond to: the CI of a PR in rust compiles
> > Rust integration binaries based on that PR, compiles the binaries of
> > every other implementation from apache/arrow master (or picks them from
> > the latest cached docker image?), runs every combination that includes
> > Rust, and exits zero iff all checks are good.
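
The CI step described above can be sketched roughly as follows. This is a minimal illustration, not Archery's actual code: the implementation list, the function names `pairs_involving` and `run_checks`, and the `check` callback are all hypothetical, and including the (rust, rust) self-pair is an assumption.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple

# Hypothetical implementation names, for illustration only.
IMPLEMENTATIONS = ["c++", "go", "java", "rust"]

Pair = Tuple[str, str]  # (producer, consumer)


def pairs_involving(target: str, implementations: Iterable[str]) -> List[Pair]:
    """Return every ordered (producer, consumer) pair that includes `target`.

    Assumption: self-pairs such as (rust, rust) are part of the matrix.
    """
    return [
        (producer, consumer)
        for producer, consumer in product(list(implementations), repeat=2)
        if target in (producer, consumer)
    ]


def run_checks(pairs: Iterable[Pair], check: Callable[[str, str], bool]) -> int:
    """Run `check` on each pair; return 0 iff every pair passes, else 1."""
    failures = [pair for pair in pairs if not check(*pair)]
    for producer, consumer in failures:
        print(f"FAILED: {producer} producer -> {consumer} consumer")
    return 0 if not failures else 1


if __name__ == "__main__":
    rust_pairs = pairs_involving("rust", IMPLEMENTATIONS)
    # Stand-in check that always passes; a real check would invoke the
    # producer and consumer binaries through their CLIs.
    raise SystemExit(run_checks(rust_pairs, lambda producer, consumer: True))
```

Restricting the matrix to pairs that include Rust keeps the Rust repository's CI from re-testing combinations (such as c++ against go) that a Rust PR cannot affect.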
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Sat, Apr 10, 2021 at 4:42 PM Jacob Quinn <quinn.jac...@gmail.com>
> > wrote:
> >
> > > Jorge,
> > >
> > > > * in rust, run integration tests against the latest apache/master on
> > > > every PR
> > > >
> > >
> > > I've started to familiarize myself with the archery integration
> > > framework over the last few days. Could you clarify for the "archery
> > > novices" what exactly ^ this line would mean? Does apache/master refer
> > > to the C++ implementation as the "reference implementation", so rust
> > > would test against/integrate with it? Or is it the arrow JSON format
> > > that needs to be consumed into valid arrow in-memory, then produce the
> > > same arrow JSON from in-memory arrow (this seems to be the extent of
> > > the go integration tests at least)?
> > >
> > > Sorry if this is easily answerable from knowing archery better, but I'm
> > > still in the learning/discovery phase of how exactly all the
> > > integration tests are set up/run.
> > >
> > > -Jacob
> > >
> > >
> > > On Sat, Apr 10, 2021 at 1:03 AM Jorge Cardoso Leitão <
> > > jorgecarlei...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Wrt integration tests, I agree that it is important to have a plan
> > > > prior to this.
> > > >
> > > > What we have been doing in apache/arrow:
> > > >
> > > > 1. only release if integration tests pass against each other
> > > > 2. release the signed tar with the latest of every implementation
> > > > (i.e. master)
> > > >
> > > > My suggestion for independent versioning:
> > > >
> > > > CI:
> > > >
> > > > * in rust, run integration tests against the latest apache/master on
> > > > every PR
> > > > * in apache/arrow, run integration tests against the latest released
> > > > rust version
> > > >
> > > > Release mechanism:
> > > >
> > > > 1. an arrow crate can only be released if it passes integration tests
> > > > against the current latest apache/arrow master
> > > > 2. apache/arrow master can release if its integration tests pass
> > > > against the latest released rust crate
> > > >
> > > > The common scenario is that the integration tests in apache/arrow
> > > > against Rust pass, and thus apache/arrow would just need to bundle
> > > > the latest rust release.
> > > >
> > > > If tests in apache/arrow fail, then some change in apache/arrow
> > > > caused our latest release to stop integrating (since we
> > > > integration-tested that version against master prior to our release).
> > > > This implies that a current Rust release is out of spec and we thus
> > > > must release a patch asap to correct for this (just like we would
> > > > need to push a commit to apache/arrow asap).
> > > > Once that patch is released, apache/arrow becomes green again and
> > > > apache/arrow can bundle it in the signed apache arrow release.
> > > >
> > > > In the unlikely event that the latest release is unable to pass
> > > > integration tests *and* despite the best efforts Rust is unable to
> > > > release a patch in time, we *may* still bundle a previous release of
> > > > the Rust crate, thereby not blocking the whole release (i.e. this
> > > > allows us to fall back to a previous release without a mass revert on
> > > > the apache/arrow repo).
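
The bundling rule above (prefer the latest Rust release, fall back to an earlier one if needed) can be sketched as a small selection function. This is only an illustration under stated assumptions: the name `pick_rust_release`, the `passes_integration` callback, and the version strings are hypothetical, and requiring the fallback release to pass integration tests itself is an assumption the thread does not spell out.

```python
from typing import Callable, Optional, Sequence


def pick_rust_release(
    releases_newest_first: Sequence[str],
    passes_integration: Callable[[str], bool],
) -> Optional[str]:
    """Return the newest Rust release that passes integration tests.

    Returns None when no listed release passes, in which case the
    apache/arrow release would have nothing suitable to bundle.
    """
    for version in releases_newest_first:
        if passes_integration(version):
            return version
    return None


if __name__ == "__main__":
    # Hypothetical version numbers; pretend only the newest one is broken.
    releases = ["4.1.0", "4.0.1", "4.0.0"]
    print(pick_rust_release(releases, lambda version: version != "4.1.0"))
```

In the common scenario the first candidate passes and the loop returns immediately, so the fallback path only matters when a patch release cannot be cut in time.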
> > > >
> > > > > * If Rust runs against the latest nightly of Arrow then how will
> > > > > Rust release without a new Arrow release?
> > > >
> > > > Not sure if this answers, but Rust does not compile or link against
> > > > any implementation, so there are no ABI contracts. Its "only" contract
> > > > is the spec (in-memory, IPC, flight, C data interface, etc).
> > > >
> > > > A related point is that when we release a Rust version, we can upload
> > > > "integration test artifacts" separately (the same binaries that we
> > > > currently use in our integration tests or a docker image with them),
> > > > that apache/arrow can use to run integration tests.
> > > > This would allow our CI at apache/arrow to download these artifacts
> > > > and run tests as usual via archery and CLI, without having to compile
> > > > them. This would alleviate some of the challenges around integration
> > > > testing whereby every implementation is currently built on every run
> > > > and in sequence.
> > > >
> > > > If someone thinks that it is useful, I would be happy to open a JIRA
> > > > on this and draft a google docs to work out a technical design.
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > >
> > > > On Sat, Apr 10, 2021 at 1:57 AM Weston Pace <weston.p...@gmail.com>
> > > > wrote:
> > > >
> > > > > > I'm assuming the idea is that the existing integration tests will
> > > > > > remain in apache/arrow. Will you also run the integration test
> > > > > > suites on your rust repository CI checks?
> > > > >
> > > > > Furthermore, against what version will these tests run?
> > > > >
> > > > > * If Arrow runs against the latest release of Rust then it will lag
> > > > > behind and issues may be detected later.
> > > > > * If Arrow runs against the latest nightly of Rust then things will
> > > > > get tricky at release time (all Arrow integration tests pass but
> > > > > Rust isn't ready to cut a new release and Arrow tests fail against
> > > > > the latest released Rust).
> > > > >
> > > > > Assuming Rust is also running integration tests against Arrow
> > > > > (probably a good idea) you get a similar problem (this one might be
> > > > > trickier given the relative frequencies)...
> > > > >
> > > > > * If Rust runs against the latest release of Arrow then it will lag
> > > > > behind (several months).  There will be a "catching up" period
> > > > > after Arrow releases.
> > > > > * If Rust runs against the latest nightly of Arrow then how will
> > > > > Rust release without a new Arrow release?
> > > > >
> > > > > Note, these problems technically exist now with the concept that
> > > > > any language can release a patch at any time.  Also, since Rust
> > > > > isn't directly compiling against other Arrow libs and we are only
> > > > > talking about interoperability it's probably not going to be too
> > > > > big of a deal.  Still, worth giving some thought ahead of time.
> > > > >
> > > > > On Fri, Apr 9, 2021 at 1:11 PM Micah Kornfield
> > > > > <emkornfi...@gmail.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > With this explanation do you still have a concern? There is no
> > > > > > > suggestion of making releases that depend on GitHub hashes.
> > > > > >
> > > > > > No, I don't think so.  IIUC you are saying the crates dependency
> > > > > > does not imply the crate artifacts are published elsewhere.  This
> > > > > > sounds in line with policies to me.  For some reason I thought the
> > > > > > notion of crates implied publishing to Rust's package management
> > > > > > system.
> > > > > >
> > > > > > On Fri, Apr 9, 2021 at 4:07 PM Andy Grove <andygrov...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Micah,
> > > > > > >
> > > > > > > During development, the Rust crates have local dependencies on
> > > > > > > each other based on relative file system paths. At release time,
> > > > > > > we change these to versioned dependencies before publishing,
> > > > > > > because it isn't possible to publish a crate that depends on
> > > > > > > non-published crates.
> > > > > > >
> > > > > > > With the code in separate repositories, we would still need an
> > > > > > > equivalent mechanism for DataFusion to use the Arrow code that is
> > > > > > > under development but we would point to a GitHub hash rather than
> > > > > > > a relative path. We should still update to use versioned
> > > > > > > dependencies when releasing.
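
A release-time check for this rule could look like the sketch below. It assumes the `[dependencies]` table of a Cargo manifest has already been parsed into a Python dict (a plain string is a version requirement, a table may carry `path`, `git`, `rev` or `version` keys). The function name, the sample manifests and the commit hash are hypothetical and not taken from the Arrow repositories.

```python
from typing import Dict, List, Union

# A dependency is either a version string or a table of keys.
DependencySpec = Union[str, Dict[str, str]]


def unversioned_dependencies(dependencies: Dict[str, DependencySpec]) -> List[str]:
    """List dependencies given only by `path` or `git`, with no `version`."""
    blocked = []
    for name, spec in dependencies.items():
        if isinstance(spec, str):
            continue  # A bare string such as "1.0" is already a version requirement.
        if ("path" in spec or "git" in spec) and "version" not in spec:
            blocked.append(name)
    return sorted(blocked)


if __name__ == "__main__":
    # Hypothetical development manifest: DataFusion pins Arrow to a commit.
    development = {
        "arrow": {"git": "https://github.com/apache/arrow-rs", "rev": "0123abc"},
        "serde": "1.0",
    }
    # Hypothetical release manifest: the same dependency, now versioned.
    release = {"arrow": {"version": "4.0"}, "serde": "1.0"}

    print(unversioned_dependencies(development))  # dependencies blocking a release
    print(unversioned_dependencies(release))      # nothing left to fix
```

Running such a check before publishing would catch a commit-hash dependency that was not switched back to a versioned one.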
> > > > > > >
> > > > > > > I will revise the text in the document to better explain what
> > > > > > > this means.
> > > > > > >
> > > > > > > With this explanation do you still have a concern? There is no
> > > > > > > suggestion of making releases that depend on GitHub hashes.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Andy.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Apr 9, 2021 at 4:57 PM Micah Kornfield
> > > > > > > <emkornfi...@gmail.com> wrote:
> > > > > > >
> > > > > > >> >
> > > > > > >> > " Crates can depend on GitHub commit hashes between
> releases"
> > > > > > >>
> > > > > > >>
> > > > > > >> This sounds  like it might not align with ASF release policies
> > > [1].
> > > > > > >>
> > > > > > >> [1]
> > > > >
> https://www.apache.org/legal/release-policy.html#release-definition
> > > > > > >>
> > > > > > > >> On Fri, Apr 9, 2021 at 1:34 PM Neal Richardson
> > > > > > > >> <neal.p.richard...@gmail.com> wrote:
> > > > > > >>
> > > > > > > >> > Thanks, Andy. Two areas of concern I think we should have some
> > > > > > > >> > answer for before going forward with this (and I offer no
> > > > > > > >> > opinions as to what the "right" answers are, just raising them
> > > > > > > >> > for discussion):
> > > > > > > >> >
> > > > > > > >> > 1. Integration testing: what is our workflow for ensuring that
> > > > > > > >> > our implementations are integration tested, and what do we do
> > > > > > > >> > when changes (whether in apache/arrow or in apache/arrow-rs)
> > > > > > > >> > introduce regressions/failures? I'm assuming the idea is that
> > > > > > > >> > the existing integration tests will remain in apache/arrow.
> > > > > > > >> > Will you also run the integration test suites on your rust
> > > > > > > >> > repository CI checks?
> > > > > > > >> > 2. Versioning: one rationale for our current policy of
> > > > > > > >> > "everyone releases together" is that you don't have to guess as
> > > > > > > >> > much whether (for example) Arrow Java 3.0 and Arrow Rust 3.0
> > > > > > > >> > are compatible and using the same format. It's kind of a
> > > > > > > >> > heuristic for what library versions were integration tested
> > > > > > > >> > with each other. It sounds like (but maybe I misunderstand)
> > > > > > > >> > y'all are looking to break from that. But if Arrow C++ goes to
> > > > > > > >> > version 7.0 by the end of the year and arrow-rs chooses to go
> > > > > > > >> > to 15.4, or 3.12, or whatever, does that create confusion or
> > > > > > > >> > doubt that works against the Arrow goal of easy
> > > > > > > >> > interoperability?
> > > > > > >> >
> > > > > > >> > Neal
> > > > > > >> >
> > > > > > > >> > On Fri, Apr 9, 2021 at 8:18 AM Andy Grove
> > > > > > > >> > <andygrov...@gmail.com> wrote:
> > > > > > >> >
> > > > > > > >> > > Following on from the email thread "Rust sync meeting" I would
> > > > > > > >> > > like to start a new discussion about moving the Rust components
> > > > > > > >> > > out to new GitHub repositories and using a new process for
> > > > > > > >> > > issues and release management.
> > > > > > > >> > >
> > > > > > > >> > > I have started a Google document [1] with details and to track
> > > > > > > >> > > the work required for this effort but I will summarize the key
> > > > > > > >> > > points of the proposal here:
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >    -
> > > > > > >> > >
> > > > > > >> > >    Move existing Rust code into two new repositories
> > > > > > >> > >    -
> > > > > > >> > >
> > > > > > >> > >       apache/arrow-rs
> > > > > > >> > >       -
> > > > > > >> > >
> > > > > > >> > >          Arrow + Parquet crates
> > > > > > >> > >          -
> > > > > > >> > >
> > > > > > >> > >       apache/datafusion
> > > > > > >> > >       -
> > > > > > >> > >
> > > > > > >> > >          DataFusion + Ballista crates (which are expected
> to
> > > > > merge to
> > > > > > >> > some
> > > > > > >> > >          degree over time)
> > > > > > >> > >          -
> > > > > > >> > >
> > > > > > >> > >          TPC-H benchmarks
> > > > > > >> > >          -
> > > > > > >> > >
> > > > > > >> > >       Use GitHub issues for issue tracking
> > > > > > >> > >       -
> > > > > > >> > >
> > > > > > >> > >    Decouple release process
> > > > > > >> > >    -
> > > > > > >> > >
> > > > > > >> > >       Crates are released individually
> > > > > > >> > >       -
> > > > > > >> > >
> > > > > > >> > >       A vote on the source release of the released crate
> is
> > > held
> > > > > over
> > > > > > >> the
> > > > > > >> > >       mailing list as usual.
> > > > > > >> > >       -
> > > > > > >> > >
> > > > > > >> > >       Rust does not need to release a new version when the
> > > rest
> > > > of
> > > > > > >> Arrow
> > > > > > >> > >       releases; we bundle our latest released crates to
> the
> > > > signed
> > > > > > >> tar.
> > > > > > >> > >       -
> > > > > > >> > >
> > > > > > >> > >       Crates can depend on GitHub commit hashes between
> > > releases
> > > > > > >> > >
> > > > > > >> > >
> > > > > > > >> > > The Google document may be the best place to collaborate on
> > > > > > > >> > > the proposal but I can update the document based on any
> > > > > > > >> > > comments in this email thread as well.
> > > > > > > >> > >
> > > > > > > >> > > Note that I have excluded discussion about arrow2/parquet2
> > > > > > > >> > > from this proposal and I believe we should discuss that
> > > > > > > >> > > separately as a follow-on discussion.
> > > > > > > >> > >
> > > > > > > >> > > I look forward to hearing opinions on this both from current
> > > > > > > >> > > Rust maintainers and contributors and also from the wider
> > > > > > > >> > > Arrow community.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > >
> > > > > > >> > > Andy.
> > > > > > >> > >
> > > > > > >> > > [1]
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > >
> > > >
> > >
> https://docs.google.com/document/d/1TyrUP8_UWXqk97a8Hvb1d0UYWigch0HAephIjW7soSI/edit?usp=sharing
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > >
> > > >
> > >
>
