Re: [DISCUSS] Donation of a Spark native engine based on DataFusion & Arrow

Jacques Nadeau Thu, 18 Jan 2024 17:24:10 -0800

Yes, that was roughly what I was requesting (I was suggesting a single PR
with many commits that would be merged with the history).


It's hard to provide a more concrete opinion on this without seeing the
quantity and complexity of the code. If it's 5,000 lines of code, it
probably doesn't matter. If it's 500,000, it's probably pretty important.
Whether 10 active Arrow/DataFusion committers are already substantial
contributors to the code also makes a difference, versus only a fairly
disjunct collection of people who are relatively inactive Arrow community
members.

Don't take this as lack of excitement! The potential for contribution is
awesome and exciting!

Part of making the contribution successful is making it as approachable as
possible to the rest of the community. I just want to find every way
possible that we can do that.

Looking forward to seeing the code.

On Wed, Jan 17, 2024 at 10:13 AM Chao Sun <sunc...@apache.org> wrote:

> Hi Jacques,
>
> Do you mean instead of a single PR, we modify (e.g., git commit amend)
> all the commits that we have internally to remove any sensitive
> information, and open PRs for them against the above repo?
>
> I understand this will help readability and maintenance of the code,
> but it will be a lot of work (we have ~1000 commits) and much more
> difficult to pass our legal review (our company has pretty strict
> policies in open source and all the commits need to be checked before
> they can go outside). In addition, we already carefully added plenty
> of comments in the codebase for things that require non-trivial
> efforts to understand.
>
> Given that all of our team members will be actively maintaining and
> contributing to this project (since it's being widely used internally
> already), we'd be happy to help further improve readability &
> maintainability of the codebase and resolve issues raised by the
> community. Will this work for you? We'd really appreciate it if you
> understand our situation.
>
> Thanks,
> Chao
>
> On Wed, Jan 17, 2024 at 11:30 AM Jacques Nadeau <jacq...@apache.org>
> wrote:
> >
> > Thanks for the quick response Chao.
> >
> > My experience on these things is that maintaining commit history for
> > large codebases can be invaluable for tracking down issues. (Hey, why
> > is this code written this way-- oh, it was part of x patch that was
> > trying to achieve y).
> >
> > In the past, I've used git commit replay type tools and filtering of
> > commit messages, subdirectories, etc. to get something prepped for
> > external consumption. My experience is that spending a few days now to
> > do this kind of thing saves far more days in the future (and leads to
> > higher quality).
> >
> > On Wed, Jan 17, 2024 at 9:18 AM Chao Sun <sunc...@apache.org> wrote:
> >
> > > Hi Andy and Jacques,
> > >
> > > Thanks for setting the repo up. Yes we are working on cleaning up the
> > > internal repo and preparing to open a PR in the next few days.
> > >
> > > It's a bit difficult to retain the original commit history in the PR
> > > though since some of them contain internal info which we need to
> > > remove upon open sourcing. How about we just add a summary in the PR
> > > itself, and add everyone that has contributed to it as co-author to
> > > the PR?
> > >
> > > Chao
> > >
> > > On Wed, Jan 17, 2024 at 11:09 AM Jacques Nadeau <jacq...@apache.org>
> > > wrote:
> > > >
> > > > Hey Chao, it would be great for you to share the code some place with
> > > > commit history. (PR to the repo that Andy made or something else.)
> > > >
> > > > On Mon, Jan 15, 2024 at 7:38 AM Andy Grove <andygrov...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Hi Chao,
> > > > >
> > > > > I have created https://github.com/apache/arrow-datafusion-comet and
> > > > > you should be able to create a PR against the repo.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > > >
> > > > > On Fri, Jan 12, 2024 at 3:45 PM Chao Sun <sunc...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > Thanks all for the positive support!
> > > > > >
> > > > > > Andy, we plan to name the project Comet (BTW if you have better
> > > > > > suggestions please let us know). Could you help to create a repo
> > > > > > named arrow-datafusion-comet or arrow-comet? We'll clean up our
> > > > > > internal repo and prepare for the donation in the next few days.
> > > > > > Thanks for the help!
> > > > > >
> > > > > > Best,
> > > > > > Chao
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Jan 12, 2024 at 7:09 AM Andy Grove <andygrov...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > I think the next step here would be to create a new repo so that
> > > > > > > Chao can create a PR for the contribution, and then we can
> > > > > > > proceed to a vote.
> > > > > > >
> > > > > > > Chao - do you have a proposal for the name of the project? Given
> > > > > > > that this is being donated to Apache Arrow, the repo name will
> > > > > > > start with "arrow-". Also, given that this is more of a
> > > > > > > DataFusion sub-project, I think it would make sense to prefix
> > > > > > > the repo name with "arrow-datafusion-" and then rename to
> > > > > > > "datafusion-" once we move the DataFusion projects to the new
> > > > > > > top-level project.
> > > > > > >
> > > > > > > If the vote passes, we must complete the IP clearance process
> > > > > > > before the PR is accepted [1].
> > > > > > >
> > > > > > > [1] https://incubator.apache.org/ip-clearance/
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jan 12, 2024 at 12:36 AM Albert <zinki...@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > As Andrew Lamb mentioned, blaze-rs has similar goals. I'd
> > > > > > > > really be interested to see some comparisons once the donation
> > > > > > > > is made.
> > > > > > > > All in all, I look forward to the new native project for Spark
> > > > > > > > acceleration.
> > > > > > > >
> > > > > > > > On Thu, Jan 11, 2024 at 9:50 PM Andrew Lamb <al...@influxdata.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > I am very supportive of this donation. I know of at least one
> > > > > > > > > other DataFusion-based project, blaze-rs[1], which has the
> > > > > > > > > same design goal, and bringing this project into the ASF may
> > > > > > > > > help consolidate these efforts.
> > > > > > > > >
> > > > > > > > > As Andy said, I believe it was very valuable to have a major
> > > > > > > > > consumer project (e.g. DataFusion) to help drive the
> > > > > > > > > definition and implementation of arrow-rs. We never achieved
> > > > > > > > > the same synergy with Ballista and DataFusion but I think it
> > > > > > > > > is more likely with a more actively maintained Spark
> > > > > > > > > accelerator.
> > > > > > > > >
> > > > > > > > > I am not sure it affects this discussion, but the Gluten
> > > > > > > > > project, based on Velox, was accepted yesterday into the
> > > > > > > > > Apache Incubator[2]. While the functionality may be similar,
> > > > > > > > > the technology (Rust vs C/C++) and the communities are
> > > > > > > > > different so having both in the same (big) tent of the ASF
> > > > > > > > > doesn't seem concerning to me.
> > > > > > > > >
> > > > > > > > > Also, as Chao says, I think this new sub project would
> > > > > > > > > naturally move to a new DataFusion top level project when we
> > > > > > > > > get there (we plan a proposed resolution for the April ASF
> > > > > > > > > board meeting).
> > > > > > > > >
> > > > > > > > > Looking forward to seeing more!
> > > > > > > > > Andrew
> > > > > > > > >
> > > > > > > > > [1]: https://github.com/blaze-init/blaze
> > > > > > > > > [2]: https://lists.apache.org/thread/6lrozds10jn9gknj9rf74lqbh7j55pq6
> > > > > > > > >
> > > > > > > > > On Wed, Jan 10, 2024 at 5:10 PM Andy Grove <andygrov...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Chao,
> > > > > > > > > >
> > > > > > > > > > This sounds like a really interesting project. I am
> > > > > > > > > > interested in seeing how it compares to Spark RAPIDS (the
> > > > > > > > > > project that I work on at NVIDIA) and Intel's Gluten project
> > > > > > > > > > (that works with Velox).
> > > > > > > > > >
> > > > > > > > > > I can see the following benefits of having this project
> > > > > > > > > > under Apache Arrow governance:
> > > > > > > > > >
> > > > > > > > > > - Assuming that this is a drop-in replacement that doesn't
> > > > > > > > > > require users to change their code (as I imagine is the
> > > > > > > > > > case), then it could lead to greater adoption of DataFusion,
> > > > > > > > > > especially for more demanding use cases where processing on
> > > > > > > > > > a single node is not possible.
> > > > > > > > > > - Given that it has a deep integration with the Rust
> > > > > > > > > > implementation of Arrow as well as DataFusion, and given the
> > > > > > > > > > overlap of committers between these projects, having them
> > > > > > > > > > under the same governance and communication channels will
> > > > > > > > > > generally be more efficient than if this project is
> > > > > > > > > > separate.
> > > > > > > > > > - Hopefully this leads to more upstream contributions to
> > > > > > > > > > DataFusion, perhaps even allowing other projects such as
> > > > > > > > > > Ballista to benefit from Spark-compatible operators and
> > > > > > > > > > expressions in the future.
> > > > > > > > > > - Having another project that uses DataFusion as a
> > > > > > > > > > dependency could help with stabilizing the public APIs and
> > > > > > > > > > generally driving more innovation.
> > > > > > > > > >
> > > > > > > > > > Given these points, I would be supportive of a donation. I
> > > > > > > > > > see it as being similar to the Ballista project, which is
> > > > > > > > > > already part of Arrow (and we plan to move along with
> > > > > > > > > > DataFusion once it becomes a top-level project).
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Andy.
> > > > > > > > > >
> > > > > > > > > > On Wed, Jan 10, 2024 at 2:28 PM Chao Sun <sunc...@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > We have been working on a native execution engine for
> > > > > > > > > > > Apache Spark that is heavily based on DataFusion and
> > > > > > > > > > > Arrow. Our goal is to accelerate Spark query execution via
> > > > > > > > > > > delegating Spark's physical plan execution to DataFusion's
> > > > > > > > > > > highly modular execution framework, while still
> > > > > > > > > > > maintaining the same semantics to Spark users (i.e., no
> > > > > > > > > > > Spark behavior change from the end users' point of view).
> > > > > > > > > > > Several of us are Spark and/or Arrow committers. At the
> > > > > > > > > > > moment, the project is under active development and not
> > > > > > > > > > > yet feature complete. However, some of the existing
> > > > > > > > > > > functionalities are relatively mature and have been put in
> > > > > > > > > > > production for a while now.
> > > > > > > > > > >
> > > > > > > > > > > Given the current momentum towards accelerating Spark
> > > > > > > > > > > through native vectorized execution, we believe open
> > > > > > > > > > > sourcing this work will benefit other Spark users too. In
> > > > > > > > > > > addition, we think the project itself can also leverage
> > > > > > > > > > > the vibrant and strong community behind Arrow and
> > > > > > > > > > > DataFusion, and grow faster. Because of this, we are
> > > > > > > > > > > exploring the possibility of contributing this project to
> > > > > > > > > > > the Apache Software Foundation (ASF) under the Apache
> > > > > > > > > > > Arrow project umbrella.
> > > > > > > > > > >
> > > > > > > > > > > We'd very much like to hear your opinion on this. Thanks.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Chao
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > ~~~~~~~~~~~~~~~
> > > > > > > > no mistakes
> > > > > > > > ~~~~~~~~~~~~~~~~~~
> > > > > > > >
> > > > > >
> > > > >
> > >
>
