Re: [DISCUSS] Donation of a Spark native engine based on DataFusion & Arrow

Chao Sun Wed, 17 Jan 2024 12:13:19 -0800

Hi Jacques,

Do you mean that, instead of a single PR, we modify (e.g., with
git commit --amend) all the commits that we have internally to remove
any sensitive information, and open PRs for them against the above repo?


I understand this would help readability and maintenance of the code,
but it would be a lot of work (we have ~1000 commits) and would make it
much more difficult to pass our legal review (our company has pretty
strict policies on open source, and every commit needs to be checked
before it can go outside). In addition, we have already carefully added
plenty of comments in the codebase for things that require non-trivial
effort to understand.

Given that all of our team members will be actively maintaining and
contributing to this project (since it's already widely used
internally), we'd be happy to help further improve the readability and
maintainability of the codebase and to resolve issues raised by the
community. Will this work for you? We'd really appreciate it if you
could understand our situation.

Thanks,
Chao

On Wed, Jan 17, 2024 at 11:30 AM Jacques Nadeau <[email protected]> wrote:
>
> Thanks for the quick response Chao.
>
> My experience on these things is that maintaining commit history for large
> codebases can be invaluable for tracking down issues. (Hey, why is this
> code written this way-- oh, it was part of x patch that was trying to
> achieve y).
>
> In the past, I've used git commit replay type tools and filtering of commit
> messages, subdirectories, etc. to get something prepped for external
> consumption. My experience is that spending a few days now to do this kind
> of thing saves far more days in the future (and leads to higher quality).
>
> On Wed, Jan 17, 2024 at 9:18 AM Chao Sun <[email protected]> wrote:
>
> > Hi Andy and Jacques,
> >
> > Thanks for setting the repo up. Yes we are working on cleaning up the
> > internal repo and preparing to open a PR in the next few days.
> >
> > It's a bit difficult to retain the original commit history in the PR
> > though since some of them contain internal info which we need to
> > remove upon open sourcing. How about we just add a summary in the PR
> > itself, and add everyone that has contributed to it as co-author to
> > the PR?
> >
> > Chao
> >
> > On Wed, Jan 17, 2024 at 11:09 AM Jacques Nadeau <[email protected]>
> > wrote:
> > >
> > > Hey Chao, it would be great for you to share the code some place with
> > > commit history. (PR to the repo that Andy made or something else.)
> > >
> > > On Mon, Jan 15, 2024 at 7:38 AM Andy Grove <[email protected]>
> > wrote:
> > >
> > > > Hi Chao,
> > > >
> > > > I have created https://github.com/apache/arrow-datafusion-comet and
> > you
> > > > should be able to create a PR against the repo.
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > > > Andy.
> > > >
> > > > On Fri, Jan 12, 2024 at 3:45 PM Chao Sun <[email protected]> wrote:
> > > >
> > > > > Thanks all for the positive support!
> > > > >
> > > > > Andy, we plan to name the project Comet (BTW if you have better
> > > > > suggestions please let us know). Could you help to create a repo
> > named
> > > > > arrow-datafusion-comet or arrow-comet? We'll clean up our internal
> > > > > repo and prepare for the donation in the next few days. Thanks for
> > the
> > > > > help!
> > > > >
> > > > > Best,
> > > > > Chao
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jan 12, 2024 at 7:09 AM Andy Grove <[email protected]>
> > > > wrote:
> > > > > >
> > > > > > I think the next step here would be to create a new repo so that
> > Chao
> > > > can
> > > > > > create a PR for the contribution, and then we can proceed to a
> > vote.
> > > > > >
> > > > > > Chao - do you have a proposal for the name of the project? Given
> > that
> > > > > this
> > > > > > is being donated to Apache Arrow, the repo name will start with
> > > > "arrow-".
> > > > > > Also, given that this is more of a DataFusion sub-project, I think
> > it
> > > > > would
> > > > > > make sense to prefix the repo name with "arrow-datafusion-" and
> > then
> > > > > rename
> > > > > > to "datafusion-" once we move the DataFusion projects to the new
> > > > > top-level
> > > > > > project.
> > > > > >
> > > > > > If the vote passes, we must complete the IP clearance process
> > before
> > > > the
> > > > > PR
> > > > > > is accepted [1].
> > > > > >
> > > > > > [1] https://incubator.apache.org/ip-clearance/
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Jan 12, 2024 at 12:36 AM Albert <[email protected]>
> > wrote:
> > > > > >
> > > > > > > Like Andrew Lamb mentioned, blaze-rs has similar goals, I'd
> > really be
> > > > > > > interested to know some comparisons when the donations are made.
> > > > > > > All in all, I look forward to the new native project for spark
> > > > > > > acceleration.
> > > > > > >
> > > > > > > On Thu, Jan 11, 2024 at 9:50 PM Andrew Lamb <
> > [email protected]>
> > > > > wrote:
> > > > > > >
> > > > > > > > I am very supportive of this donation. I know of at least one
> > other
> > > > > > > > DataFusion-based project, blaze-rs[1], which has the same
> > design
> > > > > goal and
> > > > > > > > bringing this project into the ASF may help consolidate these
> > > > > efforts.
> > > > > > > >
> > > > > > > > As Andy said, I believe it was very valuable to have a major
> > > > consumer
> > > > > > > > project (e.g. DataFusion) to help drive the definition and
> > > > > implementation
> > > > > > > > > of arrow-rs. We never achieved the same synergy
> > with
> > > > > > > > Ballista and DataFusion but I think it is more likely with a
> > more
> > > > > > > actively
> > > > > > > > maintained Spark accelerator.
> > > > > > > >
> > > > > > > > I am not sure it affects this discussion, but the Gluten
> > project,
> > > > > based
> > > > > > > on
> > > > > > > > > Velox, was accepted yesterday into the Apache Incubator[2].
> > > > > While the
> > > > > > > > functionality may be similar, the technology (Rust vs C/C++)
> > and
> > > > the
> > > > > > > > communities are different so having both in the same (big)
> > tent of
> > > > > the
> > > > > > > ASF
> > > > > > > > doesn't seem concerning to me.
> > > > > > > >
> > > > > > > > Also, as Chao says, I think this new sub project would
> > naturally
> > > > > move to
> > > > > > > a
> > > > > > > > new DataFusion top level project when we get there (we plan a
> > > > > proposed
> > > > > > > > > resolution for the April ASF board meeting).
> > > > > > > >
> > > > > > > > Looking forward to seeing more!
> > > > > > > > Andrew
> > > > > > > >
> > > > > > > > [1]: https://github.com/blaze-init/blaze
> > > > > > > > [2]:
> > > > > https://lists.apache.org/thread/6lrozds10jn9gknj9rf74lqbh7j55pq6
> > > > > > > >
> > > > > > > > On Wed, Jan 10, 2024 at 5:10 PM Andy Grove <
> > [email protected]>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Chao,
> > > > > > > > >
> > > > > > > > > This sounds like a really interesting project. I am
> > interested in
> > > > > > > seeing
> > > > > > > > > how it compares to Spark RAPIDS (the project that I work on
> > at
> > > > > NVIDIA)
> > > > > > > > and
> > > > > > > > > Intel's Gluten project (that works with Velox).
> > > > > > > > >
> > > > > > > > > I can see the following benefits of having this project being
> > > > under
> > > > > > > > Apache
> > > > > > > > > Arrow governance:
> > > > > > > > >
> > > > > > > > > - Assuming that this is a drop-in replacement that doesn't
> > > > require
> > > > > > > users
> > > > > > > > to
> > > > > > > > > change their code (as I imagine is the case), then it could
> > lead
> > > > to
> > > > > > > > greater
> > > > > > > > > adoption of DataFusion, especially for more demanding use
> > cases
> > > > > where
> > > > > > > > > processing on a single node is not possible.
> > > > > > > > > - Given that it has a deep integration with the Rust
> > > > > implementation of
> > > > > > > > > Arrow as well as DataFusion, and given the overlap of
> > committers
> > > > > > > between
> > > > > > > > > these projects, having them under the same governance and
> > > > > communication
> > > > > > > > > channels will generally be more efficient than if this
> > project is
> > > > > > > > separate.
> > > > > > > > > - Hopefully this leads to more upstream contributions to
> > > > > DataFusion,
> > > > > > > > > perhaps even allowing other projects such as Ballista to
> > benefit
> > > > > from
> > > > > > > > > Spark-compatible operators and expressions in the future.
> > > > > > > > > - Having another project that uses DataFusion as a dependency
> > > > could
> > > > > > > help
> > > > > > > > > with stabilizing the public APIs and generally driving more
> > > > > innovation.
> > > > > > > > >
> > > > > > > > > Given these points, I would be supportive of a donation. I
> > see it
> > > > > as
> > > > > > > > being
> > > > > > > > > similar to the Ballista project, which is already part of
> > Arrow
> > > > > (and we
> > > > > > > > > plan to move along with DataFusion once it becomes a
> > top-level
> > > > > > > project).
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Andy.
> > > > > > > > >
> > > > > > > > > On Wed, Jan 10, 2024 at 2:28 PM Chao Sun <[email protected]
> > >
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > We have been working on a native execution engine for
> > Apache
> > > > > Spark
> > > > > > > > > > that is heavily based on DataFusion and Arrow. Our goal is
> > to
> > > > > > > > > > accelerate Spark query execution via delegating Spark's
> > > > physical
> > > > > plan
> > > > > > > > > > execution to DataFusion's highly modular execution
> > framework,
> > > > > while
> > > > > > > > > > still maintaining the same semantics to Spark users (i.e.,
> > no
> > > > > Spark
> > > > > > > > > > behavior change from the end users' point of view).
> > Several of
> > > > > us are
> > > > > > > > > > Spark and/or Arrow committers. At the moment, the project
> > is
> > > > > under
> > > > > > > > > > active development and not yet feature complete. However,
> > some
> > > > > of the
> > > > > > > > > > existing functionalities are relatively mature and have
> > been
> > > > put
> > > > > in
> > > > > > > > > > production for a while now.
> > > > > > > > > >
> > > > > > > > > > Given the current momentum towards accelerating Spark
> > through
> > > > > native
> > > > > > > > > > vectorized execution, we believe open sourcing this work
> > will
> > > > > benefit
> > > > > > > > > > other Spark users too. In addition, we think the project
> > itself
> > > > > can
> > > > > > > > > > also leverage the vibrant and strong community behind
> > Arrow and
> > > > > > > > > > DataFusion, and grow faster. Because of this, we are
> > exploring
> > > > > the
> > > > > > > > > > possibility of contributing this project to the Apache
> > Software
> > > > > > > > > > Foundation (ASF) under the Apache Arrow project umbrella.
> > > > > > > > > >
> > > > > > > > > > We'd very much like to hear your opinion on this. Thanks.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Chao
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > ~~~~~~~~~~~~~~~
> > > > > > > no mistakes
> > > > > > > ~~~~~~~~~~~~~~~~~~
> > > > > > >
> > > > >
> > > >
> >
