Hi Chao,

Very cool. I think this is something that a lot of people are interested in. I think the main questions I have are:

1. Would Spark itself not be a reasonable place for this work?
2. Do you anticipate this would move with DataFusion to its own top-level project [1] if that happens or stay within the Arrow project?

Thanks,
Micah

[1] https://lists.apache.org/thread/c150t1s1x0kcb3r03cjyx31kqs5oc341

On Wed, Jan 10, 2024 at 1:28 PM Chao Sun <sunc...@apache.org> wrote:

> Hi all,
>
> We have been working on a native execution engine for Apache Spark
> that is heavily based on DataFusion and Arrow. Our goal is to
> accelerate Spark query execution via delegating Spark's physical plan
> execution to DataFusion's highly modular execution framework, while
> still maintaining the same semantics to Spark users (i.e., no Spark
> behavior change from the end users' point of view). Several of us are
> Spark and/or Arrow committers. At the moment, the project is under
> active development and not yet feature complete. However, some of the
> existing functionalities are relatively mature and have been put in
> production for a while now.
>
> Given the current momentum towards accelerating Spark through native
> vectorized execution, we believe open sourcing this work will benefit
> other Spark users too. In addition, we think the project itself can
> also leverage the vibrant and strong community behind Arrow and
> DataFusion, and grow faster. Because of this, we are exploring the
> possibility of contributing this project to the Apache Software
> Foundation (ASF) under the Apache Arrow project umbrella.
>
> We'd very much like to hear your opinion on this. Thanks.
>
> Best,
> Chao