Hi all,

We have been working on a native execution engine for Apache Spark that is heavily based on DataFusion and Arrow. Our goal is to accelerate Spark query execution by delegating Spark's physical plan execution to DataFusion's highly modular execution framework, while maintaining the same semantics for Spark users (i.e., no change in Spark behavior from the end users' point of view). Several of us are Spark and/or Arrow committers.

At the moment, the project is under active development and not yet feature complete. However, some of the existing functionality is relatively mature and has been running in production for a while now.

Given the current momentum towards accelerating Spark through native vectorized execution, we believe open sourcing this work will benefit other Spark users too. In addition, we think the project itself can also leverage the vibrant and strong community behind Arrow and DataFusion, and grow faster. Because of this, we are exploring the possibility of contributing this project to the Apache Software Foundation (ASF) under the Apache Arrow project umbrella.

We'd very much like to hear your opinion on this.

Thanks.

Best,
Chao