I wanted to give a quick update to add some context to the work I am doing to add parallel query execution to DataFusion since I have been working on this largely in isolation.
The current query execution code in DataFusion 0.14 is single-threaded and can only run against a single CSV or Parquet file and this prevents much real-world adoption. I have been working on ARROW-5227 for a while now, to implement new query execution code designed from the ground up to support partitions and multiple threads, so that queries can utilize cores for better performance. This work involved introducing a physical query plan that can be created from the logical query plan. The physical query plan is based on traits and is therefore extensible by other projects, so additional operators and expressions can be added without requiring a new release of Arrow. Once this new code supports the same level of functionality as the existing code, I plan on deprecating or removing the previous code that executed logical plans directly, obviously with a discussion on the mailing list first. Once the following PRs are merged, it will be possible to run projection, selection, and aggregate queries across partitioned data sources, using multiple cores. https://issues.apache.org/jira/browse/ARROW-6089 https://issues.apache.org/jira/browse/ARROW-6090 https://issues.apache.org/jira/browse/ARROW-6563 It would be nice to get these into the 0.15 release as a preview but I realize this is a bit last minute. Please let me know if you have any questions on any of this. Thanks, Andy.