I wanted to give a quick update to add some context to the work I am doing
to add parallel query execution to DataFusion since I have been working on
this largely in isolation.

The current query execution code in DataFusion 0.14 is single-threaded and
can only run against a single CSV or Parquet file, which limits real-world
adoption.

I have been working on ARROW-5227 for a while now, implementing new query
execution code designed from the ground up to support partitions and
multiple threads, so that queries can use multiple cores for better
performance.
This work involved introducing a physical query plan that can be created
from the logical query plan. The physical query plan is based on traits and
is therefore extensible by other projects, so additional operators and
expressions can be added without requiring a new release of Arrow.
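
To make the trait-based design a bit more concrete, here is a minimal sketch
of the idea. All of the names below (ExecutionPlan, Partition, MemoryScan,
Selection, parallel_sum, is_even) are illustrative assumptions for this
email and are not necessarily the names or signatures used in the actual
DataFusion code. Rows are reduced to plain i64 values rather than Arrow
record batches to keep the example short.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical trait: one partition of an operator's output.
/// Send + Sync so a partition can be executed on another thread.
trait Partition: Send + Sync {
    /// Produce all rows of this partition (plain i64 values in this sketch).
    fn execute(&self) -> Vec<i64>;
}

/// Hypothetical trait: a physical operator that exposes its output as
/// a list of independent partitions.
trait ExecutionPlan {
    fn partitions(&self) -> Vec<Arc<dyn Partition>>;
}

/// Toy in-memory data source with one partition per chunk.
struct MemoryScan {
    chunks: Vec<Vec<i64>>,
}

struct MemoryPartition {
    rows: Vec<i64>,
}

impl Partition for MemoryPartition {
    fn execute(&self) -> Vec<i64> {
        self.rows.clone()
    }
}

impl ExecutionPlan for MemoryScan {
    fn partitions(&self) -> Vec<Arc<dyn Partition>> {
        self.chunks
            .iter()
            .map(|c| Arc::new(MemoryPartition { rows: c.clone() }) as Arc<dyn Partition>)
            .collect()
    }
}

/// A selection (filter) operator. Because it only depends on the traits,
/// an operator like this could live in a separate crate.
struct Selection {
    input: Arc<dyn ExecutionPlan>,
    predicate: fn(i64) -> bool,
}

struct SelectionPartition {
    input: Arc<dyn Partition>,
    predicate: fn(i64) -> bool,
}

impl Partition for SelectionPartition {
    fn execute(&self) -> Vec<i64> {
        self.input
            .execute()
            .into_iter()
            .filter(|v| (self.predicate)(*v))
            .collect()
    }
}

impl ExecutionPlan for Selection {
    fn partitions(&self) -> Vec<Arc<dyn Partition>> {
        // Wrap each input partition; the partition count is preserved.
        self.input
            .partitions()
            .into_iter()
            .map(|p| {
                Arc::new(SelectionPartition { input: p, predicate: self.predicate })
                    as Arc<dyn Partition>
            })
            .collect()
    }
}

/// Example predicate used to build a Selection.
fn is_even(v: i64) -> bool {
    v % 2 == 0
}

/// Run every partition on its own thread and combine the partial sums.
fn parallel_sum(plan: &dyn ExecutionPlan) -> i64 {
    let handles: Vec<_> = plan
        .partitions()
        .into_iter()
        .map(|p| thread::spawn(move || p.execute().iter().sum::<i64>()))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("partition thread panicked"))
        .sum()
}
```

In this sketch, wrapping a MemoryScan in a Selection with is_even and
passing it to parallel_sum filters each chunk on its own thread and then
adds up the per-partition results. Spawning one thread per partition is
only the simplest possible scheduling choice; a real implementation may
well use a thread pool instead.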

Once this new code supports the same level of functionality as the existing
code, I plan to deprecate or remove the previous code that executes logical
plans directly, after discussing it on the mailing list first.

Once the following PRs are merged, it will be possible to run projection,
selection, and aggregate queries across partitioned data sources, using
multiple cores.

https://issues.apache.org/jira/browse/ARROW-6089

https://issues.apache.org/jira/browse/ARROW-6090

https://issues.apache.org/jira/browse/ARROW-6563

It would be nice to get these into the 0.15 release as a preview, but I
realize this is a bit last minute.

Please let me know if you have any questions on any of this.


Thanks,

Andy.
