[Python] Python based Query Engine for Arrow

Tom Scheffers Fri, 12 Feb 2021 08:41:04 -0800

Dear devs,

I am very interested in an in-memory query interface for Arrow tables
(like DataFusion is for Rust), preferably in Python. In my opinion, there
are three routes:

1. create a wrapper/interface to DataFusion directly;
2. copy the Arrow data to pandas and use an existing framework (like Ibis);
3. build or extend something new based on pyarrow (with small conversions
   back and forth to numpy or pandas).


The Arrow / DataFusion route currently lacks some capabilities, such as
reading Parquet files directly from S3 and predicate pushdown. Therefore,
I would rather wait for it to mature. Besides, the C++ implementation of
Arrow seems to be more mature and integrates nicely with Python.

The pandas route is probably more convenient; however, it will be much less
efficient. Columnar storage, predicate pushdown and statistics-based
optimizations are the main reasons for using Arrow, and they will not be
fully utilized in this route.
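
To make the point about statistics concrete, here is a minimal pure-Python
sketch of the idea behind predicate pushdown. It is only an illustration
under my own assumptions, not how Arrow or DataFusion implement it: `Chunk`,
`make_chunk` and `scan_greater_than` are hypothetical names, and a chunk
stands in for something like a Parquet row group with min/max statistics.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """Hypothetical stand-in for a row group: one integer column plus its statistics."""
    values: List[int]
    min_value: int
    max_value: int


def make_chunk(values: List[int]) -> Chunk:
    # Statistics are computed once, when the chunk is written.
    return Chunk(values, min(values), max(values))


def scan_greater_than(chunks: List[Chunk], threshold: int) -> Tuple[List[int], int]:
    """Return the values greater than threshold and the number of chunks read."""
    matches: List[int] = []
    chunks_read = 0
    for chunk in chunks:
        # Pushdown step: the statistics prove that no value in this chunk
        # can match, so the chunk is skipped without touching its values.
        if chunk.max_value <= threshold:
            continue
        chunks_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, chunks_read


chunks = [make_chunk([1, 2, 3]), make_chunk([4, 5, 6]), make_chunk([7, 8, 9])]
result, chunks_read = scan_greater_than(chunks, 6)
print(result, chunks_read)
```

Copying everything to pandas first would materialize all three chunks before
filtering, whereas here only the last one is read.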

Is there already something like DataFusion on the roadmap for C++ (and thus
Python)? Or is there an Ibis-like engine that acts directly on pyarrow? I
would like to help with work in this direction, but I am struggling to find
where to start.

Thanks for your help.

Kind regards,

Tom
