Re: [Python] Python based Query Engine for Arrow

Micah Kornfield Fri, 12 Feb 2021 08:45:38 -0800

Welcome Tom,

> Is there already something like DataFusion on the roadmap for C++ (and thus
> Python)?



Yes, it is [1], and the components are being developed.  In terms of
contributions, others might have a better idea, but I think the two big
pieces of functionality missing from a kernel/operator perspective are
aggregates and joins.

There is also the work of tying together datasets with kernels and
materializing the output.
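
To make "tying together datasets with kernels" a bit more concrete, here is a
minimal sketch in plain Python. It assumes lists of dicts as stand-ins for
Arrow record batches; the names Batch, scan, greater_than, take_mask,
sum_column and run_query are hypothetical and are not Arrow or pyarrow APIs.
It only illustrates the shape of the problem: scan batches, apply a filter
kernel, feed the result into an aggregate kernel, and materialize one output.

```python
# Illustrative only: plain-Python stand-ins, not the Arrow C++ or pyarrow API.
from typing import Dict, Iterable, Iterator, List

Batch = Dict[str, List[int]]  # hypothetical "record batch": column name -> values


def scan(batches: Iterable[Batch]) -> Iterator[Batch]:
    """Dataset scan: yield one batch at a time instead of loading everything."""
    for batch in batches:
        yield batch


def greater_than(column: List[int], threshold: int) -> List[bool]:
    """Filter kernel: compute a boolean mask over a single column."""
    return [value > threshold for value in column]


def take_mask(batch: Batch, mask: List[bool]) -> Batch:
    """Apply the same boolean mask to every column of a batch."""
    return {
        name: [v for v, keep in zip(values, mask) if keep]
        for name, values in batch.items()
    }


def sum_column(batches: Iterable[Batch], name: str) -> int:
    """Aggregate kernel: fold a running sum of one column across batches."""
    total = 0
    for batch in batches:
        total += sum(batch[name])
    return total


def run_query(batches: Iterable[Batch], threshold: int) -> int:
    """Roughly: SELECT sum(amount) WHERE amount > threshold, batch by batch."""
    filtered = (
        take_mask(b, greater_than(b["amount"], threshold)) for b in scan(batches)
    )
    return sum_column(filtered, "amount")


dataset = [
    {"id": [1, 2, 3], "amount": [10, 25, 40]},
    {"id": [4, 5], "amount": [5, 30]},
]
result = run_query(dataset, threshold=20)
```

Because the filtered batches are produced lazily by a generator, only the
final sum is materialized; a real engine would make the same kind of choice
about where to stream and where to buffer.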

[1]
https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit

On Fri, Feb 12, 2021 at 8:41 AM Tom Scheffers <[email protected]>
wrote:

> Dear devs,
>
> I am really interested in an in-memory query interface to Arrow tables
> (like DataFusion is for Rust), preferably in Python. In my opinion, there
> are three routes: 1. create a wrapper/interface to DataFusion directly, 2.
> copy Arrow to pandas and use an existing framework (like Ibis) and 3.
> build/extend something new based on pyarrow (with small conversions back
> and forth to numpy or pandas).
>
> The Arrow / DataFusion route currently lacks some capabilities, such as
> reading Parquet files directly from S3 and predicate pushdown.
> Therefore, I would rather wait for things to mature. Besides, the C++
> branch of Arrow seems to be more mature and integrates nicely with Python.
>
> The pandas route is probably more convenient; however, it will be much less
> efficient. Columnar storage, predicate pushdown and statistics-based
> optimizations are the main reasons for using Arrow, and they will not be
> fully utilized in this route.
>
> Is there already something like DataFusion on the roadmap for C++ (and thus
> Python)? Or is there an Ibis-like engine which acts directly on pyarrow? I
> would like to help advance work in this direction, but I am struggling to
> find where to start.
>
> Thanks for your help.
>
> Kind regards,
>
> Tom
>
