I'm actively building an engineering team to work on this -- so anyone
who would like to work on this as part of their day job can reach out
to me to discuss. We are doing some research about what aspects of
prior art in columnar database systems we can pull into the Arrow C++
project (to make sure we are building something that is close to
state-of-the-art) and will post some design docs about various topics
as we collect our thoughts. The threading improvement proposal that
Weston posted the other day is closely related to this.

On Fri, Feb 12, 2021 at 11:09 AM Jorge Cardoso Leitão
<[email protected]> wrote:
>
> Hi, Tom,
>
> This does not address the question directly, but for what it's worth, I had
> the same issue and thus released a Python binding for DataFusion
> <https://pypi.org/project/datafusion/>. It lets you, e.g., create a pyarrow
> RecordBatch by reading from S3 (via pyarrow) and use it as a source for a
> DataFusion plan via the SQL or DataFrame API. Because it uses the C data
> interface, there is virtually no cost in moving data between DataFusion
> and pyarrow. It supports UDFs and UDAFs on native pyarrow arrays,
> which means that there is no performance hit when a UDF uses a
> pyarrow/C++ kernel. Performance degrades when you need to map the
> pyarrow array to some other format (e.g. numpy), typically to push it to
> sklearn, scipy, etc.
>
> `pip install datafusion`, but fyi this is *not* production ready and many
> of the pyarrow types are not supported yet. :)
>
> Best,
> Jorge
>
>
>
>
> On Fri, Feb 12, 2021 at 5:41 PM Tom Scheffers <[email protected]>
> wrote:
>
> > Dear devs,
> >
> > I am really interested in an in-memory query interface to Arrow tables
> > (like DataFusion is for Rust), preferably in Python. In my opinion, there
> > are three routes: 1. create a wrapper/interface to DataFusion directly, 2.
> > copy Arrow to pandas and use an existing framework (like Ibis) and 3.
> > build/extend something new based on pyarrow (with small conversions back
> > and forth to numpy or pandas).
> >
> > The Arrow / DataFusion route currently lacks some capabilities, like
> > reading Parquet files directly from S3 and predicate pushdown.
> > Therefore, I would rather wait for things to mature. Besides, the C++
> > branch of Arrow seems to be more mature and integrates nicely with Python.
> >
> > The pandas route is probably more convenient; however, it will be much less
> > efficient. Columnar storage, predicate pushdown, and statistics-based
> > optimizations are the main reasons for using Arrow, and they will not be
> > fully utilized in this route.
> >
> > Is there already something like DataFusion on the roadmap for C++ (and thus
> > Python)? Or is there an Ibis-like engine that acts directly on pyarrow? I
> > would like to help advance work in this direction, but I am struggling to
> > find where to start.
> >
> > Thanks for your help.
> >
> > Kind regards,
> >
> > Tom
> >
