Welcome Tom,

> Is there already something like DataFusion on the roadmap for C++ (and thus
> Python)?

Yes, it is [1], and the components are being developed. In terms of
contributions, others might have a better idea, but I think the two big
pieces of functionality missing from a kernel/operator perspective are
aggregates and joins. There is also the work of tying together datasets
with kernels and materializing the output.

[1] https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit

On Fri, Feb 12, 2021 at 8:41 AM Tom Scheffers <[email protected]> wrote:

> Dear devs,
>
> I am really interested in an in-memory query interface to Arrow tables
> (like DataFusion is for Rust), preferably in Python. In my opinion, there
> are three routes: 1. create a wrapper/interface to DataFusion directly, 2.
> copy Arrow to pandas and use an existing framework (like Ibis) and 3.
> build/extend something new based on pyarrow (with small conversions back
> and forth to numpy or pandas).
>
> The Arrow / DataFusion route currently lacks some capabilities, like
> parquet files directly from S3, but also the push down of predicates.
> Therefore, I would rather wait for things to mature. Besides, the C++
> branch of Arrow seems to be more mature and integrates nicely with Python.
>
> The pandas route is probably more convenient, however it will be much less
> efficient. Columnar storage, predicate push downs and statistics
> optimizations are the main reason for using Arrow, which will not be fully
> utilized in this route.
>
> Is there already something like DataFusion on the roadmap for C++ (and thus
> Python)? Or is there an Ibis like engine which acts directly on Pyarrow? I
> would like to help on advancements into this direction, but struggle in
> finding where to start.
>
> Thanks for your help.
>
> Kind regards,
>
> Tom
>
