Dear devs,

I am really interested in an in-memory query interface to Arrow tables (like DataFusion is for Rust), preferably in Python. As I see it, there are three routes: 1. create a wrapper/interface to DataFusion directly, 2. copy Arrow data to pandas and use an existing framework (like Ibis), and 3. build or extend something new based on pyarrow (with small conversions back and forth to numpy or pandas).

The Arrow / DataFusion route currently lacks some capabilities, such as reading Parquet files directly from S3 and predicate pushdown, so I would rather wait for it to mature. Besides, the C++ implementation of Arrow seems to be more mature and integrates nicely with Python. The pandas route is probably more convenient, but it will be much less efficient: columnar storage, predicate pushdown and statistics-based optimizations are the main reasons for using Arrow, and they would not be fully utilized on this route.

Is there already something like DataFusion on the roadmap for C++ (and thus Python)? Or is there an Ibis-like engine that acts directly on pyarrow? I would like to help advance things in this direction, but I am struggling to find where to start.

Thanks for your help.

Kind regards,
Tom
