Hi,

Over the past few weeks I have been running an experiment whose main goal is to run a query in (Rust's) DataFusion and use Python on it, so that we can embed Python's ecosystem in the query (à la pyspark) (details here <https://github.com/jorgecarleitao/datafusion-python>).
I am super happy to say that with the code in PR 8401 <https://github.com/apache/arrow/pull/8401>, I was able to achieve its main goal: a DataFusion query that uses a Python UDF, using the C data interface to perform zero-copy transfers between C++/Python and Rust.

The Python UDFs' signatures are dead simple: f(pa.Array, ...) -> pa.Array. This is a pure Arrow implementation (no Python dependencies other than pyarrow) and achieves near-optimal execution performance under the constraints, with the full flexibility of Python and its ecosystem to perform non-trivial transformations. Folks can of course convert pyarrow Arrays to other Python formats within the UDF (e.g. Pandas or numpy), including for interpolation, ML, CUDA, etc.

Finally, the coolest thing about this is that a single execution passes through most of the Rust code base (DataFusion, Arrow, Parquet), but also through a lot of the Python, C, and C++ code bases. IMO this reaffirms the success of this project, in which the different parts stick to the contract (spec, C data interface) and enable stuff like this.

So, big kudos to all of you!

Best,
Jorge