Hi,

Over the past few weeks I have been running an experiment whose main goal
is to run a query in (Rust's) DataFusion and call into Python from it, so
that we can use Python's ecosystem inside the query (a la pyspark) (details
here <https://github.com/jorgecarleitao/datafusion-python>).

I am super happy to say that with the code on PR 8401
<https://github.com/apache/arrow/pull/8401>, I was able to achieve its main
goal: a DataFusion query that uses a Python UDF, relying on the C data
interface to share data between C++/Python and Rust with zero copies. The
Python UDF's signature is dead simple: f(pa.Array, ...) -> pa.Array.

This is a pure Arrow implementation (no Python dependencies other than
pyarrow) and achieves near-optimal execution performance under the
constraints, with the full flexibility of Python and its ecosystem to
perform non-trivial transformations. Folks can of course convert pyarrow
Arrays to other Python formats within the UDF (e.g. Pandas or numpy), for
example for interpolation, ML, CUDA, etc.

Finally, the coolest thing about this is that a single execution passes
through most of the Rust code base (DataFusion, Arrow, Parquet), but also
through a lot of the Python, C and C++ code bases. IMO this reaffirms the
success of this project, in which the different parts stick to the contract
(spec, C data interface) and enable stuff like this. So, big kudos to all
of you!

Best,
Jorge
