There are two examples: one in DataFusion [1] and one in Python [2].

In DataFusion, the performance is the same because the UDF is compiled as Rust. It can even be compiled with SIMD intrinsics.

In Python, it depends on what is used inside the UDF:

* If only pyarrow.compute functions are used, then it should be the same, because datafusion -> pyarrow is just sharing pointers. There may be some degradation due to the GIL, but I would say that this is an implementation detail particular to Python.
* If something other than pyarrow is used, then there are some performance considerations, as we need to perform arrow -> [favourite in-memory format, e.g. pandas or NumPy] -> arrow.

IMO the important aspect here is that even in pure Rust, the moment we iterate over values using e.g. `Option<f64>`, we lose some of the benefits of the arrow format. In this context, the penalty is not so much about which programming language is used, but about whether the kernels are written to leverage the arrow spec. If they are not, there is a serialization/deserialization penalty due to the in-memory roundtrip "arrow -> other in-memory format -> kernel -> arrow".

fwiw, this is why I am of the opinion that "arrow" is something more than just the in-memory format: the kernels need to be written in a specific way for maximum performance, and thus I can see it as part of the Arrow mission to maintain a curated set of kernels that leverage the format, so that folks do not have to know all the spec details to benefit from it.

Best,
Jorge

[1] https://github.com/apache/arrow-datafusion/blob/master/datafusion-examples/examples/simple_udf.rs
[2] https://github.com/apache/arrow-datafusion/tree/master/python#udfs

On Wed, May 19, 2021 at 5:49 PM Arun Sharma <a...@sharma-home.net> wrote:

> On Tue, May 18, 2021 at 11:58 PM Antoine Pitrou <anto...@python.org> wrote:
>
> > On 19/05/2021 at 03:28, Arun Sharma wrote:
> > > Say we're talking arrow + datafusion (which is written in Rust).
> > > It sounded like your goal is to ensure that users of different
> > > language ecosystems get the same performance and feature set as
> > > rust. Let me know if I misunderstood.
> >
> > For the record, are you aware of https://pypi.org/project/datafusion/ ?
>
> Thank you for the link. I knew it was possible, but I was not aware of
> this specific package.
>
> The UDF/UDAF examples in that page seem relevant to what I'm discussing.
> Is there any data on how these perform relative to writing the same code
> in rust?
>
> Perhaps all of this is a non-problem and I could be looking for a nail to
> use my new shiny hammer on. That's also good to know :)
>
> -Arun