There are two examples: one in DataFusion [1] and one in
Python [2].

In DataFusion, the performance is the same because the UDF is compiled as
Rust. It can even be compiled with SIMD intrinsics.

In Python, it depends on what is used inside the UDF:

* If only pyarrow.compute functions are used, then it should be the same;
this is because datafusion -> pyarrow is just sharing pointers. There may
be some degradation due to the GIL, but I would say that this is an
implementation detail particular to Python.
* If not pyarrow, then there are some performance considerations as we need
to perform arrow -> [favourite in-memory format, like Pandas or NumPy] -> arrow.

IMO the important aspect here is that even in pure Rust, the moment we
iterate over values using e.g. `Option<f64>`, we lose some benefits of
the arrow format. In this context, imo the penalty is not so much about
which programming language is used, but whether the kernels are written to
leverage the arrow spec or not. If not, there is a
serialization/deserialization penalty due to the in-memory roundtrip
"arrow -> other in-memory format -> kernel -> arrow".
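
To make the point above concrete, here is a minimal sketch in plain Rust
(standard library only, no arrow crate). `Float64Column`, `add_via_options`
and `add_columnar` are hypothetical names made up for this illustration, and
the validity mask is a `Vec<bool>` rather than the packed bitmap that the
arrow spec uses. It only contrasts the two styles of kernel; it is not how
arrow-rs or DataFusion implement anything, and it makes no claim about actual
timings.

```rust
/// Hypothetical, simplified stand-in for an arrow Float64 array:
/// a contiguous values buffer plus a validity mask.
/// (The arrow spec packs validity into a bitmap; Vec<bool> keeps this short.)
struct Float64Column {
    values: Vec<f64>,
    validity: Vec<bool>,
}

impl Float64Column {
    /// Build a column from optional values; null slots store 0.0.
    fn from_options(items: &[Option<f64>]) -> Self {
        Float64Column {
            values: items.iter().map(|v| v.unwrap_or(0.0)).collect(),
            validity: items.iter().map(|v| v.is_some()).collect(),
        }
    }

    /// Convert to the "other in-memory format": one Option<f64> per slot.
    fn to_options(&self) -> Vec<Option<f64>> {
        self.values
            .iter()
            .zip(&self.validity)
            .map(|(v, ok)| if *ok { Some(*v) } else { None })
            .collect()
    }
}

/// Roundtrip kernel: column -> Vec<Option<f64>> -> per-slot match -> column.
fn add_via_options(a: &Float64Column, b: &Float64Column) -> Float64Column {
    let out: Vec<Option<f64>> = a
        .to_options()
        .into_iter()
        .zip(b.to_options())
        .map(|(x, y)| match (x, y) {
            (Some(x), Some(y)) => Some(x + y),
            _ => None,
        })
        .collect();
    Float64Column::from_options(&out)
}

/// Format-aware kernel: one pass over the values buffers and one pass over
/// the validity masks, with no per-slot Option and no intermediate copies.
/// Assumes both columns have the same length.
fn add_columnar(a: &Float64Column, b: &Float64Column) -> Float64Column {
    Float64Column {
        values: a.values.iter().zip(&b.values).map(|(x, y)| x + y).collect(),
        validity: a
            .validity
            .iter()
            .zip(&b.validity)
            .map(|(x, y)| *x && *y)
            .collect(),
    }
}

fn main() {
    let a = Float64Column::from_options(&[Some(1.0), None, Some(3.0)]);
    let b = Float64Column::from_options(&[Some(10.0), Some(20.0), None]);
    let roundtrip = add_via_options(&a, &b);
    let columnar = add_columnar(&a, &b);
    // Both kernels agree on the logical result; only the access pattern differs.
    assert_eq!(roundtrip.to_options(), columnar.to_options());
    println!("{:?}", columnar.to_options());
}
```

Both functions return the same logical result. The second one touches only
two contiguous buffers per input, which is the kind of loop a compiler can
vectorize; the first one pays for two conversions in and one conversion out.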

fwiw, this is why I am of the opinion that "arrow" is something more than
just the in-memory format; the kernels need to be written in a specific way
for maximum performance, and thus I see it as part of the Arrow mission to
maintain a curated set of kernels that leverage the format, so that folks
do not have to know all the spec details to benefit from it.

Best,
Jorge


[1]
https://github.com/apache/arrow-datafusion/blob/master/datafusion-examples/examples/simple_udf.rs
[2] https://github.com/apache/arrow-datafusion/tree/master/python#udfs

On Wed, May 19, 2021 at 5:49 PM Arun Sharma <a...@sharma-home.net> wrote:

> On Tue, May 18, 2021 at 11:58 PM Antoine Pitrou <anto...@python.org>
> wrote:
>
> >
> >
> > Le 19/05/2021 à 03:28, Arun Sharma a écrit :
> >
> > > Say we're talking arrow + datafusion (which is written in Rust).  It
> > > sounded like your goal is to ensure that users of different language
> > > ecosystems get the same performance and feature set as rust. Let me
> > > know if I misunderstood.
> >
> > For the record, are you aware of https://pypi.org/project/datafusion/ ?
> >
>
> Thank you for the link. I knew it was possible, but I was not aware of this
> specific package.
>
> The UDF/UDAF examples in that page seem relevant to what I'm discussing. Is
> there any data on how these perform relative to writing the same code in
> rust?
>
> Perhaps all of this is a non-problem and I could be looking for a nail to
> use my new shiny hammer on. That's also good to know :)
>
>  -Arun
>
