Is this UDF implementation based on DataFusion? If so, it makes sense for it to be part of the DataFusion project.

OTOH, if it can work with any data in the Arrow format, then it would sound weird to maintain it in the DataFusion repo IMHO.

Regards

Antoine.


On 28/06/2024 at 21:52, Andrew Lamb wrote:
To be clear, if the Arrow community thinks this would be better organized / administered in the Apache DataFusion project (especially if it is aligned with Rust), I think it would be good to discuss donating it there.

On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb <al...@influxdata.com> wrote:

I think there are two aspects:
1. The actual mechanics of implementing functions
2. The actual library of UDF functions (e.g. sin, cos, nullif, etc.)

I agree that 2 is not something that belongs naturally in the Arrow project and is better aligned with query engines.

However I think 1 is worth considering.

As I understand it, the problem arrow_udf solves is avoiding some of the boilerplate required to write vectorized UDFs. So instead of writing a special eval_gcd function by hand like this:

```
use std::sync::Arc;
use arrow::array::{ArrayRef, AsArray};
use arrow::compute::binary;
use arrow::datatypes::Int64Type;

fn gcd(mut l: i64, mut r: i64) -> i64 {
    // do gcd calculation (Euclidean algorithm)
    while r != 0 {
        (l, r) = (r, l % r);
    }
    l
}

// implement the vectorized version by hand
fn eval_gcd(left: &ArrayRef, right: &ArrayRef) -> ArrayRef {
    let left = left.as_primitive::<Int64Type>();
    let right = right.as_primitive::<Int64Type>();
    // apply the scalar function element-wise via the `binary` kernel
    let res = binary::<_, _, _, Int64Type>(left, right, |l, r| gcd(l, r)).unwrap();
    Arc::new(res)
}
```

The user simply annotates the scalar function and has the library code-gen the array version:
```
#[function("gcd(int64, int64) -> int64", output = "eval_gcd")]
fn gcd(l: i64, r: i64) -> i64 {
  // do gcd calculation
}
```

We have a lot of boilerplate / non-ideal macro stuff in DataFusion that I think this would help with a lot.

Andrew


On Fri, Jun 28, 2024 at 3:08 PM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

I wonder if the DataFusion project might be a more natural home for this
functionality? UDFs are more of a query engine concept, whereas arrow-rs is
more focused on purely physical execution?

On 28 June 2024 19:41:39 BST, Runji Wang <wangrunji0...@163.com> wrote:
Hi Felipe,

Vectorization will be applied whenever possible. When all input and
output types of a function are primitive (int16, int32, int64, float32,
float64) and do not involve any Option or Result, the macro will
automatically generate code based on unary <
https://docs.rs/arrow/latest/arrow/compute/fn.unary.html> or binary <
https://docs.rs/arrow/latest/arrow/compute/fn.binary.html> kernels,
which potentially allows for vectorization.

Neither of the examples you showed is vectorized: `div` because of its Result output, and `gcd` because of the loop in its implementation. However, if the function is simple enough, like an `add` function:

#[function("add(int, int) -> int")]
fn add(a: i32, b: i32) -> i32 {
    a + b
}

it can be auto-vectorized by LLVM.
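
For illustration, here is a rough sketch of what the generated wrapper could look like when the macro falls back on the `binary` kernel linked above. The name `eval_add` and the error handling are assumptions for this sketch, not the actual macro output:

```
use std::sync::Arc;
use arrow::array::{ArrayRef, AsArray};
use arrow::compute::binary;
use arrow::datatypes::Int32Type;
use arrow::error::ArrowError;

// Hypothetical sketch of a generated wrapper for `add`: downcast both inputs
// to primitive arrays and apply the scalar function element-wise with the
// `binary` kernel, whose inner closure LLVM can auto-vectorize.
fn eval_add(a: &ArrayRef, b: &ArrayRef) -> Result<ArrayRef, ArrowError> {
    let a = a.as_primitive::<Int32Type>();
    let b = b.as_primitive::<Int32Type>();
    let res = binary::<_, _, _, Int32Type>(a, b, |x, y| add(x, y))?;
    Ok(Arc::new(res))
}
```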

Runji


On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote:
On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb <al...@influxdata.com> wrote:

Hi Xuanwo,

Sorry for the delay in responding. I think the ability to easily write functions that "feel" like native functions in whatever language and be able to generate arrow / vectorized versions of them is quite valuable. This is my understanding of what this proposal is about.

My understanding is that it's not vectorized. From the examples in risingwavelabs/arrow-udf <https://github.com/risingwavelabs/arrow-udf>, it looks like the macros generate code that gathers values from columns into local scalars that are passed as scalar parameters to user functions (see the sketch after the examples below). Is the hope here that rustc/llvm will auto-vectorize the code?

#[function("gcd(int, int) -> int")]
fn gcd(mut a: i32, mut b: i32) -> i32 {
     while b != 0 {
         (a, b) = (b, a % b);
     }
     a
}

#[function("div(int, int) -> int")]
fn div(x: i32, y: i32) -> Result<i32, &'static str> {
     if y == 0 {
         return Err("division by zero");
     }
     Ok(x / y)
}
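
To make the gathering concrete, here is a rough sketch of the kind of row-at-a-time code I mean. The name `eval_div`, the builder-based output, and the null-on-error behavior are assumptions for illustration, not the actual macro expansion:

```
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, AsArray, Int32Builder};
use arrow::datatypes::Int32Type;

// Hypothetical illustration of the non-vectorized path: each row is read out
// of the input columns as a scalar, passed to the user function, and the
// result (or a null, here used as a stand-in for error handling) is appended
// to a builder.
fn eval_div(x: &ArrayRef, y: &ArrayRef) -> ArrayRef {
    let x = x.as_primitive::<Int32Type>();
    let y = y.as_primitive::<Int32Type>();
    let mut builder = Int32Builder::with_capacity(x.len());
    for i in 0..x.len() {
        if x.is_null(i) || y.is_null(i) {
            builder.append_null();
            continue;
        }
        match div(x.value(i), y.value(i)) {
            Ok(v) => builder.append_value(v),
            Err(_) => builder.append_null(), // error handling is an assumption
        }
    }
    Arc::new(builder.finish())
}
```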

I left some additional comments on the markdown.

One thing that might be worth doing is to articulate some other potential locations for where the code might go. One option, as I think you propose, is to make it its own repository. Another option could be to donate the code and put the various language bindings in the same repos as the arrow language implementations (e.g. arrow-rs, arrow for python, etc.), which would likely make it easier to maintain and discover.

I am curious what other devs / users feel about this?

Andrew



On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <xu...@apache.org> wrote:

Hello, everyone.

I am starting this thread to discuss the donation of a User-Defined Function Framework for Apache Arrow.

Feel free to review and leave your comments here. For live review, please visit:

https://hackmd.io/@xuanwo/apache-arrow-udf

The original content is also pasted here for quick reading:

------

## Abstract

Arrow UDF is a User-Defined Function Framework for Apache Arrow.

## Proposal

Arrow UDF allows users to easily create and run user-defined functions (UDFs) in Rust, Python, Java or JavaScript based on Apache Arrow. The functions can be executed natively, in WebAssembly, or on a remote server via Arrow Flight.

Arrow UDF was originally designed to be used by the RisingWave project but is now being used by Databend and several database startups.

We believe that the Arrow UDF project will provide diverse value to the entire Arrow community.

## Background

Arrow UDF has been developed by an open-source community from day one and is owned by RisingWaveLabs. The project was launched in December 2023.

## Initial Goals

By transferring ownership of the project to Apache Arrow, Arrow UDF expects to ensure its neutrality and to further encourage and facilitate the adoption of Arrow UDF by the community.

## Current Status

Contributors: 5

Users:

-   [RisingWave]: A Distributed SQL Database for Stream Processing.
-   [Databend]: An open-source cloud data warehouse that serves as a cost-effective alternative to Snowflake.

## Documentation

The documentation of Arrow UDF is hosted at
https://docs.rs/arrow-udf/latest/arrow_udf/.

## Initial Source

The project currently holds a GitHub repository and multiple packages:

- https://github.com/risingwavelabs/arrow-udf

Rust:

- https://crates.io/crates/arrow-udf
- https://crates.io/crates/arrow-udf-python
- https://crates.io/crates/arrow-udf-js
- https://crates.io/crates/arrow-udf-js-deno
- https://crates.io/crates/arrow-udf-wasm

Python:

- https://pypi.org/project/arrow-udf/

Those packages will retain their names, while the repository will be moved to the apache org.

## Required Resources

### Mailing Lists

We can reuse the existing mailing lists that Arrow has.

### Git Repositories

From

- https://github.com/risingwavelabs/arrow-udf

To

- https://gitbox.apache.org/repos/asf/arrow-udf.git
- https://github.com/apache/arrow-udf

### Issue Tracking

The project would like to continue using GitHub Issues.

### Other Resources

The project has already chosen GitHub Actions as its continuous integration tool.

## Initial Committers

- Runji Wang wangrunji0...@163.com
- Giovanny Gutiérrez
- sundy-li sund...@apache.org
- Xuanwo xua...@apache.org
- Max Justus Spransy maxjus...@gmail.com

[RisingWave]: https://github.com/risingwavelabs/risingwave
[Databend]: https://github.com/datafuselabs/databend

Xuanwo




