Is this UDF implementation based on DataFusion? If so, it makes sense for it to be part of the DataFusion project.

OTOH, if it can work with any data in the Arrow format, then it would sound weird to maintain it in the DataFusion repo IMHO.

Regards

Antoine.


On 28/06/2024 at 21:52, Andrew Lamb wrote:
To be clear, if the Arrow community thinks this would be better organized / administered in the Apache DataFusion project (especially if it is aligned with Rust), I think it would be good to discuss donating it there.

On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb <al...@influxdata.com> wrote:

I think there are two aspects:
1. The actual mechanics of implementing functions
2. The actual library of UDF functions (e.g. sin, cos, nullif, etc.)

I agree that 2 is not something that belongs naturally in the Arrow project and is better aligned with query engines.

However I think 1 is worth considering.

As I understand it, the problem arrow_udf solves is avoiding some of the boilerplate required to write vectorized UDFs. So instead of writing a special eval_gcd function by hand like this:

```
use std::sync::Arc;
use arrow::array::{ArrayRef, AsArray};
use arrow::compute::binary;
use arrow::datatypes::Int64Type;

fn gcd(mut l: i64, mut r: i64) -> i64 {
    // do gcd calculation (Euclidean algorithm)
    while r != 0 {
        (l, r) = (r, l % r);
    }
    l
}

// implement the vectorized version by hand
fn eval_gcd(left: &ArrayRef, right: &ArrayRef) -> ArrayRef {
    let left = left.as_primitive::<Int64Type>();
    let right = right.as_primitive::<Int64Type>();
    // apply the scalar function element-wise via the `binary` kernel
    let res = binary::<_, _, _, Int64Type>(left, right, |l, r| gcd(l, r)).unwrap();
    Arc::new(res)
}
```

The user simply annotates the scalar function and has the library code-gen the array version:
```
#[function("gcd(int64, int64) -> int64", output = "eval_gcd")]
fn gcd(l: i64, r: i64) -> i64 {
  // do gcd calculation
}
```

We have a lot of boilerplate / non-ideal macro stuff in DataFusion that I think this would help with a lot.

Andrew


On Fri, Jun 28, 2024 at 3:08 PM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

I wonder if the DataFusion project might be a more natural home for this
functionality? UDFs are more of a query engine concept, whereas arrow-rs is
more focused on purely physical execution?

On 28 June 2024 19:41:39 BST, Runji Wang <wangrunji0...@163.com> wrote:
Hi Felipe,

Vectorization will be applied whenever possible. When all input and
output types of a function are primitive (int16, int32, int64, float32,
float64) and do not involve any Option or Result, the macro will
automatically generate code based on unary <
https://docs.rs/arrow/latest/arrow/compute/fn.unary.html> or binary <
https://docs.rs/arrow/latest/arrow/compute/fn.binary.html> kernels,
which potentially allows for vectorization.

Neither of the examples you showed is vectorized: `div` because of its Result output, and `gcd` because of the loop in its implementation. However, if the function is simple enough, like an `add` function:

#[function("add(int, int) -> int")]
fn add(a: i32, b: i32) -> i32 {
    a + b
}

it can be auto-vectorized by LLVM.
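
For illustration, here is a rough sketch of what the generated wrapper could look like when the macro falls back on the `binary` kernel linked above. The name `eval_add` and the error handling are assumptions for this sketch, not the actual macro output:

```
use std::sync::Arc;
use arrow::array::{ArrayRef, AsArray};
use arrow::compute::binary;
use arrow::datatypes::Int32Type;
use arrow::error::ArrowError;

// Hypothetical sketch of a generated wrapper for `add`: downcast both inputs
// to primitive arrays and apply the scalar function element-wise with the
// `binary` kernel, whose inner closure LLVM can auto-vectorize.
fn eval_add(a: &ArrayRef, b: &ArrayRef) -> Result<ArrayRef, ArrowError> {
    let a = a.as_primitive::<Int32Type>();
    let b = b.as_primitive::<Int32Type>();
    let res = binary::<_, _, _, Int32Type>(a, b, |x, y| add(x, y))?;
    Ok(Arc::new(res))
}
```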

Runji


On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote:
On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb <al...@influxdata.com> wrote:

Hi Xuanwo,

Sorry for the delay in responding. I think the ability to easily write functions that "feel" like native functions in whatever language and be able to generate arrow / vectorized versions of them is quite valuable. This is my understanding of what this proposal is about.

My understanding is that it's not vectorized. From the examples in risingwavelabs/arrow-udf <https://github.com/risingwavelabs/arrow-udf>, it looks like the macros generate code that gathers values from columns into local scalars that are passed as scalar parameters to user functions (see the sketch after the examples below). Is the hope here that rustc/llvm will auto-vectorize the code?

#[function("gcd(int, int) -> int")]
fn gcd(mut a: i32, mut b: i32) -> i32 {
     while b != 0 {
         (a, b) = (b, a % b);
     }
     a
}

#[function("div(int, int) -> int")]
fn div(x: i32, y: i32) -> Result<i32, &'static str> {
     if y == 0 {
         return Err("division by zero");
     }
     Ok(x / y)
}
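
To make the gathering concrete, here is a rough sketch of the kind of row-at-a-time code I mean. The name `eval_div`, the builder-based output, and the null-on-error behavior are assumptions for illustration, not the actual macro expansion:

```
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, AsArray, Int32Builder};
use arrow::datatypes::Int32Type;

// Hypothetical illustration of the non-vectorized path: each row is read out
// of the input columns as a scalar, passed to the user function, and the
// result (or a null, here used as a stand-in for error handling) is appended
// to a builder.
fn eval_div(x: &ArrayRef, y: &ArrayRef) -> ArrayRef {
    let x = x.as_primitive::<Int32Type>();
    let y = y.as_primitive::<Int32Type>();
    let mut builder = Int32Builder::with_capacity(x.len());
    for i in 0..x.len() {
        if x.is_null(i) || y.is_null(i) {
            builder.append_null();
            continue;
        }
        match div(x.value(i), y.value(i)) {
            Ok(v) => builder.append_value(v),
            Err(_) => builder.append_null(), // error handling is an assumption
        }
    }
    Arc::new(builder.finish())
}
```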

I left some additional comments on the markdown.

One thing that might be worth doing is to articulate some other potential locations for where the code might go. One option, as I think you propose, is to make it its own repository. Another option could be to donate the code and put the various language bindings in the same repos as the arrow language implementations (e.g. arrow-rs, arrow for python, etc.), which would likely make it easier to maintain and discover.

I am curious what other devs / users feel about this?

Andrew



On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <xu...@apache.org> wrote:

Hello, everyone.

I am starting this thread to discuss the donation of a User-Defined Function Framework for Apache Arrow.

Feel free to review and leave your comments here. For live review, please visit:

https://hackmd.io/@xuanwo/apache-arrow-udf

The original content is also pasted here for quick reading:

------

## Abstract

Arrow UDF is a User-Defined Function Framework for Apache Arrow.

## Proposal

Arrow UDF allows users to easily create and run user-defined functions (UDFs) in Rust, Python, Java or JavaScript based on Apache Arrow. The functions can be executed natively, in WebAssembly, or on a remote server via Arrow Flight.

Arrow UDF was originally designed to be used by the RisingWave project but is now being used by Databend and several database startups.

We believe that the Arrow UDF project will provide diverse value to the entire Arrow community.

## Background

Arrow UDF has been developed by an open-source community from day one and is owned by RisingWaveLabs. The project was launched in December 2023.

## Initial Goals

By transferring ownership of the project to Apache Arrow, Arrow UDF expects to ensure its neutrality and to further encourage and facilitate the adoption of Arrow UDF by the community.

## Current Status

Contributors: 5

Users:

-   [RisingWave]: A Distributed SQL Database for Stream Processing.
-   [Databend]: An open-source cloud data warehouse that serves as a cost-effective alternative to Snowflake.

## Documentation

The documentation of Arrow UDF is hosted at
https://docs.rs/arrow-udf/latest/arrow_udf/.

## Initial Source

The project currently holds a GitHub repository and multiple packages:

- https://github.com/risingwavelabs/arrow-udf

Rust:

- https://crates.io/crates/arrow-udf
- https://crates.io/crates/arrow-udf-python
- https://crates.io/crates/arrow-udf-js
- https://crates.io/crates/arrow-udf-js-deno
- https://crates.io/crates/arrow-udf-wasm

Python:

- https://pypi.org/project/arrow-udf/

Those packages will retain their names, while the repository will be moved to the apache org.

## Required Resources

### Mailing Lists

We can reuse the existing mailing lists that Arrow has.

### Git Repositories

From

- https://github.com/risingwavelabs/arrow-udf

To

- https://gitbox.apache.org/repos/asf/arrow-udf.git
- https://github.com/apache/arrow-udf

### Issue Tracking

The project would like to continue using GitHub Issues.

### Other Resources

The project has already chosen GitHub Actions as its continuous integration tool.

## Initial Committers

- Runji Wang wangrunji0...@163.com
- Giovanny Gutiérrez
- sundy-li sund...@apache.org
- Xuanwo xua...@apache.org
- Max Justus Spransy maxjus...@gmail.com

[RisingWave]: https://github.com/risingwavelabs/risingwave
[Databend]: https://github.com/datafuselabs/databend

Xuanwo




