In some ways, the problem of a UDF framework is larger than Arrow. UDFs need to 
give the same results, and execute efficiently, regardless of the platform 
(e.g. Arrow), hosting language, and UDF language.

At SIGMOD there was a paper from TU Berlin that addresses this problem: "Query 
Compilation Without Regrets" by Philipp Grulich and others[1]. I believe that 
they were motivated by the problem of writing efficient UDFs in Flink, but the 
approach would apply to Arrow.

UDFs are not the only problem. If we wish to support the built-in functions of 
the 5 or 6 most popular dialects of SQL (Postgres, MySQL, Oracle, …) the number 
of functions is in the thousands. Implementing those functions for just {Rust, 
C, Java} is a major software engineering effort.

Perhaps we should be looking for a framework that will solve the problems of 
not just Arrow but also DataFusion and Substrait.

Julian

[1] https://doi.org/10.1145/3654968  

> On Jun 28, 2024, at 10:13 AM, Felipe Oliveira Carvalho <felipe...@gmail.com> 
> wrote:
> 
> On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb <al...@influxdata.com> wrote:
>> 
>> Hi Xuanwo,
>> 
>> Sorry for the delay in responding. I think the ability to easily write
>> functions that "feel" like native functions in whatever language and be
>> able to generate arrow / vectorized versions of them is quite valuable.
>> This is my understanding of what this proposal is about.
> 
> My understanding is that it's not vectorized. From the examples in
> risingwavelabs/arrow-udf, <https://github.com/risingwavelabs/arrow-udf> it
> looks like the macros generate code that gathers values from columns into
> local scalars that are passed as scalar parameters to user functions. Is
> the hope here that rustc/llvm will auto-vectorize the code?
> 
> #[function("gcd(int, int) -> int")]
> fn gcd(mut a: i32, mut b: i32) -> i32 {
>    while b != 0 {
>        (a, b) = (b, a % b);
>    }
>    a
> }
> 
> #[function("div(int, int) -> int")]
> fn div(x: i32, y: i32) -> Result<i32, &'static str> {
>    if y == 0 {
>        return Err("division by zero");
>    }
>    Ok(x / y)
> }
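
To make the row-at-a-time pattern described above concrete, here is a minimal sketch of the kind of wrapper such a macro might generate. This is not the actual arrow-udf expansion: `Int32Column` and `gcd_columns` are hypothetical names, and a plain `Vec<Option<i32>>` stands in for an Arrow Int32 array so that the example has no dependencies.

```rust
// Hypothetical stand-in for an Arrow Int32 column: one Option<i32> per row,
// with None modelling a null. A real Arrow array keeps values and validity
// in separate buffers.
type Int32Column = Vec<Option<i32>>;

// The user-written scalar function, as in the example above.
fn gcd(mut a: i32, mut b: i32) -> i32 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

// Hypothetical generated wrapper (an assumption, not the real macro output):
// gather one value from each input column into local scalars, call the
// scalar function, and append the result to the output column.
fn gcd_columns(a: &[Option<i32>], b: &[Option<i32>]) -> Int32Column {
    assert_eq!(a.len(), b.len(), "input columns must have equal length");
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| match (x, y) {
            // Both inputs present: evaluate the scalar function for this row.
            (Some(x), Some(y)) => Some(gcd(*x, *y)),
            // Any null input yields a null output.
            _ => None,
        })
        .collect()
}
```

Whether a loop like this is auto-vectorized depends on the body of the scalar function. A data-dependent `while` loop such as the one in `gcd` is a much less likely candidate than straight-line arithmetic.
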
> 
>> I left some additional comments on the markdown.
>> 
>> One thing that might be worth doing is articulate some other potential
>> locations for where the code might go. One option, as I think you propose,
>> is to make its own repository.  Another option could be to donate the code
>> and put the various language bindings in the same repo as the arrow
>> language implementations (e.g. arrow-rs, arrow for python, etc.) which would
>> likely make it easier to maintain and discover.
>> 
>> I am curious about what other devs / users feel about this?
>> 
>> Andrew
>> 
>> 
>> 
>> On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <xua...@apache.org> wrote:
>> 
>>> Hello, everyone.
>>> 
>>> I am starting this thread to discuss the donation of a User-Defined
>>> Function Framework for Apache Arrow.
>>> 
>>> Feel free to review and leave your comments here. For live review,
>>> please visit:
>>> 
>>> https://hackmd.io/@xuanwo/apache-arrow-udf
>>> 
>>> The original content is also pasted here for quick reading:
>>> 
>>> ------
>>> 
>>> ## Abstract
>>> 
>>> Arrow UDF is a User-Defined Function Framework for Apache Arrow.
>>> 
>>> ## Proposal
>>> 
>>> Arrow UDF allows users to easily create and run user-defined functions
>>> (UDF) in Rust, Python, Java or JavaScript based on Apache Arrow. The
>>> functions can be executed natively, or in WebAssembly, or in a remote
>>> server via Arrow Flight.
>>> 
>>> Arrow UDF was originally designed to be used by the RisingWave project
>>> but is now being used by Databend and several database startups.
>>> 
>>> We believe that the Arrow UDF project will add value and diversity to
>>> the entire Arrow community.
>>> 
>>> ## Background
>>> 
>>> Arrow UDF has been developed by an open-source community from day one
>>> and is owned by RisingWaveLabs. The project was launched in December
>>> 2023.
>>> 
>>> ## Initial Goals
>>> 
>>> By transferring ownership of the project to Apache Arrow, Arrow UDF
>>> expects to ensure its neutrality and further encourage and facilitate
>>> the adoption of Arrow UDF by the community.
>>> 
>>> ## Current Status
>>> 
>>> Contributors: 5
>>> 
>>> Users:
>>> 
>>> -   [RisingWave]: A Distributed SQL Database for Stream Processing.
>>> -   [Databend]: An open-source cloud data warehouse that serves as a
>>> cost-effective alternative to Snowflake.
>>> 
>>> ## Documentation
>>> 
>>> The documentation for Arrow UDF is hosted at
>>> https://docs.rs/arrow-udf/latest/arrow_udf/.
>>> 
>>> ## Initial Source
>>> 
>>> The project currently holds a GitHub repository and multiple packages:
>>> 
>>> - https://github.com/risingwavelabs/arrow-udf
>>> 
>>> Rust:
>>> 
>>> - https://crates.io/arrow-udf/
>>> - https://crates.io/arrow-udf-python/
>>> - https://crates.io/arrow-udf-js/
>>> - https://crates.io/arrow-udf-js-deno/
>>> - https://crates.io/arrow-udf-wasm/
>>> 
>>> Python:
>>> 
>>> - https://pypi.org/project/arrow-udf/
>>> 
>>> Those packages will retain their names, while the repository will be
>>> moved to the apache org.
>>> 
>>> ## Required Resources
>>> 
>>> ### Mailing Lists
>>> 
>>> We can reuse the existing mailing lists that Arrow has.
>>> 
>>> ### Git Repositories
>>> 
>>> From
>>> 
>>> - https://github.com/risingwavelabs/arrow-udf
>>> 
>>> To
>>> 
>>> - https://gitbox.apache.org/asf/repos/arrow-udf
>>> - https://github.com/apache/arrow-udf
>>> 
>>> ### Issue Tracking
>>> 
>>> The project would like to continue using GitHub Issues.
>>> 
>>> ### Other Resources
>>> 
>>> The project has already chosen GitHub Actions as its continuous
>>> integration tool.
>>> 
>>> ## Initial Committers
>>> 
>>> - Runji Wang wangrunji0...@163.com
>>> - Giovanny Gutiérrez
>>> - sundy-li sund...@apache.org
>>> - Xuanwo xua...@apache.org
>>> - Max Justus Spransy maxjus...@gmail.com
>>> 
>>> [RisingWave]: https://github.com/risingwavelabs/risingwave
>>> [Databend]: https://github.com/datafuselabs/databend
>>> 
>>> Xuanwo
>>> 
