An update here is that one of the DataFusion contributors, @xinlifoobar, did a very neat prototype of using arrow-udf in DataFusion[1] and wrote up their findings[2].

The major findings are that it would be possible, though it would take some additional work (e.g. support for single values, making the function registry optional, and adding datatype support).

Andrew

[1]: https://github.com/apache/datafusion/pull/11488
[2]: https://github.com/apache/datafusion/issues/11413#issuecomment-2230502588

On Thu, Jul 4, 2024 at 3:35 PM Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:

> Hi Andrew,
>
> During the Arrow Community Meeting I asked Xuanwo many questions trying to
> clarify my understanding of what they mean by "UDF".
>
> To me and you it seems to mean "user defined compute kernels", but in the
> context of these libraries it's *also that* plus the ability to call these
> functions by making calls to Flight services. The ultimate goal being
> federated querying (my words, not theirs) from databases that can talk to
> Flight to run these UDFs running on them.
>
> IMO it's too big of a problem to solve in the context of Arrow, but I
> suggested that they modularize it more (i.e. separate the kernel-writing
> parts from the parts that make the Flight wiring) and try to incorporate
> the kernel-writing parts (like the Rust macros) into the arrow-LANG
> implementation so that their framework for writing Flight services for UDFs
> could be much smaller and specific to the strategy for connecting querying
> engines they see fit.
>
> Others suggested work on standardizing the protocols used to call these
> UDFs, but IMO that is a huge undertaking. There are simply too many aspects
> to calling UDFs across different services. Standardizing the practices
> around this will take a lot of experimenting.
>
> --
> Felipe
>
> On Wed, Jul 3, 2024 at 3:43 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> > What does everyone think about renaming this library to something like
> > `arrow-auto-vectorizer` or `arrow-functions` to emphasize its role with
> > codegen of vectorized implementations?
> >
> > In discussing this proposal internally, it took a while to explain what the
> > use case of the library is.
> >
> > From my understanding, the use case is "Automatically generate vectorized
> > kernels from scalar functions".
> >
> > While this can be used for User Defined Functions (UDFs), there are many
> > other uses too (like "built in functions" in processing engines).
> >
> > Andrew
> >
> >
> > On Mon, Jul 1, 2024 at 8:10 AM Xuanwo <xua...@apache.org> wrote:
> >
> > > I have cross-posted the proposal to the DataFusion community to collect more
> > > feedback:
> > >
> > > https://github.com/apache/datafusion/discussions/11192
> > >
> > > On Mon, Jul 1, 2024, at 19:31, Andrew Lamb wrote:
> > > > I have been thinking about this project more, and the more I think about it
> > > > the more I like it.
> > > >
> > > > As an example of the kind of leverage a library like this might bring, we
> > > > might consider changing the implementation of Arrow UDF to re-use the
> > > > underlying buffers when possible (e.g. via unary_mut[1]). This would likely
> > > > provide an across-the-board efficiency improvement at no cost to
> > > > downstream crates.
> > > >
> > > > Andrew
> > > >
> > > > [1]:
> > > > https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary_mut
> > > >
> > > > On Sat, Jun 29, 2024 at 1:47 AM Xuanwo <xua...@apache.org> wrote:
> > > >
> > > >> > That said, wherever it ends up, there should be the agreement of
> > > >> > individuals to accept maintenance of it. Since it's in rust, that would
> > > >> > generally fall to the arrow-rs contributors and/or the DataFusion
> > > >> > contributors IMO.
> > > >> >
> > > >> > It would be good for it to be part of the community, but only if it's not
> > > >> > going to end up just bitrotting somewhere.
> > > >>
> > > >> Thanks Matt. This concern does make sense.
> > > >>
> > > >> Arrow UDF is extensively used within RisingWave and Databend.
> > > >> We, the initial committers from both RisingWave and Databend, are eager
> > > >> to take responsibility for maintaining these crates.
> > > >>
> > > >> Additionally, some of us are involved in other Apache projects, so we
> > > >> understand how the Apache Way functions. We will focus on community
> > > >> growth to ensure this project remains active.
> > > >>
> > > >> On Sat, Jun 29, 2024, at 13:29, Matt Topol wrote:
> > > >> >> This UDF implementation doesn’t depend on DataFusion. It can work with
> > > >> >> any data in the arrow format.
> > > >> >
> > > >> > Given this I'm in agreement with Antoine that it would be weird for it to
> > > >> > be maintained within the DataFusion repo as opposed to its own repo (as
> > > >> > we've done in the past for things like nanoarrow and arrow-experiments).
> > > >> >
> > > >> > That said, wherever it ends up, there should be the agreement of
> > > >> > individuals to accept maintenance of it. Since it's in rust, that would
> > > >> > generally fall to the arrow-rs contributors and/or the DataFusion
> > > >> > contributors IMO.
> > > >> >
> > > >> > It would be good for it to be part of the community, but only if it's not
> > > >> > going to end up just bitrotting somewhere.
> > > >> >
> > > >> > --Matt
> > > >> >
> > > >> > On Fri, Jun 28, 2024, 8:49 PM Xuanwo <xua...@apache.org> wrote:
> > > >> >
> > > >> >> Hi,
> > > >> >>
> > > >> >> This UDF implementation doesn’t depend on DataFusion. It can work with any
> > > >> >> data in the arrow format.
> > > >> >>
> > > >> >> It has the potential to let users write ONE UDF that works across
> > > >> >> different query engines, as we have shown in Databend and RisingWave.
> > > >> >>
> > > >> >> So I personally think it should be part of the Arrow community.
> > > >> >>
> > > >> >> On Sat, Jun 29, 2024, at 05:06, Antoine Pitrou wrote:
> > > >> >> > Is this UDF implementation based on DataFusion? If so, it makes sense
> > > >> >> > for it to be part of the DataFusion project.
> > > >> >> >
> > > >> >> > OTOH, if it can work with any data in the Arrow format, then it would
> > > >> >> > sound weird to maintain it in the DataFusion repo IMHO.
> > > >> >> >
> > > >> >> > Regards
> > > >> >> >
> > > >> >> > Antoine.
> > > >> >> >
> > > >> >> >
> > > >> >> > On 28/06/2024 at 21:52, Andrew Lamb wrote:
> > > >> >> >> To be clear, if the arrow community thinks this would be better organized /
> > > >> >> >> administered in the Apache DataFusion project (especially if it is aligned
> > > >> >> >> with Rust) I think it would be good to discuss donating there.
> > > >> >> >>
> > > >> >> >> On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb <al...@influxdata.com> wrote:
> > > >> >> >>
> > > >> >> >>> I think there are two aspects:
> > > >> >> >>> 1. The actual mechanics of implementing functions
> > > >> >> >>> 2. The actual library of udf functions (e.g. sin, cos, nullif, etc)
> > > >> >> >>>
> > > >> >> >>> I agree 2 is not something that belongs naturally in the arrow project and
> > > >> >> >>> is better aligned with query engines.
> > > >> >> >>>
> > > >> >> >>> However I think 1 is worth considering.
> > > >> >> >>>
> > > >> >> >>> As I understand it, the problem arrow_udf solves is avoiding some of the
> > > >> >> >>> boilerplate required to make vectorized udfs.
> > > >> >> >>> So instead of writing a
> > > >> >> >>> special eval_gcd function like this
> > > >> >> >>>
> > > >> >> >>> ```
> > > >> >> >>> use std::sync::Arc;
> > > >> >> >>> use arrow::array::{ArrayRef, AsArray, Int64Array};
> > > >> >> >>> use arrow::compute::binary;
> > > >> >> >>> use arrow::datatypes::Int64Type;
> > > >> >> >>> use arrow::error::ArrowError;
> > > >> >> >>>
> > > >> >> >>> fn gcd(l: i64, r: i64) -> i64 {
> > > >> >> >>>   // do gcd calculation
> > > >> >> >>> }
> > > >> >> >>>
> > > >> >> >>> // implement vectorized version
> > > >> >> >>> fn eval_gcd(left: &ArrayRef, right: &ArrayRef) -> Result<ArrayRef, ArrowError> {
> > > >> >> >>>   // downcast the dynamically typed inputs to Int64 arrays
> > > >> >> >>>   let left = left.as_primitive::<Int64Type>();
> > > >> >> >>>   let right = right.as_primitive::<Int64Type>();
> > > >> >> >>>   // apply the scalar function to each pair of values
> > > >> >> >>>   let res: Int64Array = binary(left, right, |l, r| gcd(l, r))?;
> > > >> >> >>>   Ok(Arc::new(res))
> > > >> >> >>> }
> > > >> >> >>> ```
> > > >> >> >>>
> > > >> >> >>> The user simply annotates the scalar function and has the library
> > > >> >> >>> codegen the array version
> > > >> >> >>> ```
> > > >> >> >>> #[function("gcd(int64, int64) -> int64", output = "eval_gcd")]
> > > >> >> >>> fn gcd(l: i64, r: i64) -> i64 {
> > > >> >> >>>   // do gcd calculation
> > > >> >> >>> }
> > > >> >> >>> ```
> > > >> >> >>>
> > > >> >> >>> We have a lot of boilerplate / non-ideal macro stuff in DataFusion that I
> > > >> >> >>> think this would help with a lot.
> > > >> >> >>>
> > > >> >> >>> Andrew
> > > >> >> >>>
> > > >> >> >>>
> > > >> >> >>> On Fri, Jun 28, 2024 at 3:08 PM Raphael Taylor-Davies
> > > >> >> >>> <r.taylordav...@googlemail.com.invalid> wrote:
> > > >> >> >>>
> > > >> >> >>>> I wonder if the DataFusion project might be a more natural home for this
> > > >> >> >>>> functionality? UDFs are more of a query engine concept, whereas arrow-rs is
> > > >> >> >>>> more focused on purely physical execution?
> > > >> >> >>>>
> > > >> >> >>>> On 28 June 2024 19:41:39 BST, Runji Wang <wangrunji0...@163.com> wrote:
> > > >> >> >>>>> Hi Felipe,
> > > >> >> >>>>>
> > > >> >> >>>>> Vectorization will be applied whenever possible.
> > > >> >> >>>>> When all input and output types of a function are primitive (int16,
> > > >> >> >>>>> int32, int64, float32, float64) and do not involve any Option or
> > > >> >> >>>>> Result, the macro will automatically generate code based on unary
> > > >> >> >>>>> <https://docs.rs/arrow/latest/arrow/compute/fn.unary.html> or binary
> > > >> >> >>>>> <https://docs.rs/arrow/latest/arrow/compute/fn.binary.html> kernels,
> > > >> >> >>>>> which potentially allows for vectorization.
> > > >> >> >>>>>
> > > >> >> >>>>> Neither of the examples you showed is vectorized: the `div` function
> > > >> >> >>>>> because of its Result output, and `gcd` because of the loop in its
> > > >> >> >>>>> implementation. However, if the function is simple enough, like an
> > > >> >> >>>>> `add` function:
> > > >> >> >>>>>
> > > >> >> >>>>> #[function("add(int, int) -> int")]
> > > >> >> >>>>> fn add(a: i32, b: i32) -> i32 {
> > > >> >> >>>>>     a + b
> > > >> >> >>>>> }
> > > >> >> >>>>>
> > > >> >> >>>>> It can be auto-vectorized by llvm.
> > > >> >> >>>>>
> > > >> >> >>>>> Runji
> > > >> >> >>>>>
> > > >> >> >>>>>
> > > >> >> >>>>> On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote:
> > > >> >> >>>>>> On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb <al...@influxdata.com> wrote:
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> Hi Xuanwo,
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> Sorry for the delay in responding. I think the ability to easily write
> > > >> >> >>>>>>> functions that "feel" like native functions in whatever language and be
> > > >> >> >>>>>>> able to generate arrow / vectorized versions of them is quite valuable.
> > > >> >> >>>>>>> This is my understanding of what this proposal is about.
> > > >> >> >>>>>>
> > > >> >> >>>>>> My understanding is that it's not vectorized.
> > > >> >> >>>>>> From the examples in risingwavelabs/arrow-udf
> > > >> >> >>>>>> <https://github.com/risingwavelabs/arrow-udf>, it
> > > >> >> >>>>>> looks like the macros generate code that gathers values from columns into
> > > >> >> >>>>>> local scalars that are passed as scalar parameters to user functions. Is
> > > >> >> >>>>>> the hope here that rustc/llvm will auto-vectorize the code?
> > > >> >> >>>>>>
> > > >> >> >>>>>> #[function("gcd(int, int) -> int")]
> > > >> >> >>>>>> fn gcd(mut a: i32, mut b: i32) -> i32 {
> > > >> >> >>>>>>     while b != 0 {
> > > >> >> >>>>>>         (a, b) = (b, a % b);
> > > >> >> >>>>>>     }
> > > >> >> >>>>>>     a
> > > >> >> >>>>>> }
> > > >> >> >>>>>>
> > > >> >> >>>>>> #[function("div(int, int) -> int")]
> > > >> >> >>>>>> fn div(x: i32, y: i32) -> Result<i32, &'static str> {
> > > >> >> >>>>>>     if y == 0 {
> > > >> >> >>>>>>         return Err("division by zero");
> > > >> >> >>>>>>     }
> > > >> >> >>>>>>     Ok(x / y)
> > > >> >> >>>>>> }
> > > >> >> >>>>>>
> > > >> >> >>>>>>> I left some additional comments on the markdown.
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> One thing that might be worth doing is to articulate some other potential
> > > >> >> >>>>>>> locations for where the code might go. One option, as I think you propose,
> > > >> >> >>>>>>> is to make its own repository. Another option could be to donate the code
> > > >> >> >>>>>>> and put the various language bindings in the same repo as the arrow
> > > >> >> >>>>>>> language implementations (e.g. arrow-rs, arrow for python, etc.), which
> > > >> >> >>>>>>> would likely make it easier to maintain and discover.
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> I am curious about what other devs / users feel about this?
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> Andrew
> > > >> >> >>>>>>>
> > > >> >> >>>>>>>
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <xu...@apache.org> wrote:
> > > >> >> >>>>>>>
> > > >> >> >>>>>>>> Hello, everyone.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> I am starting this thread to discuss the donation of a User-Defined
> > > >> >> >>>>>>>> Function Framework for Apache Arrow.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Feel free to review and leave your comments here. For live review,
> > > >> >> >>>>>>>> please visit:
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> https://hackmd.io/@xuanwo/apache-arrow-udf
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> The original content is also pasted here for quick reading:
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ------
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Abstract
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Arrow UDF is a User-Defined Function Framework for Apache Arrow.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Proposal
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Arrow UDF allows users to easily create and run user-defined functions
> > > >> >> >>>>>>>> (UDF) in Rust, Python, Java or JavaScript based on Apache Arrow. The
> > > >> >> >>>>>>>> functions can be executed natively, or in WebAssembly, or in a remote
> > > >> >> >>>>>>>> server via Arrow Flight.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Arrow UDF was originally designed to be used by the RisingWave project
> > > >> >> >>>>>>>> but is now being used by Databend and several database startups.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> We believe that the Arrow UDF project will provide diversity value to
> > > >> >> >>>>>>>> the entire Arrow community.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Background
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Arrow UDF has been developed by an open-source community from day one
> > > >> >> >>>>>>>> and is owned by RisingWaveLabs. The project was launched in December
> > > >> >> >>>>>>>> 2023.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Initial Goals
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> By transferring ownership of the project to Apache Arrow, Arrow UDF
> > > >> >> >>>>>>>> expects to ensure its neutrality and further encourage and facilitate
> > > >> >> >>>>>>>> the adoption of Arrow UDF by the community.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Current Status
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Contributors: 5
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Users:
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> - [RisingWave]: A Distributed SQL Database for Stream Processing.
> > > >> >> >>>>>>>> - [Databend]: An open-source cloud data warehouse that serves as a
> > > >> >> >>>>>>>>   cost-effective alternative to Snowflake.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Documentation
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> The documentation for Arrow UDF is hosted at
> > > >> >> >>>>>>>> https://docs.rs/arrow-udf/latest/arrow_udf/.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Initial Source
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> The project currently holds a GitHub repository and multiple packages:
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Rust:
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> - https://crates.io/arrow-udf/
> > > >> >> >>>>>>>> - https://crates.io/arrow-udf-python/
> > > >> >> >>>>>>>> - https://crates.io/arrow-udf-js/
> > > >> >> >>>>>>>> - https://crates.io/arrow-udf-js-deno/
> > > >> >> >>>>>>>> - https://crates.io/arrow-udf-wasm/
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Python:
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> - https://pypi.org/project/arrow-udf/
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Those packages will retain their names, while the repository will be
> > > >> >> >>>>>>>> moved to the apache org.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Required Resources
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ### Mailing Lists
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> We can reuse the existing mailing lists that Arrow has.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ### Git Repositories
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> From
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> To
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> - https://gitbox.apache.org/asf/repos/arrow-udf
> > > >> >> >>>>>>>> - https://github.com/apache/arrow-udf
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ### Issue Tracking
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> The project would like to continue using GitHub Issues.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ### Other Resources
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> The project has already chosen GitHub Actions as its continuous
> > > >> >> >>>>>>>> integration tool.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Initial Committers
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> - Runji Wang wangrunji0...@163.com
> > > >> >> >>>>>>>> - Giovanny Gutiérrez
> > > >> >> >>>>>>>> - sundy-li sund...@apache.org
> > > >> >> >>>>>>>> - Xuanwo xua...@apache.org
> > > >> >> >>>>>>>> - Max Justus Spransy maxjus...@gmail.com
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> [RisingWave]: https://github.com/risingwavelabs/risingwave
> > > >> >> >>>>>>>> [Databend]: https://github.com/datafuselabs/databend
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Xuanwo
> > > >> >> >>>>>>>>
> > > >> >>
> > > >> >> --
> > > >> >> Xuanwo
> > > >> >>
> > > >>
> > > >> --
> > > >> Xuanwo
> > > >>
> > >
> > > --
> > > Xuanwo
> > >
> > > https://xuanwo.io/
> > >
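
Note on the boilerplate discussed in this thread: the sketch below illustrates, using only the Rust standard library, what "generating a column-level kernel from a scalar function" amounts to. It is not the arrow-udf macro output and does not use the arrow crate. The `Column` type (a plain `Vec<Option<i32>>` standing in for a nullable Arrow Int32 array), the `lift_binary` helper, and the error type are all hypothetical names chosen for illustration; only the `gcd` body is taken from the example quoted in the thread.

```rust
/// Scalar function, written the way a user would write it
/// (same body as the `gcd` example quoted in the thread).
fn gcd(mut a: i32, mut b: i32) -> i32 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

/// Hypothetical stand-in for a nullable Int32 column: one optional value per row.
type Column = Vec<Option<i32>>;

/// Hypothetical helper that lifts a two-argument scalar function to whole
/// columns. An output row is null whenever either input row is null.
fn lift_binary<F>(left: &Column, right: &Column, f: F) -> Result<Column, String>
where
    F: Fn(i32, i32) -> i32,
{
    // Columns must have the same number of rows to be combined row by row.
    if left.len() != right.len() {
        return Err(format!(
            "length mismatch: {} vs {}",
            left.len(),
            right.len()
        ));
    }
    Ok(left
        .iter()
        .zip(right.iter())
        .map(|(l, r)| match (l, r) {
            (Some(l), Some(r)) => Some(f(*l, *r)),
            _ => None,
        })
        .collect())
}

/// The column-level wrapper that would otherwise be written by hand for
/// every scalar function; this is the part a macro could generate.
fn eval_gcd(left: &Column, right: &Column) -> Result<Column, String> {
    lift_binary(left, right, gcd)
}
```

With these definitions, calling `eval_gcd` on the columns `[12, null, 7]` and `[18, 4, 0]` yields `[6, null, 7]`. Whether the inner loop is actually auto-vectorized by the compiler depends on the scalar function, as noted in the thread: a loop-based `gcd` is unlikely to be, while a simple `add` may be.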