What does everyone think about renaming this library to something like `arrow-auto-vectorizer` or `arrow-functions` to emphasize its role with codegen of vectorized implementations?
In discussing this proposal internally, it took a while to explain what the use case of the library is.

From my understanding, the use case is "Automatically generate vectorized kernels from scalar functions". While this can be used for User Defined Functions (UDFs), there are many other uses too (like "built-in functions" in processing engines).

Andrew

On Mon, Jul 1, 2024 at 8:10 AM Xuanwo <xua...@apache.org> wrote:

> I have cross-posted the proposal to the DataFusion community to collect more feedback:
>
> https://github.com/apache/datafusion/discussions/11192
>
> On Mon, Jul 1, 2024, at 19:31, Andrew Lamb wrote:
> > I have been thinking about this project more, and the more I think about it
> > the more I like it.
> >
> > As an example of the kind of leverage a library like this might bring, we
> > might consider changing the implementation of Arrow UDF to re-use the
> > underlying buffers when possible (e.g. via unary_mut[1]). This would likely
> > provide an across-the-board efficiency improvement at no cost to
> > downstream crates.
> >
> > Andrew
> >
> > [1]: https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary_mut
> >
> > On Sat, Jun 29, 2024 at 1:47 AM Xuanwo <xua...@apache.org> wrote:
> >
> >> > That said, wherever it ends up, there should be the agreement of
> >> > individuals to accept maintenance of it. Since it's in rust, that would
> >> > generally fall to the arrow-rs contributors and/or the DataFusion
> >> > contributors IMO.
> >> >
> >> > It would be good for it to be part of the community, but only if it's not
> >> > going to end up just bitrotting somewhere.
> >>
> >> Thanks Matt. This concern does make sense.
> >>
> >> Arrow UDF is extensively used within RisingWave and Databend. We, the initial
> >> committers from both RisingWave and Databend, are eager to take responsibility
> >> for maintaining these crates.
> >>
> >> Additionally, some of us are involved in other Apache Projects, so we understand
> >> how the Apache Way functions. We will focus on community growth to ensure this
> >> project remains active.
> >>
> >> On Sat, Jun 29, 2024, at 13:29, Matt Topol wrote:
> >> >> This UDF implementation doesn’t depend on DataFusion. It can work with
> >> >> any data in the arrow format.
> >> >
> >> > Given this I'm in agreement with Antoine that it would be weird for it to
> >> > be maintained within the DataFusion repo as opposed to its own repo (as
> >> > we've done in the past for things like nanoarrow and arrow-experiments).
> >> >
> >> > That said, wherever it ends up, there should be the agreement of
> >> > individuals to accept maintenance of it. Since it's in rust, that would
> >> > generally fall to the arrow-rs contributors and/or the DataFusion
> >> > contributors IMO.
> >> >
> >> > It would be good for it to be part of the community, but only if it's not
> >> > going to end up just bitrotting somewhere.
> >> >
> >> > --Matt
> >> >
> >> > On Fri, Jun 28, 2024, 8:49 PM Xuanwo <xua...@apache.org> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> This UDF implementation doesn’t depend on DataFusion. It can work with any
> >> >> data in the arrow format.
> >> >>
> >> >> It has the potential to let users write ONE UDF function that works
> >> >> for different query engines, as we have shown in Databend and RisingWave.
> >> >>
> >> >> So I personally think it should be part of the arrow community.
> >> >>
> >> >> On Sat, Jun 29, 2024, at 05:06, Antoine Pitrou wrote:
> >> >> > Is this UDF implementation based on DataFusion? If so, it makes sense
> >> >> > for it to be part of the DataFusion project.
> >> >> >
> >> >> > OTOH, if it can work with any data in the Arrow format, then it would
> >> >> > sound weird to maintain it in the DataFusion repo IMHO.
> >> >> >
> >> >> > Regards
> >> >> >
> >> >> > Antoine.
> >> >> >
> >> >> >
> >> >> > On 28/06/2024 at 21:52, Andrew Lamb wrote:
> >> >> >> To be clear, if the arrow community thinks this would be better organized /
> >> >> >> administered in the Apache DataFusion project (especially if it is aligned
> >> >> >> with Rust) I think it would be good to discuss donating there
> >> >> >>
> >> >> >> On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb <al...@influxdata.com> wrote:
> >> >> >>
> >> >> >>> I think there are two aspects:
> >> >> >>> 1. The actual mechanics of implementing functions
> >> >> >>> 2. The actual library of udf functions (e.g. sin, cos, nullif, etc)
> >> >> >>>
> >> >> >>> I agree 2 is not something that belongs naturally in the arrow project and
> >> >> >>> is better aligned with query engines
> >> >> >>>
> >> >> >>> However I think 1 is worth considering.
> >> >> >>>
> >> >> >>> As I understand it, the problem arrow_udf solves is avoiding some of the
> >> >> >>> boilerplate required to make vectorized udfs.
> >> >> >>> So instead of writing a special eval_gcd function like this
> >> >> >>>
> >> >> >>> ```
> >> >> >>> use std::sync::Arc;
> >> >> >>> use arrow::array::{ArrayRef, AsArray, PrimitiveArray};
> >> >> >>> use arrow::compute::binary;
> >> >> >>> use arrow::datatypes::Int64Type;
> >> >> >>>
> >> >> >>> fn gcd(l: i64, r: i64) -> i64 {
> >> >> >>>   // do gcd calculation
> >> >> >>> }
> >> >> >>>
> >> >> >>> // implement vectorized version
> >> >> >>> fn eval_gcd(left: &ArrayRef, right: &ArrayRef) -> ArrayRef {
> >> >> >>>   let left = left.as_primitive::<Int64Type>();
> >> >> >>>   let right = right.as_primitive::<Int64Type>();
> >> >> >>>   // `binary` returns a Result, so it is unwrapped here for brevity
> >> >> >>>   let res: PrimitiveArray<Int64Type> = binary(left, right, |l, r| gcd(l, r)).unwrap();
> >> >> >>>   Arc::new(res)
> >> >> >>> }
> >> >> >>> ```
> >> >> >>>
> >> >> >>> The user simply annotates the scalar function and has the library
> >> >> >>> codegen the array version
> >> >> >>> ```
> >> >> >>> #[function("gcd(int64, int64) -> int64", output = "eval_gcd")]
> >> >> >>> fn gcd(l: i64, r: i64) -> i64 {
> >> >> >>>   // do gcd calculation
> >> >> >>> }
> >> >> >>> ```
> >> >> >>>
> >> >> >>> We have a lot of boilerplate / non-ideal macro stuff in DataFusion that I
> >> >> >>> think this would help with a lot.
> >> >> >>>
> >> >> >>> Andrew
> >> >> >>>
> >> >> >>>
> >> >> >>> On Fri, Jun 28, 2024 at 3:08 PM Raphael Taylor-Davies
> >> >> >>> <r.taylordav...@googlemail.com.invalid> wrote:
> >> >> >>>
> >> >> >>>> I wonder if the DataFusion project might be a more natural home for this
> >> >> >>>> functionality? UDFs are more of a query engine concept, whereas arrow-rs is
> >> >> >>>> more focused on purely physical execution?
> >> >> >>>>
> >> >> >>>> On 28 June 2024 19:41:39 BST, Runji Wang <wangrunji0...@163.com> wrote:
> >> >> >>>>> Hi Felipe,
> >> >> >>>>>
> >> >> >>>>> Vectorization will be applied whenever possible.
> >> >> >>>>> When all input and output types of a function are primitive (int16, int32,
> >> >> >>>>> int64, float32, float64) and do not involve any Option or Result, the macro
> >> >> >>>>> will automatically generate code based on the unary
> >> >> >>>>> <https://docs.rs/arrow/latest/arrow/compute/fn.unary.html> or binary
> >> >> >>>>> <https://docs.rs/arrow/latest/arrow/compute/fn.binary.html> kernels,
> >> >> >>>>> which potentially allows for vectorization.
> >> >> >>>>>
> >> >> >>>>> Both examples you showed are not vectorized: the `div` function because of
> >> >> >>>>> its Result output, and `gcd` because of the loop in its implementation.
> >> >> >>>>> However, if the function is simple enough, like an `add` function:
> >> >> >>>>>
> >> >> >>>>> #[function("add(int, int) -> int")]
> >> >> >>>>> fn add(a: i32, b: i32) -> i32 {
> >> >> >>>>>     a + b
> >> >> >>>>> }
> >> >> >>>>>
> >> >> >>>>> It can be auto-vectorized by llvm.
> >> >> >>>>>
> >> >> >>>>> Runji
> >> >> >>>>>
> >> >> >>>>>
> >> >> >>>>> On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote:
> >> >> >>>>>> On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb <al...@influxdata.com> wrote:
> >> >> >>>>>>>
> >> >> >>>>>>> Hi Xuanwo,
> >> >> >>>>>>>
> >> >> >>>>>>> Sorry for the delay in responding. I think the ability to easily write
> >> >> >>>>>>> functions that "feel" like native functions in whatever language and be
> >> >> >>>>>>> able to generate arrow / vectorized versions of them is quite valuable.
> >> >> >>>>>>> This is my understanding of what this proposal is about.
> >> >> >>>>>>
> >> >> >>>>>> My understanding is that it's not vectorized.
> >> >> >>>>>> From the examples in risingwavelabs/arrow-udf,
> >> >> >>>>>> <https://github.com/risingwavelabs/arrow-udf> it
> >> >> >>>>>> looks like the macros generate code that gathers values from columns into
> >> >> >>>>>> local scalars that are passed as scalar parameters to user functions. Is
> >> >> >>>>>> the hope here that rustc/llvm will auto-vectorize the code?
> >> >> >>>>>>
> >> >> >>>>>> #[function("gcd(int, int) -> int")]
> >> >> >>>>>> fn gcd(mut a: i32, mut b: i32) -> i32 {
> >> >> >>>>>>     while b != 0 {
> >> >> >>>>>>         (a, b) = (b, a % b);
> >> >> >>>>>>     }
> >> >> >>>>>>     a
> >> >> >>>>>> }
> >> >> >>>>>>
> >> >> >>>>>> #[function("div(int, int) -> int")]
> >> >> >>>>>> fn div(x: i32, y: i32) -> Result<i32, &'static str> {
> >> >> >>>>>>     if y == 0 {
> >> >> >>>>>>         return Err("division by zero");
> >> >> >>>>>>     }
> >> >> >>>>>>     Ok(x / y)
> >> >> >>>>>> }
> >> >> >>>>>>
> >> >> >>>>>>> I left some additional comments on the markdown.
> >> >> >>>>>>>
> >> >> >>>>>>> One thing that might be worth doing is articulate some other potential
> >> >> >>>>>>> locations for where the code might go. One option, as I think you propose,
> >> >> >>>>>>> is to make its own repository. Another option could be to donate the code
> >> >> >>>>>>> and put the various language bindings in the same repo as the arrow
> >> >> >>>>>>> language implementations (e.g arrow-rs, arrow for python, etc) which would
> >> >> >>>>>>> likely make it easier to maintain and discover.
> >> >> >>>>>>>
> >> >> >>>>>>> I am curious about what other devs / users feel about this?
> >> >> >>>>>>>
> >> >> >>>>>>> Andrew
> >> >> >>>>>>>
> >> >> >>>>>>>
> >> >> >>>>>>>
> >> >> >>>>>>> On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <xu...@apache.org> wrote:
> >> >> >>>>>>>
> >> >> >>>>>>>> Hello, everyone.
> >> >> >>>>>>>>
> >> >> >>>>>>>> I am starting this thread to discuss the donation of a User-Defined Function
> >> >> >>>>>>>> Framework for Apache Arrow.
> >> >> >>>>>>>>
> >> >> >>>>>>>> Feel free to review and leave your comments here. For live review, please
> >> >> >>>>>>>> visit:
> >> >> >>>>>>>>
> >> >> >>>>>>>> https://hackmd.io/@xuanwo/apache-arrow-udf
> >> >> >>>>>>>>
> >> >> >>>>>>>> The original content is also pasted here for quick reading:
> >> >> >>>>>>>>
> >> >> >>>>>>>> ------
> >> >> >>>>>>>>
> >> >> >>>>>>>> ## Abstract
> >> >> >>>>>>>>
> >> >> >>>>>>>> Arrow UDF is a User-Defined Function Framework for Apache Arrow.
> >> >> >>>>>>>>
> >> >> >>>>>>>> ## Proposal
> >> >> >>>>>>>>
> >> >> >>>>>>>> Arrow UDF allows users to easily create and run user-defined functions
> >> >> >>>>>>>> (UDFs) in Rust, Python, Java or JavaScript based on Apache Arrow. The
> >> >> >>>>>>>> functions can be executed natively, or in WebAssembly, or in a remote
> >> >> >>>>>>>> server via Arrow Flight.
> >> >> >>>>>>>>
> >> >> >>>>>>>> Arrow UDF was originally designed to be used by the RisingWave project but
> >> >> >>>>>>>> is now being used by Databend and several database startups.
> >> >> >>>>>>>>
> >> >> >>>>>>>> We believe that the Arrow UDF project will provide diversity value to the
> >> >> >>>>>>>> entire Arrow community.
> >> >> >>>>>>>>
> >> >> >>>>>>>> ## Background
> >> >> >>>>>>>>
> >> >> >>>>>>>> Arrow UDF has been developed by an open-source community from day one and
> >> >> >>>>>>>> is owned by RisingWaveLabs. The project was launched in December 2023.
> >> >> >>>>>>>>
> >> >> >>>>>>>> ## Initial Goals
> >> >> >>>>>>>>
> >> >> >>>>>>>> By transferring ownership of the project to Apache Arrow, Arrow UDF
> >> >> >>>>>>>> expects to ensure its neutrality and further encourage and facilitate the
> >> >> >>>>>>>> adoption of Arrow UDF by the community.
> >> >> >>>>>>>>
> >> >> >>>>>>>> ## Current Status
> >> >> >>>>>>>>
> >> >> >>>>>>>> Contributors: 5
> >> >> >>>>>>>>
> >> >> >>>>>>>> Users:
> >> >> >>>>>>>>
> >> >> >>>>>>>> - [RisingWave]: A Distributed SQL Database for Stream Processing.
> >> >> >>>>>>>> - [Databend]: An open-source cloud data warehouse that serves as a
> >> >> >>>>>>>>   cost-effective alternative to Snowflake.
> >> >> >>>>>>>>
> >> >> >>>>>>>> ## Documentation
> >> >> >>>>>>>>
> >> >> >>>>>>>> The documentation of Arrow UDF is hosted at
> >> >> >>>>>>>> https://docs.rs/arrow-udf/latest/arrow_udf/.
> >> >> >>>>>>>>
> >> >> >>>>>>>> ## Initial Source
> >> >> >>>>>>>>
> >> >> >>>>>>>> The project currently holds a GitHub repository and multiple packages:
> >> >> >>>>>>>>
> >> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
> >> >> >>>>>>>>
> >> >> >>>>>>>> Rust:
> >> >> >>>>>>>>
> >> >> >>>>>>>> - https://crates.io/arrow-udf/
> >> >> >>>>>>>> - https://crates.io/arrow-udf-python/
> >> >> >>>>>>>> - https://crates.io/arrow-udf-js/
> >> >> >>>>>>>> - https://crates.io/arrow-udf-js-deno/
> >> >> >>>>>>>> - https://crates.io/arrow-udf-wasm/
> >> >> >>>>>>>>
> >> >> >>>>>>>> Python:
> >> >> >>>>>>>>
> >> >> >>>>>>>> - https://pypi.org/project/arrow-udf/
> >> >> >>>>>>>>
> >> >> >>>>>>>> Those packages will retain their names, while the repository will be moved to
> >> >> >>>>>>>> the apache org.
> >> >> >>>>>>>>
> >> >> >>>>>>>> ## Required Resources
> >> >> >>>>>>>>
> >> >> >>>>>>>> ### Mailing Lists
> >> >> >>>>>>>>
> >> >> >>>>>>>> We can reuse the existing mailing lists that arrow has.
> >> >> >>>>>>>>
> >> >> >>>>>>>> ### Git Repositories
> >> >> >>>>>>>>
> >> >> >>>>>>>> From
> >> >> >>>>>>>>
> >> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
> >> >> >>>>>>>>
> >> >> >>>>>>>> To
> >> >> >>>>>>>>
> >> >> >>>>>>>> - https://gitbox.apache.org/asf/repos/arrow-udf
> >> >> >>>>>>>> - https://github.com/apache/arrow-udf
> >> >> >>>>>>>>
> >> >> >>>>>>>> ### Issue Tracking
> >> >> >>>>>>>>
> >> >> >>>>>>>> The project would like to continue using GitHub Issues.
> >> >> >>>>>>>>
> >> >> >>>>>>>> ### Other Resources
> >> >> >>>>>>>>
> >> >> >>>>>>>> The project has already chosen GitHub actions as continuous integration
> >> >> >>>>>>>> tools.
> >> >> >>>>>>>>
> >> >> >>>>>>>> ## Initial Committers
> >> >> >>>>>>>>
> >> >> >>>>>>>> - Runji Wang wangrunji0...@163.com
> >> >> >>>>>>>> - Giovanny Gutiérrez
> >> >> >>>>>>>> - sundy-li sund...@apache.org
> >> >> >>>>>>>> - Xuanwo xua...@apache.org
> >> >> >>>>>>>> - Max Justus Spransy maxjus...@gmail.com
> >> >> >>>>>>>>
> >> >> >>>>>>>> [RisingWave]: https://github.com/risingwavelabs/risingwave
> >> >> >>>>>>>> [Databend]: https://github.com/datafuselabs/databend
> >> >> >>>>>>>>
> >> >> >>>>>>>> Xuanwo
> >> >> >>>>>>>>
> >> >>
> >> >> --
> >> >> Xuanwo
> >> >>
> >>
> >> --
> >> Xuanwo
> >>
>
> --
> Xuanwo
>
> https://xuanwo.io/
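
To make the "generate vectorized kernels from scalar functions" use case concrete, here is a rough, dependency-free sketch of the kind of wrapper such a macro might generate. It is only an illustration under stated assumptions: `lift_binary` is a hypothetical helper (not part of arrow-rs or arrow-udf), and a slice of `Option<i64>` stands in for an Arrow Int64 array with nulls.

```rust
/// Scalar function: the only code a user would write by hand.
fn gcd(mut a: i64, mut b: i64) -> i64 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

/// Hypothetical helper (an assumption for illustration, not a real arrow API):
/// lifts a scalar binary function into one that operates on whole nullable
/// columns. `&[Option<i64>]` stands in for an Arrow Int64 array.
fn lift_binary<F>(f: F) -> impl Fn(&[Option<i64>], &[Option<i64>]) -> Vec<Option<i64>>
where
    F: Fn(i64, i64) -> i64,
{
    move |left: &[Option<i64>], right: &[Option<i64>]| {
        assert_eq!(left.len(), right.len(), "columns must have equal length");
        left.iter()
            .zip(right.iter())
            // A null in either input yields a null output.
            .map(|(l, r)| match (l, r) {
                (Some(l), Some(r)) => Some(f(*l, *r)),
                _ => None,
            })
            .collect()
    }
}

fn main() {
    // The "array version" is derived from the scalar function, not hand-written.
    let eval_gcd = lift_binary(gcd);
    let left: Vec<Option<i64>> = vec![Some(12), None, Some(7)];
    let right: Vec<Option<i64>> = vec![Some(18), Some(4), Some(0)];
    println!("{:?}", eval_gcd(&left, &right));
}
```

The point of the sketch is only the shape: the user supplies `gcd`, and the column-at-a-time function falls out mechanically. Whether the inner loop is actually auto-vectorized still depends on the scalar body, as discussed in the thread above.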