Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

Andrew Lamb Wed, 17 Jul 2024 03:08:15 -0700

An update here is that one of the DataFusion contributors, @xinlifoobar,
did a very neat prototype of using arrow-udf in DataFusion[1] and wrote up
their findings[2].


The major finding is that it would be possible, though it would take some
additional work (e.g. supporting single values, making the function registry
optional, and adding datatype support).

Andrew

[1]: https://github.com/apache/datafusion/pull/11488
[2]:
https://github.com/apache/datafusion/issues/11413#issuecomment-2230502588
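
One of the gaps listed above is support for single values. Here is a minimal sketch of what that could involve, assuming "single values" means scalar (non-array) arguments that must be broadcast against a column. The `Columnar` enum, `eval_binary`, and the `Vec<Option<i64>>` column stand-in are hypothetical names for illustration, not the DataFusion or arrow-udf API.

```rust
/// Hypothetical stand-in for a function argument: either a nullable Int64
/// column or a single nullable value (not a DataFusion or arrow-udf type).
#[derive(Debug, Clone, PartialEq)]
enum Columnar {
    Array(Vec<Option<i64>>),
    Scalar(Option<i64>),
}

impl Columnar {
    /// Value at row `i`; a scalar is broadcast to every row.
    fn get(&self, i: usize) -> Option<i64> {
        match self {
            Columnar::Array(values) => values[i],
            Columnar::Scalar(value) => *value,
        }
    }

    /// Row count for an array, `None` for a scalar.
    fn len(&self) -> Option<usize> {
        match self {
            Columnar::Array(values) => Some(values.len()),
            Columnar::Scalar(_) => None,
        }
    }
}

/// The scalar function a user would write.
fn gcd(mut a: i64, mut b: i64) -> i64 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

/// Lift a scalar binary function over two arguments, either of which may be
/// a single value. Nulls propagate: a null input gives a null output.
fn eval_binary(
    left: &Columnar,
    right: &Columnar,
    f: fn(i64, i64) -> i64,
) -> Result<Columnar, String> {
    let rows = match (left.len(), right.len()) {
        // Two scalars produce a scalar; no column is materialized.
        (None, None) => {
            let out = left.get(0).zip(right.get(0)).map(|(l, r)| f(l, r));
            return Ok(Columnar::Scalar(out));
        }
        (Some(n), None) | (None, Some(n)) => n,
        (Some(l), Some(r)) if l == r => l,
        (Some(l), Some(r)) => return Err(format!("length mismatch: {l} vs {r}")),
    };
    let out = (0..rows)
        .map(|i| left.get(i).zip(right.get(i)).map(|(l, r)| f(l, r)))
        .collect();
    Ok(Columnar::Array(out))
}
```

For example, calling `eval_binary` with the column `[12, null, 7]`, the scalar `18`, and `gcd` yields the column `[6, null, 1]`; the scalar is never expanded into a full column.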



On Thu, Jul 4, 2024 at 3:35 PM Felipe Oliveira Carvalho <felipe...@gmail.com>
wrote:

> Hi Andrew,
>
> During the Arrow Community Meeting I asked Xuanwo many questions trying to
> clarify my understanding of what they mean by "UDF".
>
> To me and you it seems to mean "user defined compute kernels", but in the
> context of these libraries it's *also that* plus the ability to call these
> functions by making calls to Flight services. The ultimate goal being
> federated querying (my words, not theirs) from databases that can talk to
> Flight to run these UDFs running on them.
>
> IMO it's too big of a problem to solve in the context of Arrow, but I
> suggested that they modularize it more (i.e. separate the kernel-writing
> parts from the parts that make the Flight wiring) and try to incorporate
> the kernel-writing parts (like the Rust macros) into the arrow-LANG
> implementation so that their framework for writing Flight services for UDFs
> could be much smaller and specific to whatever strategy for connecting query
> engines they see fit.
>
> Others suggested work on standardizing the protocols used to call these
> UDFs, but IMO that is a huge undertaking. There are simply too many aspects
> to calling UDFs across different services. Standardizing the practices
> around this will take a lot of experimenting.
>
> --
> Felipe
>
> On Wed, Jul 3, 2024 at 3:43 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> > What does everyone think about renaming this library to something like
> > `arrow-auto-vectorizer` or `arrow-functions` to emphasize its role with
> > codegen of vectorized implementations?
> >
> > In discussing this proposal internally, it took a while to explain what the
> > use case of the library is.
> >
> > From my understanding, the use case is "Automatically generate vectorized
> > kernels from scalar functions".
> >
> > While this can be used for User Defined Functions (UDFs), there are many
> > other uses too (like "built in functions" in processing engines)
> >
> > Andrew
> >
> >
> > On Mon, Jul 1, 2024 at 8:10 AM Xuanwo <xua...@apache.org> wrote:
> >
> > > I have cross-posted the proposal to the DataFusion community to collect more
> > > feedback:
> > >
> > > https://github.com/apache/datafusion/discussions/11192
> > >
> > > On Mon, Jul 1, 2024, at 19:31, Andrew Lamb wrote:
> > > > > I have been thinking about this project more, and the more I think about it
> > > > > the more I like it.
> > > > >
> > > > > As an example of the kind of leverage a library like this might bring, we
> > > > > might consider changing the implementation of Arrow UDF to re-use the
> > > > > underlying buffers when possible (e.g. via unary_mut[1]). This would likely
> > > > > provide an across-the-board efficiency improvement at no cost to
> > > > > downstream crates.
> > > >
> > > > Andrew
> > > >
> > > > [1]:
> > > >
> > >
> >
> https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary_mut
> > > >
> > > > On Sat, Jun 29, 2024 at 1:47 AM Xuanwo <xua...@apache.org> wrote:
> > > >
> > > > >> > That said, wherever it ends up, there should be the agreement of
> > > > >> > individuals to accept maintenance of it. Since it's in rust, that would
> > > > >> > generally fall to the arrow-rs contributors and/or the DataFusion
> > > > >> > contributors IMO.
> > > > >> >
> > > > >> > It would be good for it to be part of the community, but only if it's not
> > > > >> > going to end up just bitrotting somewhere.
> > > >>
> > > >> Thanks Matt. This concern does make sense.
> > > >>
> > > > >> Arrow UDF is extensively used within RisingWave and Databend. We, the initial
> > > > >> committers from both RisingWave and Databend, are eager to take responsibility
> > > > >> for maintaining these crates.
> > > > >>
> > > > >> Additionally, some of us are involved in other Apache Projects, so we understand
> > > > >> how the Apache Way functions. We will focus on community growth to ensure this
> > > > >> project remains active.
> > > >>
> > > >> On Sat, Jun 29, 2024, at 13:29, Matt Topol wrote:
> > > > >> >> This UDF implementation doesn’t depend on DataFusion. It can work with
> > > > >> >> any data in the arrow format.
> > > > >> >
> > > > >> > Given this I'm in agreement with Antoine that it would be weird for it to
> > > > >> > be maintained within the DataFusion repo as opposed to its own repo (as
> > > > >> > we've done in the past for things like nanoarrow and arrow-experiments).
> > > >> >
> > > > >> > That said, wherever it ends up, there should be the agreement of
> > > > >> > individuals to accept maintenance of it. Since it's in rust, that would
> > > > >> > generally fall to the arrow-rs contributors and/or the DataFusion
> > > > >> > contributors IMO.
> > > > >> >
> > > > >> > It would be good for it to be part of the community, but only if it's not
> > > > >> > going to end up just bitrotting somewhere.
> > > >> >
> > > >> > --Matt
> > > >> >
> > > >> > On Fri, Jun 28, 2024, 8:49 PM Xuanwo <xua...@apache.org> wrote:
> > > >> >
> > > >> >> Hi,
> > > >> >>
> > > > >> >> This UDF implementation doesn’t depend on DataFusion. It can work with any
> > > > >> >> data in the Arrow format.
> > > > >> >>
> > > > >> >> It has the potential to let users write ONE UDF that works across
> > > > >> >> different query engines, as we have shown in Databend and RisingWave.
> > > > >> >>
> > > > >> >> So I personally think it should be part of the Arrow community.
> > > >> >>
> > > >> >> On Sat, Jun 29, 2024, at 05:06, Antoine Pitrou wrote:
> > > > >> >> > Is this UDF implementation based on DataFusion? If so, it makes sense
> > > > >> >> > for it to be part of the DataFusion project.
> > > > >> >> >
> > > > >> >> > OTOH, if it can work with any data in the Arrow format, then it would
> > > > >> >> > sound weird to maintain it in the DataFusion repo IMHO.
> > > >> >> >
> > > >> >> > Regards
> > > >> >> >
> > > >> >> > Antoine.
> > > >> >> >
> > > >> >> >
> > > >> >> > Le 28/06/2024 à 21:52, Andrew Lamb a écrit :
> > > > >> >> >> To be clear, if the arrow community thinks this would be better organized /
> > > > >> >> >> administered in the Apache DataFusion project (especially if it is aligned
> > > > >> >> >> with Rust) I think it would be good to discuss donating there
> > > >> >> >>
> > > >> >> >> On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb <
> > al...@influxdata.com
> > > >
> > > >> >> wrote:
> > > >> >> >>
> > > >> >> >>> I think there are two aspects:
> > > >> >> >>> 1. The actual mechanics of implementing functions
> > > > >> >> >>> 2. The actual library of udf functions (e.g. sin, cos, nullif, etc)
> > > > >> >> >>>
> > > > >> >> >>> I agree 2 is not something that belongs naturally in the arrow project and
> > > > >> >> >>> is better aligned with query engines
> > > > >> >> >>>
> > > > >> >> >>> However I think 1 is worth considering.
> > > > >> >> >>>
> > > > >> >> >>> As I understand it, the problem arrow_udf solves is avoiding some of the
> > > > >> >> >>> boilerplate required to make vectorized udfs. So instead of writing a
> > > > >> >> >>> special eval_gcd function like this
> > > >> >> >>>
> > > >> >> >>> ```
> > > >> >> >>> fn gcd(l: i64, r: i64) -> i64 {
> > > >> >> >>>   // do gcd calculation
> > > >> >> >>> }
> > > >> >> >>>
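> > > > >> >> >>> // implement vectorized version
> > > > >> >> >>> fn eval_gcd(left: &ArrayRef, right: &ArrayRef) -> Result<ArrayRef, ArrowError> {
> > > > >> >> >>>    // downcast the dynamically typed arrays to Int64 arrays
> > > > >> >> >>>    let left = left.as_primitive::<Int64Type>();
> > > > >> >> >>>    let right = right.as_primitive::<Int64Type>();
> > > > >> >> >>>    // `binary` applies gcd element-wise and errors if the lengths differ
> > > > >> >> >>>    let res: Int64Array = binary(left, right, |l, r| gcd(l, r))?;
> > > > >> >> >>>    Ok(Arc::new(res))
> > > > >> >> >>> }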
> > > >> >> >>> ```
> > > >> >> >>>
> > > > >> >> >>> The user simply annotates the scalar function and has the library
> > > > >> >> >>> generate the array version:
> > > >> >> >>> ```
> > > > >> >> >>> #[function("gcd(int64, int64) -> int64", output = "eval_gcd")]
> > > >> >> >>> fn gcd(l: i64, r: i64) -> i64 {
> > > >> >> >>>   // do gcd calculation
> > > >> >> >>> }
> > > >> >> >>> ```
> > > >> >> >>>
> > > > >> >> >>> We have a lot of boilerplate / non-ideal macro code in DataFusion that I
> > > > >> >> >>> think this would help with a lot.
> > > >> >> >>>
> > > >> >> >>> Andrew
> > > >> >> >>>
> > > >> >> >>>
> > > >> >> >>> On Fri, Jun 28, 2024 at 3:08 PM Raphael Taylor-Davies
> > > >> >> >>> <r.taylordav...@googlemail.com.invalid> wrote:
> > > >> >> >>>
> > > > >> >> >>>> I wonder if the DataFusion project might be a more natural home for this
> > > > >> >> >>>> functionality? UDFs are more of a query engine concept, whereas arrow-rs is
> > > > >> >> >>>> more focused on purely physical execution?
> > > >> >> >>>>
> > > >> >> >>>> On 28 June 2024 19:41:39 BST, Runji Wang <
> > wangrunji0...@163.com
> > > >
> > > >> >> wrote:
> > > >> >> >>>>> Hi Felipe,
> > > >> >> >>>>>
> > > > >> >> >>>>> Vectorization will be applied whenever possible. When all input and
> > > > >> >> >>>>> output types of a function are primitive (int16, int32, int64, float32,
> > > > >> >> >>>>> float64) and do not involve any Option or Result, the macro will
> > > > >> >> >>>>> automatically generate code based on unary
> > > > >> >> >>>>> <https://docs.rs/arrow/latest/arrow/compute/fn.unary.html> or binary
> > > > >> >> >>>>> <https://docs.rs/arrow/latest/arrow/compute/fn.binary.html> kernels,
> > > > >> >> >>>>> which potentially allows for vectorization.
> > > >> >> >>>>>
> > > > >> >> >>>>> Neither of the examples you showed is vectorized: `div` because of
> > > > >> >> >>>>> its Result output, and `gcd` because of the loop in its implementation.
> > > > >> >> >>>>> However, if the function is simple enough, like an `add` function:
> > > >> >> >>>>>
> > > >> >> >>>>> #[function("add(int, int) -> int")]
> > > >> >> >>>>> fn add(a: i32, b: i32) -> i32 {
> > > >> >> >>>>>     a + b
> > > >> >> >>>>> }
> > > >> >> >>>>>
> > > > >> >> >>>>> It can be auto-vectorized by LLVM.
> > > >> >> >>>>>
> > > >> >> >>>>> Runji
> > > >> >> >>>>>
> > > >> >> >>>>>
> > > >> >> >>>>> On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote:
> > > >> >> >>>>>> On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb <
> > > >> al...@influxdata.com>
> > > >> >> >>>> wrote:
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> Hi Xuanwo,
> > > >> >> >>>>>>>
> > > > >> >> >>>>>>> Sorry for the delay in responding. I think the ability to easily write
> > > > >> >> >>>>>>> functions that "feel" like native functions in whatever language and be
> > > > >> >> >>>>>>> able to generate arrow / vectorized versions of them is quite valuable.
> > > > >> >> >>>>>>> This is my understanding of what this proposal is about.
> > > >> >> >>>>>>
> > > > >> >> >>>>>> My understanding is that it's not vectorized. From the examples in
> > > > >> >> >>>>>> risingwavelabs/arrow-udf <https://github.com/risingwavelabs/arrow-udf>, it
> > > > >> >> >>>>>> looks like the macros generate code that gathers values from columns into
> > > > >> >> >>>>>> local scalars that are passed as scalar parameters to user functions. Is
> > > > >> >> >>>>>> the hope here that rustc/llvm will auto-vectorize the code?
> > > >> >> >>>>>>
> > > >> >> >>>>>> #[function("gcd(int, int) -> int")]
> > > >> >> >>>>>> fn gcd(mut a: i32, mut b: i32) -> i32 {
> > > >> >> >>>>>>      while b != 0 {
> > > >> >> >>>>>>          (a, b) = (b, a % b);
> > > >> >> >>>>>>      }
> > > >> >> >>>>>>      a
> > > >> >> >>>>>> }
> > > >> >> >>>>>>
> > > >> >> >>>>>> #[function("div(int, int) -> int")]
> > > >> >> >>>>>> fn div(x: i32, y: i32) -> Result<i32, &'static str> {
> > > >> >> >>>>>>      if y == 0 {
> > > >> >> >>>>>>          return Err("division by zero");
> > > >> >> >>>>>>      }
> > > >> >> >>>>>>      Ok(x / y)
> > > >> >> >>>>>> }
> > > >> >> >>>>>>
> > > >> >> >>>>>>> I left some additional comments on the markdown.
> > > >> >> >>>>>>>
> > > > >> >> >>>>>>> One thing that might be worth doing is articulate some other potential
> > > > >> >> >>>>>>> locations for where the code might go. One option, as I think you propose,
> > > > >> >> >>>>>>> is to make its own repository. Another option could be to donate the code
> > > > >> >> >>>>>>> and put the various language bindings in the same repo as the arrow
> > > > >> >> >>>>>>> language implementations (e.g arrow-rs, arrow for python, etc) which would
> > > > >> >> >>>>>>> likely make it easier to maintain and discover.
> > > > >> >> >>>>>>>
> > > > >> >> >>>>>>> I am curious about what other devs / users feel about this?
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> Andrew
> > > >> >> >>>>>>>
> > > >> >> >>>>>>>
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <xu...@apache.org
> >
> > > >> wrote:
> > > >> >> >>>>>>>
> > > >> >> >>>>>>>> Hello, everyone.
> > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> I am starting this thread to discuss the donation of a User-Defined
> > > > >> >> >>>>>>>> Function Framework for Apache Arrow.
> > > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> Feel free to review and leave your comments here. For live review, please
> > > > >> >> >>>>>>>> visit:
> > > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> https://hackmd.io/@xuanwo/apache-arrow-udf
> > > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> The original content is also pasted here for quick reading:
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ------
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Abstract
> > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> Arrow UDF is a User-Defined Function Framework for Apache Arrow.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Proposal
> > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> Arrow UDF allows users to easily create and run user-defined functions
> > > > >> >> >>>>>>>> (UDFs) in Rust, Python, Java or JavaScript based on Apache Arrow. The
> > > > >> >> >>>>>>>> functions can be executed natively, in WebAssembly, or on a remote
> > > > >> >> >>>>>>>> server via Arrow Flight.
> > > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> Arrow UDF was originally designed to be used by the RisingWave project but
> > > > >> >> >>>>>>>> is now being used by Databend and several database startups.
> > > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> We believe that the Arrow UDF project will provide diversity value to the
> > > > >> >> >>>>>>>> entire Arrow community.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Background
> > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> Arrow UDF has been developed by an open-source community from day one and
> > > > >> >> >>>>>>>> is owned by RisingWaveLabs. The project was launched in December 2023.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Initial Goals
> > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> By transferring ownership of the project to Apache Arrow, Arrow UDF
> > > > >> >> >>>>>>>> expects to ensure its neutrality and further encourage and facilitate the
> > > > >> >> >>>>>>>> adoption of Arrow UDF by the community.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Current Status
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Contributors: 5
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Users:
> > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> -   [RisingWave]: A Distributed SQL Database for Stream Processing.
> > > > >> >> >>>>>>>> -   [Databend]: An open-source cloud data warehouse that serves as a
> > > > >> >> >>>>>>>>     cost-effective alternative to Snowflake.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Documentation
> > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> The documentation for Arrow UDF is hosted at
> > > >> >> >>>>>>>> https://docs.rs/arrow-udf/latest/arrow_udf/.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Initial Source
> > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> The project currently holds a GitHub repository and multiple packages:
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Rust:
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> - https://crates.io/arrow-udf/
> > > >> >> >>>>>>>> - https://crates.io/arrow-udf-python/
> > > >> >> >>>>>>>> - https://crates.io/arrow-udf-js/
> > > >> >> >>>>>>>> - https://crates.io/arrow-udf-js-deno/
> > > >> >> >>>>>>>> - https://crates.io/arrow-udf-wasm/
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Python:
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> - https://pypi.org/project/arrow-udf/
> > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> These packages will retain their names, while the repository will be
> > > > >> >> >>>>>>>> moved to the apache org.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Required Resources
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ### Mailing Lists
> > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> We can reuse the existing mailing lists that Arrow has.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ### Git Repositories
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> From
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> To
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> - https://gitbox.apache.org/asf/repos/arrow-udf
> > > >> >> >>>>>>>> - https://github.com/apache/arrow-udf
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ### Issue Tracking
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> The project would like to continue using GitHub Issues.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ### Other Resources
> > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> The project has already chosen GitHub Actions as its continuous
> > > > >> >> >>>>>>>> integration tool.
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> ## Initial Committers
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> - Runji Wang wangrunji0...@163.com
> > > >> >> >>>>>>>> - Giovanny Gutiérrez
> > > >> >> >>>>>>>> - sundy-li sund...@apache.org
> > > >> >> >>>>>>>> - Xuanwo xua...@apache.org
> > > >> >> >>>>>>>> - Max Justus Spransy maxjus...@gmail.com
> > > >> >> >>>>>>>>
> > > > >> >> >>>>>>>> [RisingWave]: https://github.com/risingwavelabs/risingwave
> > > >> >> >>>>>>>> [Databend]: https://github.com/datafuselabs/databend
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>>> Xuanwo
> > > >> >> >>>>>>>>
> > > >> >> >>>>>>
> > > >> >> >>>
> > > >> >> >>>
> > > >> >> >>
> > > >> >>
> > > >> >> --
> > > >> >> Xuanwo
> > > >> >>
> > > >>
> > > >> --
> > > >> Xuanwo
> > > >>
> > >
> > > --
> > > Xuanwo
> > >
> > > https://xuanwo.io/
> > >
> >
>
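
The distinction drawn in the quoted discussion (infallible functions over primitive types can go through a tight unary/binary-style loop, while functions returning Result cannot) can be sketched without Arrow itself. This is an illustration only: `Int32Column`, `eval_infallible`, and `eval_fallible` are hypothetical stand-ins, not code generated by arrow-udf, and failing the whole batch on the first error is an assumption of this sketch.

```rust
/// Hypothetical stand-in for a nullable Int32 column: a dense value buffer
/// plus a validity mask (not the arrow-rs or arrow-udf types).
struct Int32Column {
    values: Vec<i32>,
    valid: Vec<bool>,
}

impl Int32Column {
    fn from_options(items: &[Option<i32>]) -> Self {
        Int32Column {
            values: items.iter().map(|v| v.unwrap_or(0)).collect(),
            valid: items.iter().map(|v| v.is_some()).collect(),
        }
    }

    fn to_options(&self) -> Vec<Option<i32>> {
        self.values
            .iter()
            .zip(&self.valid)
            .map(|(&v, &ok)| if ok { Some(v) } else { None })
            .collect()
    }
}

/// Infallible scalar function (wrapping so it can never panic).
fn add(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}

/// Fallible scalar function, as in the `div` example from the thread.
fn div(x: i32, y: i32) -> Result<i32, &'static str> {
    if y == 0 {
        return Err("division by zero");
    }
    Ok(x.wrapping_div(y))
}

/// Infallible path: apply `f` to every slot, null or not, and combine the
/// validity masks separately. The value loop has no branches, which is what
/// gives the compiler a chance to auto-vectorize it.
fn eval_infallible(l: &Int32Column, r: &Int32Column, f: impl Fn(i32, i32) -> i32) -> Int32Column {
    assert_eq!(l.values.len(), r.values.len(), "columns must have equal length");
    let values = l.values.iter().zip(&r.values).map(|(&a, &b)| f(a, b)).collect();
    let valid = l.valid.iter().zip(&r.valid).map(|(&a, &b)| a && b).collect();
    Int32Column { values, valid }
}

/// Fallible path: null slots must be skipped (their placeholder values could
/// trigger spurious errors), so every row needs a validity check and an
/// error check.
fn eval_fallible(
    l: &Int32Column,
    r: &Int32Column,
    f: impl Fn(i32, i32) -> Result<i32, &'static str>,
) -> Result<Int32Column, &'static str> {
    assert_eq!(l.values.len(), r.values.len(), "columns must have equal length");
    let mut values = Vec::with_capacity(l.values.len());
    let mut valid = Vec::with_capacity(l.values.len());
    for i in 0..l.values.len() {
        let ok = l.valid[i] && r.valid[i];
        values.push(if ok { f(l.values[i], r.values[i])? } else { 0 });
        valid.push(ok);
    }
    Ok(Int32Column { values, valid })
}
```

With columns `[6, null, 9]` and `[3, 0, 2]`, `eval_infallible` with `add` gives `[9, null, 11]`, and `eval_fallible` with `div` gives `[2, null, 4]` because the null row is skipped before the zero divisor is ever inspected.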
