Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

Felipe Oliveira Carvalho Thu, 04 Jul 2024 12:35:53 -0700

Hi Andrew,

During the Arrow Community Meeting I asked Xuanwo many questions trying to
clarify my understanding of what they mean by "UDF".


To me and you it seems to mean "user defined compute kernels", but in the
context of these libraries it's *also that* plus the ability to call these
functions by making calls to Flight services. The ultimate goal being
federated querying (my words, not them) from databases that can talk to
Flight to run these UDFs running on them.

IMO it's too big of a problem to solve in the context of Arrow, but I
suggested that they modularized it more (i.e. separate the kernel-writing
parts from the parts that make the Flight wiring) and try to incorporate
the kernel-writing parts (like the Rust macros) into the arrow-LANG
implementation so that their framework for writing Flight services for UDFs
could be much smaller and specific to the strategy for connecting querying
engines they see fit.

Others suggested work on standardizing the protocols used to call these
UDFs, but IMO that is a huge undertaking. There are simply too many aspects
to calling UDFs across different services. Standardizing the practices
around this will take a lot of experimenting.

--
Felipe

On Wed, Jul 3, 2024 at 3:43 PM Andrew Lamb <al...@influxdata.com> wrote:

> What does everyone think about renaming this library to something like
> `arrow-auto-vectorizer` or `arrow-functions` to emphasize its role with
> codegen of vectorized implementations?
>
> In discussing this proposal internally, it took a while to explain what the
> usecase of the library is
>
> From my understanding, the use case is "Automatically generate vectorized
> kernels from scalar functions".
>
> While this can be used for User Defined Functions (UDFs), there are many
> other uses too (like "built in functions" in processing engines)
>
> Andrew
>
>
> On Mon, Jul 1, 2024 at 8:10 AM Xuanwo <xua...@apache.org> wrote:
>
> > I have cross-posted the proposal to datafusion community to collect more
> > feedback:
> >
> > https://github.com/apache/datafusion/discussions/11192
> >
> > On Mon, Jul 1, 2024, at 19:31, Andrew Lamb wrote:
> > > I have been thinking about this project more, and the more I think
> about
> > it
> > > the more I like it.
> > >
> > > For example of the kind of leverage a library like this might bring, we
> > > might consider changing the implementation of Arrow UDF to re-use the
> > > underlying buffers when possible (e.g. via unary_mut[1]). This would
> > likely
> > > provide an across the board efficiency improvement for no costs to
> > > downstream crates.
> > >
> > > Andrew
> > >
> > > [1]:
> > >
> >
> https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary_mut
> > >
> > > On Sat, Jun 29, 2024 at 1:47 AM Xuanwo <xua...@apache.org> wrote:
> > >
> > >> > That said, wherever it ends up, there should be the agreement of
> > >> > individuals to accept maintenance of it. Since it's in rust, that
> > would
> > >> > generally fall to the arrow-rs contributors and/or the DataFusion
> > >> > contributors IMO.
> > >> >
> > >> > It would be good for it to be part of the community, but only if
> it's
> > not
> > >> > going to end up just bitrotting somewhere.
> > >>
> > >> Thanks Matt. This concern does make sense.
> > >>
> > >> Arrow UDF is extensively used within RisingWave and Databend. We, the
> > >> initial
> > >> committers from both RisingWave and Databend, are eager to take
> > >> responsibility
> > >> for maintaining these crates.
> > >>
> > >> Additionally, some of us are involved in other Apache Projects, so we
> > >> understand
> > >> how the Apache Way functions. We will focus on community growth to
> > ensure
> > >> this
> > >> project remains active.
> > >>
> > >> On Sat, Jun 29, 2024, at 13:29, Matt Topol wrote:
> > >> >> This UDF implementation doesn’t depend on DataFusion. It can work
> > with
> > >> > any data in the arrow format.
> > >> >
> > >> > Given this I'm in agreement with Antoine that it would be weird for
> > it to
> > >> > be maintained within the DataFusion repo as opposed to it's own repo
> > (as
> > >> > we've done in the past for things like nanoarrow and
> > arrow-experiments).
> > >> >
> > >> > That said, wherever it ends up, there should be the agreement of
> > >> > individuals to accept maintenance of it. Since it's in rust, that
> > would
> > >> > generally fall to the arrow-rs contributors and/or the DataFusion
> > >> > contributors IMO.
> > >> >
> > >> > It would be good for it to be part of the community, but only if
> it's
> > not
> > >> > going to end up just bitrotting somewhere.
> > >> >
> > >> > --Matt
> > >> >
> > >> > On Fri, Jun 28, 2024, 8:49 PM Xuanwo <xua...@apache.org> wrote:
> > >> >
> > >> >> Hi,
> > >> >>
> > >> >> This UDF implementation doesn’t depend on DataFusion. It can work
> > with
> > >> any
> > >> >> data in the arrow format.
> > >> >>
> > >> >> It has the potential power to make users write ONE UDF function
> that
> > >> works
> > >> >> for different query engines as we showed up in databend and
> > risingwave.
> > >> >>
> > >> >> So I personally think it should be part of arrow community.
> > >> >>
> > >> >> On Sat, Jun 29, 2024, at 05:06, Antoine Pitrou wrote:
> > >> >> > Is this UDF implementation based on DataFusion? If so, it makes
> > sense
> > >> >> > for it to be part of the DataFusion project.
> > >> >> >
> > >> >> > OTOH, if it can work with any data in the Arrow format, then it
> > would
> > >> >> > sound weird to maintain it in the DataFusion repo IMHO.
> > >> >> >
> > >> >> > Regards
> > >> >> >
> > >> >> > Antoine.
> > >> >> >
> > >> >> >
> > >> >> > Le 28/06/2024 à 21:52, Andrew Lamb a écrit :
> > >> >> >> To be clear, if the arrow community thinks this would be better
> > >> >> organized /
> > >> >> >> administered in the Apache DataFusion project (especially if it
> is
> > >> >> aligned
> > >> >> >> with Rust) I think it would be good to discuss donating there
> > >> >> >>
> > >> >> >> On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb <
> al...@influxdata.com
> > >
> > >> >> wrote:
> > >> >> >>
> > >> >> >>> I think there are two aspects:
> > >> >> >>> 1. The actual mechanics of implementing functions
> > >> >> >>> 2. The actual library of udf functions (e.g. sin, cos, nullif,
> > etc)
> > >> >> >>>
> > >> >> >>> I agree 2 is not something that belongs naturally in the arrow
> > >> project
> > >> >> and
> > >> >> >>> is better aligned with query engines
> > >> >> >>>
> > >> >> >>> However I think 1 is worth considering.
> > >> >> >>>
> > >> >> >>> As I understand it, the problem arrow_udf solves is avoiding
> > some of
> > >> >> the
> > >> >> >>> boilerplate  required to make vectorized udfs. So instead of
> > >> writing a
> > >> >> >>> special eval_gcd function like this
> > >> >> >>>
> > >> >> >>> ```
> > >> >> >>> fn gcd(l: i64, r: i64) -> i64 {
> > >> >> >>>   // do gcd calculation
> > >> >> >>> }
> > >> >> >>>
> > >> >> >>> // implement vectorized version
> > >> >> >>> fn eval_gcd(left: &ArrayRef, right: &ArrayRef) -> ArrayRef {
> > >> >> >>>    let left = left.as_primitive<Int64Type>();
> > >> >> >>>    let right = right.as_primitive<Int64Type>();
> > >> >> >>>    res = binary(left, right, |l, r| gcd(l, r));
> > >> >> >>>    Arc::new(res)
> > >> >> >>> }
> > >> >> >>> ```
> > >> >> >>>
> > >> >> >>> The user simply annotates the scalar function and have the
> > library
> > >> code
> > >> >> >>> gen the array version
> > >> >> >>> ```
> > >> >> >>> #[function("gcd(int64, int64) -> int64", output = "eval_gcd")]
> > >> >> >>> fn gcd(l: i64, r: i64) -> i64 {
> > >> >> >>>   // do gcd calculation
> > >> >> >>> }
> > >> >> >>> ```
> > >> >> >>>
> > >> >> >>> We have a lot of boilerplate / non idea macro stuff in
> DataFusion
> > >> that
> > >> >> I
> > >> >> >>> think this would help a lot.
> > >> >> >>>
> > >> >> >>> Andrew
> > >> >> >>>
> > >> >> >>>
> > >> >> >>> On Fri, Jun 28, 2024 at 3:08 PM Raphael Taylor-Davies
> > >> >> >>> <r.taylordav...@googlemail.com.invalid> wrote:
> > >> >> >>>
> > >> >> >>>> I wonder if the DataFusion project might be a more natural
> home
> > for
> > >> >> this
> > >> >> >>>> functionality? UDFs are more of a query engine concept,
> whereas
> > >> >> arrow-rs is
> > >> >> >>>> more focused on purely physical execution?
> > >> >> >>>>
> > >> >> >>>> On 28 June 2024 19:41:39 BST, Runji Wang <
> wangrunji0...@163.com
> > >
> > >> >> wrote:
> > >> >> >>>>> Hi Felipe,
> > >> >> >>>>>
> > >> >> >>>>> Vectorization will be applied whenever possible. When all
> input
> > >> and
> > >> >> >>>> output types of a function are primitive (int16, int32, int64,
> > >> >> float32,
> > >> >> >>>> float64) and do not involve any Option or Result, the macro
> will
> > >> >> >>>> automatically generate code based on unary <
> > >> >> >>>> https://docs.rs/arrow/latest/arrow/compute/fn.unary.html> or
> > >> binary <
> > >> >> >>>> https://docs.rs/arrow/latest/arrow/compute/fn.binary.html>
> > >> kernels,
> > >> >> >>>> which potentially allows for vectorization.
> > >> >> >>>>>
> > >> >> >>>>> Both examples you showed are not vectorized. The `div`
> > function is
> > >> >> due
> > >> >> >>>> to the Result output, while `gcd` is due to the loop in its
> > >> >> implementation.
> > >> >> >>>> However, if the function is simple enough, like an `add`
> > function:
> > >> >> >>>>>
> > >> >> >>>>> #[function("add(int, int) -> int")]
> > >> >> >>>>> fn add(a: i32, b: i32) -> i32 {
> > >> >> >>>>>     a + b
> > >> >> >>>>> }
> > >> >> >>>>>
> > >> >> >>>>> It can be auto-vectorized by llvm.
> > >> >> >>>>>
> > >> >> >>>>> Runji
> > >> >> >>>>>
> > >> >> >>>>>
> > >> >> >>>>> On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote:
> > >> >> >>>>>> On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb <
> > >> al...@influxdata.com>
> > >> >> >>>> wrote:
> > >> >> >>>>>>>
> > >> >> >>>>>>> Hi Xuanwo,
> > >> >> >>>>>>>
> > >> >> >>>>>>> Sorry for the delay in responding. I think  the ability to
> > >> easily
> > >> >> >>>> write
> > >> >> >>>>>>> functions that "feel" like native functions in whatever
> > language
> > >> >> and
> > >> >> >>>> be
> > >> >> >>>>>>> able to generate arrow / vectorized versions of them is
> quite
> > >> >> >>>> valuable.
> > >> >> >>>>>>> This is my understanding of what this proposal is about.
> > >> >> >>>>>>
> > >> >> >>>>>> My understanding is that it's not vectorized. From the
> > examples
> > >> in
> > >> >> >>>>>> risingwavelabs/arrow-udf, <
> > >> >> https://github.com/risingwavelabs/arrow-udf>
> > >> >> >>>> it
> > >> >> >>>>>> looks like the macros generate code that gathers values from
> > >> columns
> > >> >> >>>> into
> > >> >> >>>>>> local scalars that are passed as scalar parameters to user
> > >> >> functions.
> > >> >> >>>> Is
> > >> >> >>>>>> the hope here that rustc/llvm will auto-vectorize the code?
> > >> >> >>>>>>
> > >> >> >>>>>> #[function("gcd(int, int) -> int")]
> > >> >> >>>>>> fn gcd(mut a: i32, mut b: i32) -> i32 {
> > >> >> >>>>>>      while b != 0 {
> > >> >> >>>>>>          (a, b) = (b, a % b);
> > >> >> >>>>>>      }
> > >> >> >>>>>>      a
> > >> >> >>>>>> }
> > >> >> >>>>>>
> > >> >> >>>>>> #[function("div(int, int) -> int")]
> > >> >> >>>>>> fn div(x: i32, y: i32) -> Result<i32, &'static str> {
> > >> >> >>>>>>      if y == 0 {
> > >> >> >>>>>>          return Err("division by zero");
> > >> >> >>>>>>      }
> > >> >> >>>>>>      Ok(x / y)
> > >> >> >>>>>> }
> > >> >> >>>>>>
> > >> >> >>>>>>> I left some additional comments on the markdown.
> > >> >> >>>>>>>
> > >> >> >>>>>>> One thing that might be worth doing is articulate some
> other
> > >> >> >>>> potential
> > >> >> >>>>>>> locations for where the code might go. One option, as I
> think
> > >> you
> > >> >> >>>> propose,
> > >> >> >>>>>>> is to make its own repository.  Another option could be to
> > >> donate
> > >> >> >>>> the code
> > >> >> >>>>>>> and put the various language bindings in the same repo as
> the
> > >> arrow
> > >> >> >>>>>>> language implementations (e.g arrow-rs, arrow for python,
> > etc)
> > >> >> which
> > >> >> >>>> would
> > >> >> >>>>>>> likely make it easier to maintain and discover.
> > >> >> >>>>>>>
> > >> >> >>>>>>> I am curious about what other devs / users feel about this?
> > >> >> >>>>>>>
> > >> >> >>>>>>> Andrew
> > >> >> >>>>>>>
> > >> >> >>>>>>>
> > >> >> >>>>>>>
> > >> >> >>>>>>> On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <xu...@apache.org>
> > >> wrote:
> > >> >> >>>>>>>
> > >> >> >>>>>>>> Hello, everyone.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> I start this thread to disscuss the donation of a
> > User-Defined
> > >> >> >>>> Function
> > >> >> >>>>>>>> Framework for Apache Arrow.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> Feel free to review and leave your comments here. For live
> > >> review,
> > >> >> >>>>>> please
> > >> >> >>>>>>>> visit:
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> https://hackmd.io/@xuanwo/apache-arrow-udf
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> The original content also pasted here for a quick reading:
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ------
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ## Abstract
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> Arrow UDF is a User-Defined Function Framework for Apache
> > >> Arrow.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ## Proposal
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> Arrow UDF allows user to easily create and run
> user-defined
> > >> >> >>>> functions
> > >> >> >>>>>>>> (UDF) in Rust, Python, Java or JavaScript based on Apache
> > >> Arrow.
> > >> >> >>>> The
> > >> >> >>>>>>>> functions can be executed natively, or in WebAssembly, or
> > in a
> > >> >> >>>> remote
> > >> >> >>>>>>>> server via Arrow Flight.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> Arrow UDF was originally designed to be used by the
> > RisingWave
> > >> >> >>>> project
> > >> >> >>>>>> but
> > >> >> >>>>>>>> is now being used by Databend and several database
> startups.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> We believe that the Arrow UDF project will provide
> diversity
> > >> value
> > >> >> >>>> to
> > >> >> >>>>>> the
> > >> >> >>>>>>>> entire Arrow community.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ## Background
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> Arrow UDF is being developed by an open-source community
> > from
> > >> day
> > >> >> >>>> one
> > >> >> >>>>>> and
> > >> >> >>>>>>>> is owned by RisingWaveLabs. The project has been launched
> in
> > >> >> >>>> December
> > >> >> >>>>>> 2023.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ## Initial Goals
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> By transferring ownership of the project to the Apache
> > Arrow,
> > >> >> >>>> Arrow UDF
> > >> >> >>>>>>>> expects to ensure its neutrality and further encourage and
> > >> >> >>>> facilitate
> > >> >> >>>>>> the
> > >> >> >>>>>>>> adoption of Arrow UDF by the community.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ## Current Status
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> Contributors: 5
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> Users:
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> -   [RisingWave]: A Distributed SQL Database for Stream
> > >> >> Processing.
> > >> >> >>>>>>>> -   [Databend]: An open-source cloud data warehouse that
> > >> serves as
> > >> >> >>>> a
> > >> >> >>>>>>>> cost-effective alternative to Snowflake.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ## Documentation
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> The document of Arrow UDF is hosted at
> > >> >> >>>>>>>> https://docs.rs/arrow-udf/latest/arrow_udf/.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ## Initial Source
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> The project currently holds a GitHub repository and
> multiple
> > >> >> >>>> packages:
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> Rust:
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> - https://crates.io/arrow-udf/
> > >> >> >>>>>>>> - https://crates.io/arrow-udf-python/
> > >> >> >>>>>>>> - https://crates.io/arrow-udf-js/
> > >> >> >>>>>>>> - https://crates.io/arrow-udf-js-deno/
> > >> >> >>>>>>>> - https://crates.io/arrow-udf-wasm/
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> Python:
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> - https://pypi.org/project/arrow-udf/
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> Those packge will retain its name, while the repository
> > will be
> > >> >> >>>> moved to
> > >> >> >>>>>>>> apache org.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ## Required Resources
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ### Mailing Lists
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> We can reuse the existing mailing lists that arrow have.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ### Git Repositories
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> From
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> To
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> - https://gitbox.apache.org/asf/repos/arrow-udf
> > >> >> >>>>>>>> - https://github.com/apache/arrow-udf
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ### Issue Tracking
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> The project would like to continue using GitHub Issues.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ### Other Resources
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> The project has already chosen GitHub actions as
> continuous
> > >> >> >>>> integration
> > >> >> >>>>>>>> tools.
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> ## Initial Committers
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> - Runji Wang wangrunji0...@163.com
> > >> >> >>>>>>>> - Giovanny Gutiérrez
> > >> >> >>>>>>>> - sundy-li sund...@apache.org
> > >> >> >>>>>>>> - Xuanwo xua...@apache.org
> > >> >> >>>>>>>> - Max Justus Spransy maxjus...@gmail.com
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> [RisingWave]:
> https://github.com/risingwavelabs/risingwave
> > >> >> >>>>>>>> [Databend]: https://github.com/datafuselabs/databend
> > >> >> >>>>>>>>
> > >> >> >>>>>>>> Xuanwo
> > >> >> >>>>>>>>
> > >> >> >>>>>>
> > >> >> >>>
> > >> >> >>>
> > >> >> >>
> > >> >>
> > >> >> --
> > >> >> Xuanwo
> > >> >>
> > >>
> > >> --
> > >> Xuanwo
> > >>
> >
> > --
> > Xuanwo
> >
> > https://xuanwo.io/
> >
>

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

Reply via email to