Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

Xuanwo Fri, 28 Jun 2024 22:47:46 -0700

> That said, wherever it ends up, there should be the agreement of
> individuals to accept maintenance of it. Since it's in rust, that would
> generally fall to the arrow-rs contributors and/or the DataFusion
> contributors IMO.
>
> It would be good for it to be part of the community, but only if it's not
> going to end up just bitrotting somewhere.


Thanks Matt. This concern does make sense. 

Arrow UDF is extensively used within RisingWave and Databend. We, the initial 
committers from both RisingWave and Databend, are eager to take responsibility 
for maintaining these crates.

Additionally, some of us are involved in other Apache projects, so we understand
how the Apache Way functions. We will focus on community growth to ensure this
project remains active.

On Sat, Jun 29, 2024, at 13:29, Matt Topol wrote:
>> This UDF implementation doesn’t depend on DataFusion. It can work with
>> any data in the arrow format.
>
> Given this I'm in agreement with Antoine that it would be weird for it to
> be maintained within the DataFusion repo as opposed to its own repo (as
> we've done in the past for things like nanoarrow and arrow-experiments).
>
> That said, wherever it ends up, there should be the agreement of
> individuals to accept maintenance of it. Since it's in rust, that would
> generally fall to the arrow-rs contributors and/or the DataFusion
> contributors IMO.
>
> It would be good for it to be part of the community, but only if it's not
> going to end up just bitrotting somewhere.
>
> --Matt
>
> On Fri, Jun 28, 2024, 8:49 PM Xuanwo <[email protected]> wrote:
>
>> Hi,
>>
>> This UDF implementation doesn’t depend on DataFusion. It can work with any
>> data in the arrow format.
>>
>> It has the potential to let users write ONE UDF that works across
>> different query engines, as we have shown in Databend and RisingWave.
>>
>> So I personally think it should be part of the Arrow community.
>>
>> On Sat, Jun 29, 2024, at 05:06, Antoine Pitrou wrote:
>> > Is this UDF implementation based on DataFusion? If so, it makes sense
>> > for it to be part of the DataFusion project.
>> >
>> > OTOH, if it can work with any data in the Arrow format, then it would
>> > sound weird to maintain it in the DataFusion repo IMHO.
>> >
>> > Regards
>> >
>> > Antoine.
>> >
>> >
>> > Le 28/06/2024 à 21:52, Andrew Lamb a écrit :
>> >> To be clear, if the arrow community thinks this would be better organized /
>> >> administered in the Apache DataFusion project (especially if it is aligned
>> >> with Rust) I think it would be good to discuss donating there
>> >>
>> >> On Fri, Jun 28, 2024 at 3:17 PM Andrew Lamb <[email protected]> wrote:
>> >>
>> >>> I think there are two aspects:
>> >>> 1. The actual mechanics of implementing functions
>> >>> 2. The actual library of udf functions (e.g. sin, cos, nullif, etc)
>> >>>
>> >>> I agree 2 is not something that belongs naturally in the arrow project and
>> >>> is better aligned with query engines
>> >>>
>> >>> However I think 1 is worth considering.
>> >>>
>> >>> As I understand it, the problem arrow_udf solves is avoiding some of the
>> >>> boilerplate required to make vectorized udfs. So instead of writing a
>> >>> special eval_gcd function like this
>> >>>
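>> >>> ```
>> >>> use std::sync::Arc;
>> >>> use arrow::array::{ArrayRef, AsArray, Int64Array};
>> >>> use arrow::compute::binary;
>> >>> use arrow::datatypes::Int64Type;
>> >>> use arrow::error::ArrowError;
>> >>>
>> >>> fn gcd(l: i64, r: i64) -> i64 {
>> >>>   // do gcd calculation
>> >>> }
>> >>>
>> >>> // implement vectorized version
>> >>> fn eval_gcd(left: &ArrayRef, right: &ArrayRef) -> Result<ArrayRef, ArrowError> {
>> >>>    // downcast the dynamically typed arrays to Int64 arrays
>> >>>    let left = left.as_primitive::<Int64Type>();
>> >>>    let right = right.as_primitive::<Int64Type>();
>> >>>    // apply the scalar function element-wise
>> >>>    let res: Int64Array = binary(left, right, |l, r| gcd(l, r))?;
>> >>>    Ok(Arc::new(res))
>> >>> }
>> >>> ```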
>> >>>
>> >>> The user simply annotates the scalar function and has the library
>> >>> generate the array version
>> >>> ```
>> >>> #[function("gcd(int64, int64) -> int64", output = "eval_gcd")]
>> >>> fn gcd(l: i64, r: i64) -> i64 {
>> >>>   // do gcd calculation
>> >>> }
>> >>> ```
>> >>>
>> >>> We have a lot of boilerplate / non-ideal macro stuff in DataFusion that I
>> >>> think this would help with a lot.
>> >>>
>> >>> Andrew
>> >>>
>> >>>
>> >>> On Fri, Jun 28, 2024 at 3:08 PM Raphael Taylor-Davies
>> >>> <[email protected]> wrote:
>> >>>
>> >>>> I wonder if the DataFusion project might be a more natural home for this
>> >>>> functionality? UDFs are more of a query engine concept, whereas arrow-rs is
>> >>>> more focused on purely physical execution?
>> >>>>
>> >>>> On 28 June 2024 19:41:39 BST, Runji Wang <[email protected]> wrote:
>> >>>>> Hi Felipe,
>> >>>>>
>> >>>>> Vectorization will be applied whenever possible. When all input and
>> >>>>> output types of a function are primitive (int16, int32, int64, float32,
>> >>>>> float64) and do not involve any Option or Result, the macro will
>> >>>>> automatically generate code based on the unary
>> >>>>> <https://docs.rs/arrow/latest/arrow/compute/fn.unary.html> or binary
>> >>>>> <https://docs.rs/arrow/latest/arrow/compute/fn.binary.html> kernels,
>> >>>>> which potentially allows for vectorization.
>> >>>>>
>> >>>>> Neither of the examples you showed is vectorized: `div` because of its
>> >>>>> Result output, and `gcd` because of the loop in its implementation.
>> >>>>> However, if the function is simple enough, like an `add` function:
>> >>>>>
>> >>>>> #[function("add(int, int) -> int")]
>> >>>>> fn add(a: i32, b: i32) -> i32 {
>> >>>>>     a + b
>> >>>>> }
>> >>>>>
>> >>>>> it can be auto-vectorized by LLVM.
>> >>>>>
>> >>>>> Runji
>> >>>>>
>> >>>>>
>> >>>>> On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote:
>> >>>>>> On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>> Hi Xuanwo,
>> >>>>>>>
>> >>>>>>> Sorry for the delay in responding. I think the ability to easily write
>> >>>>>>> functions that "feel" like native functions in whatever language and be
>> >>>>>>> able to generate arrow / vectorized versions of them is quite valuable.
>> >>>>>>> This is my understanding of what this proposal is about.
>> >>>>>>
>> >>>>>> My understanding is that it's not vectorized. From the examples in
>> >>>>>> risingwavelabs/arrow-udf <https://github.com/risingwavelabs/arrow-udf>, it
>> >>>>>> looks like the macros generate code that gathers values from columns into
>> >>>>>> local scalars that are passed as scalar parameters to user functions. Is
>> >>>>>> the hope here that rustc/llvm will auto-vectorize the code?
>> >>>>>>
>> >>>>>> #[function("gcd(int, int) -> int")]
>> >>>>>> fn gcd(mut a: i32, mut b: i32) -> i32 {
>> >>>>>>      while b != 0 {
>> >>>>>>          (a, b) = (b, a % b);
>> >>>>>>      }
>> >>>>>>      a
>> >>>>>> }
>> >>>>>>
>> >>>>>> #[function("div(int, int) -> int")]
>> >>>>>> fn div(x: i32, y: i32) -> Result<i32, &'static str> {
>> >>>>>>      if y == 0 {
>> >>>>>>          return Err("division by zero");
>> >>>>>>      }
>> >>>>>>      Ok(x / y)
>> >>>>>> }
>> >>>>>>
>> >>>>>>> I left some additional comments on the markdown.
>> >>>>>>>
>> >>>>>>> One thing that might be worth doing is to articulate some other potential
>> >>>>>>> locations for where the code might go. One option, as I think you propose,
>> >>>>>>> is to make its own repository. Another option could be to donate the code
>> >>>>>>> and put the various language bindings in the same repo as the arrow
>> >>>>>>> language implementations (e.g. arrow-rs, arrow for python, etc.), which
>> >>>>>>> would likely make it easier to maintain and discover.
>> >>>>>>>
>> >>>>>>> I am curious about what other devs / users feel about this?
>> >>>>>>>
>> >>>>>>> Andrew
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Thu, Jun 20, 2024 at 3:04 AM Xuanwo <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>>> Hello, everyone.
>> >>>>>>>>
>> >>>>>>>> I am starting this thread to discuss the donation of a User-Defined
>> >>>>>>>> Function Framework for Apache Arrow.
>> >>>>>>>>
>> >>>>>>>> Feel free to review and leave your comments here. For live review,
>> >>>>>>>> please visit:
>> >>>>>>>>
>> >>>>>>>> https://hackmd.io/@xuanwo/apache-arrow-udf
>> >>>>>>>>
>> >>>>>>>> The original content is also pasted here for quick reading:
>> >>>>>>>>
>> >>>>>>>> ------
>> >>>>>>>>
>> >>>>>>>> ## Abstract
>> >>>>>>>>
>> >>>>>>>> Arrow UDF is a User-Defined Function Framework for Apache Arrow.
>> >>>>>>>>
>> >>>>>>>> ## Proposal
>> >>>>>>>>
>> >>>>>>>> Arrow UDF allows users to easily create and run user-defined functions
>> >>>>>>>> (UDFs) in Rust, Python, Java or JavaScript based on Apache Arrow. The
>> >>>>>>>> functions can be executed natively, in WebAssembly, or on a remote
>> >>>>>>>> server via Arrow Flight.
>> >>>>>>>>
>> >>>>>>>> Arrow UDF was originally designed to be used by the RisingWave project
>> >>>>>>>> but is now being used by Databend and several database startups.
>> >>>>>>>>
>> >>>>>>>> We believe that the Arrow UDF project will provide diversity value to
>> >>>>>>>> the entire Arrow community.
>> >>>>>>>>
>> >>>>>>>> ## Background
>> >>>>>>>>
>> >>>>>>>> Arrow UDF has been developed by an open-source community since day one
>> >>>>>>>> and is owned by RisingWaveLabs. The project was launched in December
>> >>>>>>>> 2023.
>> >>>>>>>>
>> >>>>>>>> ## Initial Goals
>> >>>>>>>>
>> >>>>>>>> By transferring ownership of the project to Apache Arrow, Arrow UDF
>> >>>>>>>> expects to ensure its neutrality and further encourage and facilitate
>> >>>>>>>> the adoption of Arrow UDF by the community.
>> >>>>>>>>
>> >>>>>>>> ## Current Status
>> >>>>>>>>
>> >>>>>>>> Contributors: 5
>> >>>>>>>>
>> >>>>>>>> Users:
>> >>>>>>>>
>> >>>>>>>> -   [RisingWave]: A Distributed SQL Database for Stream Processing.
>> >>>>>>>> -   [Databend]: An open-source cloud data warehouse that serves as a
>> >>>>>>>> cost-effective alternative to Snowflake.
>> >>>>>>>>
>> >>>>>>>> ## Documentation
>> >>>>>>>>
>> >>>>>>>> The documentation for Arrow UDF is hosted at
>> >>>>>>>> https://docs.rs/arrow-udf/latest/arrow_udf/.
>> >>>>>>>>
>> >>>>>>>> ## Initial Source
>> >>>>>>>>
>> >>>>>>>> The project currently holds a GitHub repository and multiple packages:
>> >>>>>>>>
>> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
>> >>>>>>>>
>> >>>>>>>> Rust:
>> >>>>>>>>
>> >>>>>>>> - https://crates.io/arrow-udf/
>> >>>>>>>> - https://crates.io/arrow-udf-python/
>> >>>>>>>> - https://crates.io/arrow-udf-js/
>> >>>>>>>> - https://crates.io/arrow-udf-js-deno/
>> >>>>>>>> - https://crates.io/arrow-udf-wasm/
>> >>>>>>>>
>> >>>>>>>> Python:
>> >>>>>>>>
>> >>>>>>>> - https://pypi.org/project/arrow-udf/
>> >>>>>>>>
>> >>>>>>>> Those packages will retain their names, while the repository will be
>> >>>>>>>> moved to the apache org.
>> >>>>>>>>
>> >>>>>>>> ## Required Resources
>> >>>>>>>>
>> >>>>>>>> ### Mailing Lists
>> >>>>>>>>
>> >>>>>>>> We can reuse the existing mailing lists that Arrow has.
>> >>>>>>>>
>> >>>>>>>> ### Git Repositories
>> >>>>>>>>
>> >>>>>>>> From
>> >>>>>>>>
>> >>>>>>>> - https://github.com/risingwavelabs/arrow-udf
>> >>>>>>>>
>> >>>>>>>> To
>> >>>>>>>>
>> >>>>>>>> - https://gitbox.apache.org/asf/repos/arrow-udf
>> >>>>>>>> - https://github.com/apache/arrow-udf
>> >>>>>>>>
>> >>>>>>>> ### Issue Tracking
>> >>>>>>>>
>> >>>>>>>> The project would like to continue using GitHub Issues.
>> >>>>>>>>
>> >>>>>>>> ### Other Resources
>> >>>>>>>>
>> >>>>>>>> The project has already chosen GitHub Actions as its continuous
>> >>>>>>>> integration tool.
>> >>>>>>>>
>> >>>>>>>> ## Initial Committers
>> >>>>>>>>
>> >>>>>>>> - Runji Wang [email protected]
>> >>>>>>>> - Giovanny Gutiérrez
>> >>>>>>>> - sundy-li [email protected]
>> >>>>>>>> - Xuanwo [email protected]
>> >>>>>>>> - Max Justus Spransy [email protected]
>> >>>>>>>>
>> >>>>>>>> [RisingWave]: https://github.com/risingwavelabs/risingwave
>> >>>>>>>> [Databend]: https://github.com/datafuselabs/databend
>> >>>>>>>>
>> >>>>>>>> Xuanwo
>> >>>>>>>>
>> >>>>>>
>> >>>
>> >>>
>> >>
>>
>> --
>> Xuanwo
>>

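On the vectorization question quoted above, here is a minimal std-only sketch of
the two shapes of code being discussed. It is not the code arrow-udf generates:
plain slices stand in for Arrow arrays, `eval_binary` and `eval_fallible` are
hypothetical names, and mapping errors to null is a simplifying assumption.

```rust
/// Infallible scalar function over primitives, like the quoted `add` example.
/// `wrapping_add` is used here so overflow never panics in debug builds.
fn add(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}

/// Fallible scalar function, like the quoted `div` example.
fn div(x: i32, y: i32) -> Result<i32, &'static str> {
    if y == 0 {
        return Err("division by zero");
    }
    Ok(x / y)
}

/// Element-wise form (hypothetical stand-in for a `binary`-style kernel):
/// a tight loop with no branches on Option or Result, which is the kind of
/// loop a compiler has a chance to auto-vectorize.
fn eval_binary(left: &[i32], right: &[i32], op: impl Fn(i32, i32) -> i32) -> Vec<i32> {
    assert_eq!(left.len(), right.len(), "columns must have equal length");
    left.iter().zip(right).map(|(&l, &r)| op(l, r)).collect()
}

/// Row-by-row form (hypothetical): each value is gathered into a local scalar
/// and passed to the user function. Null inputs and errors both become a null
/// output here, which is an assumption made for this sketch only.
fn eval_fallible(
    left: &[Option<i32>],
    right: &[Option<i32>],
    op: impl Fn(i32, i32) -> Result<i32, &'static str>,
) -> Vec<Option<i32>> {
    assert_eq!(left.len(), right.len(), "columns must have equal length");
    left.iter()
        .zip(right)
        .map(|(l, r)| match (l, r) {
            (Some(l), Some(r)) => op(*l, *r).ok(),
            _ => None,
        })
        .collect()
}
```

Calling `eval_binary(&[1, 2, 3], &[10, 20, 30], add)` exercises the first form,
and `eval_fallible` with `div` exercises the second; the per-row branching in
the second is what stands in the way of vectorization.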
-- 
Xuanwo
