andygrove commented on pull request #7971:
URL: https://github.com/apache/arrow/pull/7971#issuecomment-675254823


   I took a very quick look at Spark just now and here are some observations:
   
   - math expressions such as sqrt always return double and don't try to
   optimize to smaller types
   - aggregate expressions min/max return the same type as their input
   - sum returns long, double, or decimal, depending on the input type
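   As a rough sketch of those Spark-style rules (using a simplified stand-in
   enum, not the real arrow::datatypes::DataType, and omitting decimal for
   brevity), the return-type logic might look like:

   ```rust
   // Simplified stand-in for Arrow's DataType, for illustration only;
   // the real arrow::datatypes::DataType has many more variants.
   #[derive(Clone, Copy, Debug, PartialEq)]
   enum DataType {
       Int8, Int16, Int32, Int64,
       UInt8, UInt16, UInt32, UInt64,
       Float32, Float64,
   }

   // sqrt always returns double, regardless of the input width.
   fn sqrt_return_type(_input: &DataType) -> DataType {
       DataType::Float64
   }

   // min/max return the same type as their input.
   fn min_max_return_type(input: &DataType) -> DataType {
       *input
   }

   // sum widens: integer inputs to Int64, floating-point inputs to Float64.
   // (Spark additionally handles decimal inputs; omitted here.)
   fn sum_return_type(input: &DataType) -> DataType {
       use DataType::*;
       match input {
           Int8 | Int16 | Int32 | Int64 | UInt8 | UInt16 | UInt32 | UInt64 => Int64,
           Float32 | Float64 => Float64,
       }
   }

   fn main() {
       assert_eq!(sqrt_return_type(&DataType::Int32), DataType::Float64);
       assert_eq!(min_max_return_type(&DataType::UInt8), DataType::UInt8);
       assert_eq!(sum_return_type(&DataType::Int16), DataType::Int64);
   }
   ```

   The key point is that all three rules are functions of the input type,
   which is what makes them input-dependent in the sense discussed below.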
   
   
   
   
   On Mon, Aug 17, 2020 at 10:42 PM Andy Grove <andygrov...@gmail.com> wrote:
   
   > When faced with choices like this, it is often helpful to look at how
   > other projects implement this. Perhaps we could look at calcite or spark to
   > see what choices they made? I am more familiar with spark at this point so
   > could research the approach used there.
   >
   > On Mon, Aug 17, 2020, 9:59 PM Jorge Leitao <notificati...@github.com>
   > wrote:
   >
   >> *@jorgecarleitao* commented on this pull request.
   >> ------------------------------
   >>
   >> In rust/datafusion/src/execution/physical_plan/udf.rs
   >> <https://github.com/apache/arrow/pull/7971#discussion_r471899766>:
   >>
   >> > +
   >> +It is the developer of the function's responsibility to ensure that the aggregator correctly handles the different
   >> +types that are presented to them, and that the return type correctly matches the type returned by the
   >> +aggregator.
   >> +
   >> +It is the user of the function's responsibility to pass arguments to the function that have valid types.
   >> +*/
   >> +#[derive(Clone)]
   >> +pub struct AggregateFunction {
   >> +    /// Function name
   >> +    pub name: String,
   >> +    /// A list of arguments and their respective types. A function can accept more than one type as argument
   >> +    /// (e.g. sum(i8), sum(u8)).
   >> +    pub arg_types: Vec<Vec<DataType>>,
   >> +    /// Return type. This function takes
   >> +    pub return_type: ReturnType,
   >>
   >> This change is under discussion on the mailing list.
   >>
   >> Essentially, the question is whether we should allow UDFs to have an
   >> input-dependent return type or not (i.e. whether this should be a
   >> function of the input types or a plain DataType).
   >>
   >> If we decide not to accept input-dependent types, then UDFs are simpler
   >> (multiple input types, single output type), but we can't rewrite our
   >> aggregates as UDFs.
   >>
   >> If we decide to accept input-dependent types, then UDFs are more complex
   >> (multiple input types, multiple output types), and we can unify them
   >> all under a single interface.
   >>
   >> We can also do something in the middle, in which we declare an interface
   >> for functions on our end that supports (multiple input types, multiple
   >> output types), but only expose public interfaces to register (multiple
   >> input types, single output type) UDFs.
   >>
   >> —
   >> You are receiving this because you were mentioned.
   >> Reply to this email directly, view it on GitHub
   >> <https://github.com/apache/arrow/pull/7971#pullrequestreview-468974772>,
   >> or unsubscribe
   >> 
<https://github.com/notifications/unsubscribe-auth/AAHEBRBWO7BL54QSCQ7DPWDSBH4DZANCNFSM4QAJVXOA>
   >> .
   >>
   >
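   The middle-ground option above could be sketched roughly as follows. All
   names here (DataType, ReturnType, AggregateFunction, new_udf) are
   simplified, hypothetical stand-ins for illustration, not the actual
   DataFusion API under discussion:

   ```rust
   // Simplified stand-in for Arrow's DataType, for illustration only.
   #[derive(Clone, Copy, Debug, PartialEq)]
   enum DataType { Int64, Float64, Utf8 }

   // A return type that is either fixed or computed from the argument types.
   #[derive(Clone)]
   enum ReturnType {
       Fixed(DataType),
       DependsOnInput(fn(&[DataType]) -> DataType),
   }

   struct AggregateFunction {
       name: String,
       return_type: ReturnType,
   }

   impl AggregateFunction {
       // Internally, built-in aggregates and UDFs resolve their return type
       // through the same general interface.
       fn resolve_return_type(&self, args: &[DataType]) -> DataType {
           match &self.return_type {
               ReturnType::Fixed(t) => *t,
               ReturnType::DependsOnInput(f) => f(args),
           }
       }

       // The "middle ground": the public registration API only accepts a
       // single fixed output type, so user-facing UDFs stay simple.
       fn new_udf(name: &str, return_type: DataType) -> Self {
           AggregateFunction {
               name: name.to_string(),
               return_type: ReturnType::Fixed(return_type),
           }
       }
   }

   fn main() {
       // Built-in "sum": output depends on input, like Spark's rule.
       let sum = AggregateFunction {
           name: "sum".to_string(),
           return_type: ReturnType::DependsOnInput(|args| match args[0] {
               DataType::Float64 => DataType::Float64,
               _ => DataType::Int64,
           }),
       };
       // User-registered UDF: fixed output type only.
       let my_udf = AggregateFunction::new_udf("my_agg", DataType::Float64);

       assert_eq!(sum.resolve_return_type(&[DataType::Float64]), DataType::Float64);
       assert_eq!(my_udf.resolve_return_type(&[DataType::Int64]), DataType::Float64);
   }
   ```

   The internal representation stays general enough to express the built-in
   aggregates, while registration stays (multiple input types, single output
   type) as proposed.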
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

