icexelloss commented on PR #36673:
URL: https://github.com/apache/arrow/pull/36673#issuecomment-1647943407

   > From a code perspective I don't see much to worry about here. Conceptually though I think we should think carefully about how we explain these concepts to users. I left a few comments about the wording. I wonder if we might also want something in the docs (can be a future PR).
   >
   > Basically, as a user, I think I will encounter this code and my first question will be "why are there two different kinds of UDFs?" and then "which one should I use?"
   >
   > Also, eventually I think there will be even more than two kinds of UDFs. So it would be nice, in the docs somewhere, to be able to say something like...
   > 
   > ```
   > We have several different categories of functions.  For example, scalar functions,
   > aggregate functions, window functions, and vector functions.  These categories
   > describe how a function should behave and control where a function can be used.
   > The actual mechanics of a UDF are very similar regardless of the category of the
   > function.
   >
   > Scalar Functions - These functions return 1 row for each input row.  In addition,
   > the calculation for each row should not depend on any other rows (e.g. function
   > is stateless).  As a result, the output for a row should be the same no matter
   > what order the rows arrive.  Most arithmetic functions (add, power, log) can be
   > implemented as scalar functions.  These functions can be used in project, filter,
   > and batch-in/batch-out relations.
   >
   > Window Functions - These functions return 1 row for each input row.  However,
   > the output for a row is allowed to depend on other rows.  These functions rely
   > on a specific row order and are often executed in groups.  For example, rank,
   > first, and cumulative_sum are all window functions.  These functions can be
   > used in window and batch-in/batch-out relations.
   >
   > Vector Functions - These functions are not required to return 1 output row for
   > each input row.  For example, a "drop_nulls" function might return fewer rows
   > than the input.  A "list_flatten" function might return more rows than the input.
   > These functions can be used in batch-in/batch-out relations.
   > 
   > Scalar Aggregate Functions ...
   > Hash Aggregate Functions ...
   > ```
   
   I think this is useful. Where do you suggest I put it? Also, would you prefer that I add the docs for the different function categories in this PR or in a separate one?
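
For illustration, the row-count contracts in the quoted draft can be sketched with plain Python lists. This is only a toy model and does not use the Arrow UDF API: `add_one` is a hypothetical example, and the other names simply mirror the functions mentioned in the quote, reimplemented here as ordinary list functions.

```python
from itertools import accumulate


def add_one(column):
    """Scalar-style: one output row per input row, each computed independently."""
    return [None if value is None else value + 1 for value in column]


def cumulative_sum(column):
    """Window-style: one output row per input row, but each depends on earlier rows."""
    return list(accumulate(column))


def drop_nulls(column):
    """Vector-style: may return fewer rows than the input."""
    return [value for value in column if value is not None]


def list_flatten(column):
    """Vector-style: may return more (or fewer) rows than the input."""
    return [value for row in column for value in row]


if __name__ == "__main__":
    print(add_one([1, None, 3]))
    print(cumulative_sum([1, 2, 3]), cumulative_sum([3, 2, 1]))
    print(drop_nulls([1, None, 3]))
    print(list_flatten([[1, 2], [], [3]]))
```

Reordering the input only permutes the output of `add_one`, while `cumulative_sum` produces different values for a different row order, which is the distinction the draft draws between scalar and window functions.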


