icexelloss commented on PR #36673: URL: https://github.com/apache/arrow/pull/36673#issuecomment-1647943407
> From a code perspective I don't see much to worry about here. Conceptually though I think we should think carefully about how we explain these concepts to users. I left a few comments about the wording. I wonder if we might also want something in the docs (can be a future PR).
>
> Basically, as a user, I think I will encounter this code and my first question will be "why are there two different kinds of UDFs?" and then "which one should I use?"
>
> Also, eventually I think there will be even more than two kinds of UDFs. So it would be nice, in the docs somewhere, to be able to say something like...
>
> ```
> We have several different categories of functions. For example, scalar functions,
> aggregate functions, window functions, and vector functions. These categories
> describe how a function should behave and control where a function can be used.
> The actual mechanics of a UDF are very similar regardless of the category of the
> function.
>
> Scalar Functions - These functions return 1 row for each input row. In addition,
> the calculation for each row should not depend on any other rows (e.g. function
> is stateless). As a result, the output for a row should be the same no matter
> what order the rows arrive. Most arithmetic functions (add, power, log) can be
> implemented as scalar functions. These functions can be used in project, filter,
> and batch-in/batch-out relations.
>
> Window Functions - These functions return 1 row for each input row. However,
> the output for a row is allowed to depend on other rows. These functions rely
> on a specific row order and are often executed in groups. For example, rank,
> first, and cumulative_sum are all window functions. These functions can be
> used in window and batch-in/batch-out relations.
>
> Vector Functions - These functions are not required to return 1 output row for
> each input row. For example, a "drop_nulls" function might return fewer rows
> than the input. A "list_flatten" function might return more rows than the input.
> These functions can be used in batch-in/batch-out relations.
>
> Scalar Aggregate Functions ...
> Hash Aggregate Functions ...
> ```

I think this is useful - where do you suggest I put this? Also do you prefer to combine the doc for "different function categories" in this change or separately?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org