westonpace commented on code in PR #36673:
URL: https://github.com/apache/arrow/pull/36673#discussion_r1270883486

##########
python/pyarrow/_compute.pyx:
##########
@@ -2789,6 +2794,83 @@ def register_scalar_function(func, function_name, function_doc, in_types, out_ty
                                               out_type, func_registry)
 
 
+def register_vector_function(func, function_name, function_doc, in_types, out_type,
+                             func_registry=None):
+    """
+    Register a user-defined vector function.
+
+    This API is EXPERIMENTAL.
+
+    A vector function is a function that executes vector
+    operations on arrays. Unlike scalar function, vector

Review Comment:
   ```suggestion
       operations on arrays. Unlike scalar functions, a vector
   ```


##########
python/pyarrow/_compute.pyx:
##########
@@ -2789,6 +2794,83 @@ def register_scalar_function(func, function_name, function_doc, in_types, out_ty
                                               out_type, func_registry)
 
 
+def register_vector_function(func, function_name, function_doc, in_types, out_type,
+                             func_registry=None):
+    """
+    Register a user-defined vector function.
+
+    This API is EXPERIMENTAL.
+
+    A vector function is a function that executes vector
+    operations on arrays. Unlike scalar function, vector
+    function often has a grouping semantics and the output

Review Comment:
   What is "grouping semantics"?


##########
python/pyarrow/_compute.pyx:
##########
@@ -2789,6 +2794,83 @@ def register_scalar_function(func, function_name, function_doc, in_types, out_ty
                                               out_type, func_registry)
 
 
+def register_vector_function(func, function_name, function_doc, in_types, out_type,
+                             func_registry=None):
+    """
+    Register a user-defined vector function.
+
+    This API is EXPERIMENTAL.
+
+    A vector function is a function that executes vector
+    operations on arrays. Unlike scalar function, vector
+    function often has a grouping semantics and the output
+    for a row depends on other rows. A typical example
+    of vector function is "rank".

Review Comment:
   I'm not sure if we want rank to be listed as the prototypical "vector function" since it is often thought of as the prototypical "window function".
   `list_flatten` is maybe a better example of a function that is neither a scalar function nor a window function.

   I think I understand where this is coming from because I believe you are using the vector function type to do window function operations. However, I think we want to leave the door open for a future where all three exist. In other words:

   ProjectNode - Can only run "scalar" functions
   AggregateNode - Can only run "scalar aggregate" functions
   GroupByNode - Can only run "hash aggregate" functions
   WindowNode (does not exist) - Can only run "window" functions
   NonDecomposableAggregateNode (e.g. batch in / batch out) - Can run any kind of function

   So, today, I think you are working with some kind of batch-in/batch-out node that can run any function. As a result, you can use this to run window functions. So you need to add support for registering vector functions because you want to make sure these functions don't run in project/aggregate/groupby nodes.

   This is all fine. I'm just wondering if we can describe vector functions not as "a function whose output depends on other rows" (since this is the definition of window functions) but as "a function whose output may depend on other rows and whose output length does not need to be the same as the input length".

   I don't know if I'm explaining myself very well though. Another option might be to call these "window functions" with the caveat that we don't actually have a window node yet. Then, if you have some kind of batch-in/batch-out node you can still run window functions with it (since a batch-in/batch-out node can run pretty much anything).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
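
For readers following the thread, the distinction drawn in the review comment between scalar, window, and vector functions can be sketched in plain Python. This is only an illustration under stated assumptions: the three helpers below (`scalar_add_one`, `window_rank`, `vector_list_flatten`) are hypothetical names operating on Python lists, not pyarrow APIs, and they do not reflect how Arrow implements `rank` or `list_flatten`.

```python
from typing import List, Sequence


def scalar_add_one(values: Sequence[int]) -> List[int]:
    """Scalar-style: each output row depends only on the matching input row."""
    return [v + 1 for v in values]


def window_rank(values: Sequence[int]) -> List[int]:
    """Window-style: each output row depends on other rows, but the output
    length equals the input length. Ranks are 1-based; ties are broken by
    first occurrence (an assumption made for this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def vector_list_flatten(values: Sequence[Sequence[int]]) -> List[int]:
    """Vector-style: the output length need not match the input length."""
    return [item for sublist in values for item in sublist]


rows = [30, 10, 20]
nested = [[1, 2], [3, 4, 5]]

print(scalar_add_one(rows))         # same length, row-wise
print(window_rank(rows))            # same length, depends on other rows
print(vector_list_flatten(nested))  # 2 input rows, 5 output rows
```

In this sketch only the last helper changes the number of rows, which is the property the proposed wording ("whose output length does not need to be the same as the input length") would use to separate vector functions from window functions.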