westonpace commented on code in PR #36673:
URL: https://github.com/apache/arrow/pull/36673#discussion_r1270883486
##########
python/pyarrow/_compute.pyx:
##########
@@ -2789,6 +2794,83 @@ def register_scalar_function(func, function_name,
function_doc, in_types, out_ty
out_type, func_registry)
+def register_vector_function(func, function_name, function_doc, in_types,
out_type,
+ func_registry=None):
+ """
+ Register a user-defined vector function.
+
+ This API is EXPERIMENTAL.
+
+ A vector function is a function that executes vector
+ operations on arrays. Unlike scalar function, vector
Review Comment:
```suggestion
operations on arrays. Unlike scalar functions, a vector
```
##########
python/pyarrow/_compute.pyx:
##########
@@ -2789,6 +2794,83 @@ def register_scalar_function(func, function_name,
function_doc, in_types, out_ty
out_type, func_registry)
+def register_vector_function(func, function_name, function_doc, in_types,
out_type,
+ func_registry=None):
+ """
+ Register a user-defined vector function.
+
+ This API is EXPERIMENTAL.
+
+ A vector function is a function that executes vector
+ operations on arrays. Unlike scalar function, vector
+ function often has a grouping semantics and the output
Review Comment:
What is "grouping semantics"?
##########
python/pyarrow/_compute.pyx:
##########
@@ -2789,6 +2794,83 @@ def register_scalar_function(func, function_name,
function_doc, in_types, out_ty
out_type, func_registry)
+def register_vector_function(func, function_name, function_doc, in_types,
out_type,
+ func_registry=None):
+ """
+ Register a user-defined vector function.
+
+ This API is EXPERIMENTAL.
+
+ A vector function is a function that executes vector
+ operations on arrays. Unlike scalar function, vector
+ function often has a grouping semantics and the output
+ for a row depends on other rows. A typical example
+ of vector function is "rank".
Review Comment:
I'm not sure if we want rank to be listed as the prototypical "vector
function" since it is often thought of as the prototypical "window function".
`list_flatten` is maybe a better example of a function that is neither a
scalar function nor a window function.
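
To make the distinction concrete, here is a rough pure-Python sketch of the three behaviours being discussed. Plain lists stand in for Arrow arrays, and the helpers `add_one`, `rank`, and `list_flatten` are illustrative stand-ins written for this comment, not the `pyarrow.compute` kernels of the same name.

```python
# Pure-Python sketch; plain lists stand in for Arrow arrays.
# All three helpers are illustrative stand-ins, not pyarrow kernels.

def add_one(values):
    """Scalar-style: each output row depends only on the same input row."""
    return [v + 1 for v in values]


def rank(values):
    """Window-style: same length as the input, but each row depends on others."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def list_flatten(lists):
    """Vector-style: the output length may differ from the input length."""
    return [item for sub in lists for item in sub]


scalar_out = add_one([10, 30, 20])     # one output per input row
rank_out = rank([10, 30, 20])          # same length, rows depend on each other
nested = [[1, 2], [3]]
flat_out = list_flatten(nested)        # 2 input rows, 3 output rows
```

The point of the third helper is that neither "one row in, one row out" nor "same length as the input" holds for it, which is what separates it from the other two.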
I think I understand where this is coming from because I believe you are
using the vector function type to do window function operations. However, I
think we want to leave the door open for a future where all three exist. In
other words:
- ProjectNode - Can only run "scalar" functions
- AggregateNode - Can only run "scalar aggregate" functions
- GroupByNode - Can only run "hash aggregate" functions
- WindowNode (does not exist) - Can only run "window" functions
- NonDecomposableAggregateNode (e.g. batch in / batch out) - Can run any kind of function
So, today, I think you are working with some kind of batch-in/batch-out node
that can run any function. As a result, you can use this to run window
functions. So you need to add support for registering vector functions because
you want to make sure these functions don't run in project/aggregate/groupby
nodes.
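
As a rough sketch of that restriction (assuming each registered function carries a "kind" tag; the table, the kind strings, and `can_run` are all hypothetical and are not Arrow's actual API), the node-to-function mapping above could be checked like this:

```python
# Hypothetical sketch of the node/function-kind rules listed above.
# None of these names come from Arrow; they only illustrate the idea.

ALL_KINDS = {"scalar", "scalar_aggregate", "hash_aggregate", "window", "vector"}

ALLOWED_KINDS = {
    "ProjectNode": {"scalar"},
    "AggregateNode": {"scalar_aggregate"},
    "GroupByNode": {"hash_aggregate"},
    "WindowNode": {"window"},  # this node does not exist today
    "NonDecomposableAggregateNode": ALL_KINDS,  # batch in / batch out
}


def can_run(node, kind):
    """Return True if a function of the given kind may run in the node."""
    return kind in ALLOWED_KINDS[node]


project_accepts_vector = can_run("ProjectNode", "vector")
batch_accepts_vector = can_run("NonDecomposableAggregateNode", "vector")
```

Under this sketch, registering a function as "vector" is what keeps it out of the project, aggregate, and group-by nodes while still letting a batch-in/batch-out node run it.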
This is all fine. I'm just wondering if we can describe vector functions
not as "a function whose output depends on other rows" (since this is the
definition of window functions) but as "a function whose output may depend on
other rows and whose output length does not need to be the same as the input
length".
I don't know if I'm explaining myself very well though.
Another option might be to call these "window functions" with the caveat
that we don't actually have a window node yet. Then, if you have some kind of
batch-in/batch-out node you can still run window functions with it (since a
batch-in/batch-out node can run pretty much anything).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.