westonpace commented on code in PR #36673:
URL: https://github.com/apache/arrow/pull/36673#discussion_r1270883486


##########
python/pyarrow/_compute.pyx:
##########
@@ -2789,6 +2794,83 @@ def register_scalar_function(func, function_name, function_doc, in_types, out_ty
                                            out_type, func_registry)
 
 
+def register_vector_function(func, function_name, function_doc, in_types, out_type,
+                             func_registry=None):
+    """
+    Register a user-defined vector function.
+
+    This API is EXPERIMENTAL.
+
+    A vector function is a function that executes vector
+    operations on arrays. Unlike scalar function, vector

Review Comment:
   ```suggestion
       operations on arrays. Unlike scalar functions, a vector
   ```



##########
python/pyarrow/_compute.pyx:
##########
@@ -2789,6 +2794,83 @@ def register_scalar_function(func, function_name, function_doc, in_types, out_ty
                                            out_type, func_registry)
 
 
+def register_vector_function(func, function_name, function_doc, in_types, out_type,
+                             func_registry=None):
+    """
+    Register a user-defined vector function.
+
+    This API is EXPERIMENTAL.
+
+    A vector function is a function that executes vector
+    operations on arrays. Unlike scalar function, vector
+    function often has a grouping semantics and the output

Review Comment:
   What is "grouping semantics"?



##########
python/pyarrow/_compute.pyx:
##########
@@ -2789,6 +2794,83 @@ def register_scalar_function(func, function_name, function_doc, in_types, out_ty
                                            out_type, func_registry)
 
 
+def register_vector_function(func, function_name, function_doc, in_types, out_type,
+                             func_registry=None):
+    """
+    Register a user-defined vector function.
+
+    This API is EXPERIMENTAL.
+
+    A vector function is a function that executes vector
+    operations on arrays. Unlike scalar function, vector
+    function often has a grouping semantics and the output
+    for a row depends on other rows. A typical example
+    of vector function is "rank".

Review Comment:
   I'm not sure we want rank listed as the prototypical "vector function", since it is often thought of as the prototypical "window function".
   
   `list_flatten` is maybe a better example of a function that is neither a scalar function nor a window function.
   
   I think I understand where this is coming from, because I believe you are using the vector function type to do window function operations.  However, I think we want to leave the door open for a future where all three exist.  In other words:
   
   ProjectNode - Can only run "scalar" functions
   AggregateNode - Can only run "scalar aggregate" functions
   GroupByNode - Can only run "hash aggregate" functions
   WindowNode (does not exist) - Can only run "window" functions
   NonDecomposableAggregateNode (e.g. batch in / batch out) - Can run any kind of function
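
The node list above can be read as a lookup table from node type to the kinds of function it may run. The sketch below is a plain-Python illustration only: `FunctionKind`, `ALLOWED_KINDS`, and `can_run` are hypothetical names made up for this example, not Arrow or Acero APIs, and `WindowNode` is included even though it does not exist yet.

```python
from enum import Enum, auto


class FunctionKind(Enum):
    """Function kinds named in the comment (hypothetical enum, not an Arrow type)."""
    SCALAR = auto()
    SCALAR_AGGREGATE = auto()
    HASH_AGGREGATE = auto()
    WINDOW = auto()
    VECTOR = auto()


# Hypothetical table mirroring the node list above.
ALLOWED_KINDS = {
    "ProjectNode": {FunctionKind.SCALAR},
    "AggregateNode": {FunctionKind.SCALAR_AGGREGATE},
    "GroupByNode": {FunctionKind.HASH_AGGREGATE},
    "WindowNode": {FunctionKind.WINDOW},  # does not exist yet
    "NonDecomposableAggregateNode": set(FunctionKind),  # batch in / batch out
}


def can_run(node_name, kind):
    """Return True if the named node may execute a function of this kind."""
    return kind in ALLOWED_KINDS[node_name]
```

Under this reading, registering a function as a vector function keeps it out of the project, aggregate, and group-by nodes while still letting a batch-in/batch-out node run it.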
   
   So, today, I think you are working with some kind of batch-in/batch-out node that can run any function.  As a result, you can use this to run window functions.  So you need to add support for registering vector functions because you want to make sure these functions don't run in project/aggregate/groupby nodes.
   
   This is all fine.  I'm just wondering if we can describe vector functions not as "a function whose output depends on other rows" (since this is the definition of window functions) but as "a function whose output may depend on other rows and whose output length does not need to be the same as the input length".
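
To make the three definitions concrete, here is a minimal sketch using plain Python lists. These are stand-ins written for illustration, not the pyarrow implementations: `add_one` is a hypothetical scalar function, and `rank` and `list_flatten` only borrow the names of the compute functions mentioned above.

```python
def add_one(values):
    # Scalar: each output row depends only on the matching input row.
    return [v + 1 for v in values]


def rank(values):
    # Window-like: output has the same length as the input,
    # but each output row depends on the other rows.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def list_flatten(lists):
    # Vector: output length does not need to match the input length.
    return [item for sub in lists for item in sub]
```

For example, `rank([30, 10, 20])` returns `[3, 1, 2]` (three rows in, three rows out), while `list_flatten([[1, 2], [3, 4, 5]])` returns `[1, 2, 3, 4, 5]` (two rows in, five rows out).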
   
   I don't know if I'm explaining myself very well though.
   
   Another option might be to call these "window functions" with the caveat that we don't actually have a window node yet.  Then, if you have some kind of batch-in/batch-out node you can still run window functions with it (since a batch-in/batch-out node can run pretty much anything).
   


