icexelloss commented on code in PR #35514:
URL: https://github.com/apache/arrow/pull/35514#discussion_r1213365848


##########
python/pyarrow/conftest.py:
##########
@@ -278,3 +278,59 @@ def unary_function(ctx, x):
                                 {"array": pa.int64()},
                                 pa.int64())
     return unary_function, func_name
+
+
+@pytest.fixture(scope="session")
+def unary_agg_func_fixture():
+    """
+    Register a unary aggregate function
+    """
+    from pyarrow import compute as pc
+    import numpy as np
+
+    def func(ctx, x):
+        return pa.scalar(np.nanmean(x))
+
+    func_name = "y=avg(x)"
+    func_doc = {"summary": "y=avg(x)",
+                "description": "find mean of x"}
+
+    pc.register_aggregate_function(func,
+                                   func_name,
+                                   func_doc,
+                                   {
+                                       "x": pa.float64(),
+                                   },
+                                   pa.float64()
+                                   )
+    return func, func_name
+
+
+@pytest.fixture(scope="session")
+def varargs_agg_func_fixture():
+    """
+    Register a unary aggregate function
+    """
+    from pyarrow import compute as pc
+    import numpy as np
+
+    def func(ctx, *args):
+        sum = 0.0
+        for arg in args:
+            sum += np.nanmean(arg)
+        return pa.scalar(sum)
+
+    func_name = "y=sum_mean(x...)"
+    func_doc = {"summary": "Varargs aggregate",
+                "description": "Varargs aggregate"}
+
+    pc.register_aggregate_function(func,
+                                   func_name,
+                                   func_doc,
+                                   {
+                                       "x": pa.int64(),
+                                       "y": pa.float64()

Review Comment:
   Admittedly this is weird/confusing but here is why:
   
   This is not a truly "varargs" function, as it must take two 
arguments, x and y, with the specified types. "Varargs" here only refers to 
the Python signature of "y=sum_mean(x...)", which is defined as 
   
   ```
   def func(ctx, *args):
       sum = 0.0
       for arg in args:
           sum += np.nanmean(arg)
       return pa.scalar(sum)
   ```
   
   I added the test because this reflects how we use this internally:
   
   (1) A user would define a UDF and call it with some inputs, e.g.
   
   ```
   def foo(x, y):
       return x.mean() + y.mean()
   
   # This returns a lazy expression
   result = summarize_table(foo, t['x_col'], t['y_col'], by='time')
   ```
   (2) When executing the result expression, since the signature of `foo` 
does not match what Acero wants, we would wrap it
   
   ```
   def acero_foo(ctx, *args):
       return pa.scalar(foo(*args))
   ```
   
   So the function we register with Acero has known input types (two 
arguments with float64), but the wrapper function is defined with `*args`.
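
As a rough, standard-library-only sketch of that wrapping step (no pyarrow or 
Acero involved): `wrap_for_engine` is a hypothetical helper name, `fmean` stands 
in for the column means, and the explicit arity check is my assumption about 
where a mismatch would surface. In the real fixture the result would also be 
wrapped with `pa.scalar`, and the arity would be fixed by the input types given 
at registration.

```python
import inspect
from statistics import fmean


def foo(x, y):
    # Hypothetical user-defined function with a fixed two-argument signature.
    return fmean(x) + fmean(y)


def wrap_for_engine(udf):
    """Adapt udf(a, b, ...) to a (ctx, *args) calling convention."""
    arity = len(inspect.signature(udf).parameters)

    def wrapper(ctx, *args):
        # The wrapper's signature accepts any number of arguments, but the
        # wrapped function still only works with exactly `arity` of them.
        if len(args) != arity:
            raise TypeError(f"expected {arity} arguments, got {len(args)}")
        return udf(*args)

    return wrapper


# ctx is unused in this sketch, so None is passed in its place.
acero_foo = wrap_for_engine(foo)
result = acero_foo(None, [1.0, 2.0, 3.0], [10.0, 20.0])
```

The wrapper is "varargs" only in its Python signature; calling it with 
anything other than two inputs fails, which is the same situation the fixture 
is meant to cover.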
   
   Does that make sense?
   


