vibhatha commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r982584769
##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+       "summary": "Calculates the greatest common divisor",
+       "description":
+           "Given 'x' and 'y' find the greatest number that divides\n"
+           "evenly into both x and y."
+   }
+
+   input_types = {
+       "x" : pa.int64(),
+       "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+       if isinstance(val, pa.Scalar):
+           return val.as_py()
Review Comment:
@jorisvandenbossche if we do that, we get the following error:
```bash
>>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_compute.pyx", line 2506, in pyarrow._compute._scalar_udf_callback
  File "<stdin>", line 4, in gcd_numpy
  File "/Users/vibhatha/venv/pyarrow_dev/lib/python3.10/site-packages/numpy/core/_internal.py", line 790, in _gcd
    a, b = b, a % b
TypeError: unsupported operand type(s) for %: 'pyarrow.lib.Int64Scalar' and 'int'
```
I think the reason is that NumPy does not recognize the Arrow scalar that is
passed in. We need to either take its Python value or convert it to a NumPy
value first. Calling NumPy directly on the wrapped Arrow scalar reproduces the
same failure:
```bash
>>> np.gcd(np.array(pa.scalar(27)), np.array([81, 12, 5]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/vibhatha/venv/pyarrow_dev/lib/python3.10/site-packages/numpy/core/_internal.py", line 790, in _gcd
    a, b = b, a % b
TypeError: unsupported operand type(s) for %: 'pyarrow.lib.Int64Scalar' and 'int'
```
But it works when the Arrow value is an array rather than a scalar:
```bash
>>> np.gcd(np.array(pa.array([27])), np.array([81, 12, 5]))
array([27, 3, 1])
```
It also works when the scalar is a plain Python integer:
```bash
>>> np.gcd(np.array(27), np.array([81, 12, 5]))
array([27, 3, 1])
```
Am I wrong here or missing something?