jorisvandenbossche commented on code in PR #13687:
URL: https://github.com/apache/arrow/pull/13687#discussion_r978442938


##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)

Review Comment:
   Small nitpick: can you use 4-space indentation in the python snippets?



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()

Review Comment:
   Is this conversion to Scalar still needed? (with Wes's latest refactor, scalars might now be handled as length-1 arrays?)
   In any case, the `gcd_numpy` function itself won't work with scalars because of the `pa.array(..)` call in it:
   
   ```
   In [32]: gcd_numpy(None, pa.scalar(27), pa.scalar(63))
   ---------------------------------------------------------------------------
   TypeError                                 Traceback (most recent call last)
   <ipython-input-32-5dc8dd5d05b1> in <module>
   ----> 1 gcd_numpy(None, pa.scalar(27), pa.scalar(63))
   
   <ipython-input-26-1579a8ef575a> in gcd_numpy(ctx, x, y)
        22    np_x = to_np(x)
        23    np_y = to_np(y)
   ---> 24    return pa.array(np.gcd(np_x, np_y))
        25 pc.register_scalar_function(gcd_numpy,
        26                            function_name,
   
   ~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
   
   ~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
   
   ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
   
   TypeError: 'numpy.int64' object is not iterable
   ```
   
   
   
   



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,134 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above), which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
+and there will always be at least one input of type :class:`~pyarrow.Array`.
+The output should always be an :class:`~pyarrow.Array`.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   <pyarrow.Int64Scalar: 9>
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])
+   <pyarrow.lib.Int64Array object at 0x7fcfa0e7b100>
+   [
+     27,
+     3,
+     1
+   ]
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`.
+
+Consider an instance where the data is in a table and we want to compute
+the GCD of one column with the scalar value 30.  We will be re-using the
+"numpy_gcd" user-defined function that was created above:
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]}
+   >>> data_table = pa.Table.from_pydict(sample_data)
+   >>> dataset = ds.dataset(data_table)
+   >>> func_args = [pc.scalar(30), ds.field("value")]
+   >>> dataset.to_table(
+   ...             columns={
+   ...                 'gcd_value': ds.field('')._call("numpy_gcd", func_args),
+   ...                 'value': ds.field('value'),
+   ...                 'category': ds.field('category')
+   ...             })
+   pyarrow.Table
+   gcd_value: int64
+   value: int64
+   category: string
+   ----
+   gcd_value: [[30,30,3,3]]
+   value: [[90,630,1827,2709]]
+   category: [["A","B","C","D"]]
+
+Note that ``ds.field('')._call(...)`` returns a :class:`pyarrow.compute.Expression`.
+The arguments passed to this function call are expressions, not scalar values
+(notice the difference between :func:`pyarrow.scalar` and :func:`pyarrow.compute.scalar`;
+the latter produces an expression).
+This expression is evaluated when the projection operator executes it.
+
+Projection Expressions
+^^^^^^^^^^^^^^^^^^^^^^
+In the above example we used an expression to add a new column (``gcd_value``)

Review Comment:
   I think we should also mention somewhere (preferably toward the beginning of the new section) that UDFs are currently limited to scalar functions, and also explain what a scalar function is.
   That will also make it easier to refer to that concept here, to say that projections currently only support scalar functions.



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above), which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
+and there will always be at least one input of type :class:`~pyarrow.Array`.
+The output should always be a :class:`~pyarrow.Array`.

Review Comment:
   This is probably related to my comment above, and could indeed explain why the function would work with scalars (if only one of the two is not an array, numpy will still return an array, and the `pa.array(..)` call will not fail).
   
   However, adding a print statement for the type of argument in `to_np` and running this example again, I see:
   
   ```
   In [3]: pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(63)])
   <class 'pyarrow.lib.Int64Array'>
   <class 'pyarrow.lib.Int64Array'>
   Out[3]: <pyarrow.Int64Scalar: 9>
   ```
   
   So it seems that both arguments are converted to an array. If that is guaranteed to always be the case now, the above paragraph is outdated.
   
   



##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,133 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method
 passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+         "summary": "Calculates the greatest common divisor",
+         "description":
+            "Given 'x' and 'y' find the greatest number that divides\n"
+            "evenly into both x and y."
+   }
+
+   input_types = {
+      "x" : pa.int64(),
+      "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+      if isinstance(val, pa.Scalar):
+         return val.as_py()
+      else:
+         return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+      np_x = to_np(x)
+      np_y = to_np(y)
+      return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                              function_name,
+                              function_docs,
+                              input_types,
+                              output_type)
+   
+
+The implementation of a user-defined function always takes a first *context*
+parameter (named ``ctx`` in the example above), which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both :class:`~pyarrow.Scalar` and :class:`~pyarrow.Array`,
+and there will always be at least one input of type :class:`~pyarrow.Array`.
+The output should always be an :class:`~pyarrow.Array`.
+
+You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
+
+.. code-block:: python
+
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.scalar(63)])
+   <pyarrow.Int64Scalar: 9>
+   >>> pc.call_function("numpy_gcd", [pa.scalar(27), pa.array([81, 12, 5])])
+   <pyarrow.lib.Int64Array object at 0x7fcfa0e7b100>
+   [
+     27,
+     3,
+     1
+   ]
+
+Working with Datasets
+---------------------
+
+More generally, user-defined functions are usable everywhere a compute function
+can be referred to by its name. For example, they can be called on a dataset's
+column using :meth:`Expression._call`.
+
+Consider an instance where the data is in a table and we want to compute
+the GCD of one column with the scalar value 30.  We will be re-using the
+"numpy_gcd" user-defined function that was created above:
+
+.. code-block:: python
+
+   >>> import pyarrow.dataset as ds
   >>> sample_data = {'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]}
+   >>> data_table = pa.Table.from_pydict(sample_data)

Review Comment:
   ```suggestion
      >>> data_table = pa.table({'category': ['A', 'B', 'C', 'D'], 'value': [90, 630, 1827, 2709]})
   ```
   
   (a bit simpler to create the same table)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
