vibhatha commented on code in PR #13687: URL: https://github.com/apache/arrow/pull/13687#discussion_r977730323
##########
docs/source/python/compute.rst:
##########
@@ -370,3 +370,136 @@ our ``even_filter`` with a ``pc.field("nums") > 5`` filter:
 :class:`.Dataset` currently can be filtered using :meth:`.Dataset.to_table` method passing a ``filter`` argument. See :ref:`py-filter-dataset` in Dataset documentation.
+
+
+User-Defined Functions
+======================
+
+.. warning::
+   This API is **experimental**.
+   Also, only scalar functions can currently be user-defined.
+
+PyArrow allows defining and registering custom compute functions.
+These functions can then be called from Python as well as C++ (and potentially
+any other implementation wrapping Arrow C++, such as the R ``arrow`` package)
+using their registered function name.
+
+To register a UDF, a function name, function docs, input types and
+output type need to be defined. Using :func:`pyarrow.compute.register_scalar_function`,
+
+.. code-block:: python
+
+   import numpy as np
+
+   import pyarrow as pa
+   import pyarrow.compute as pc
+
+   function_name = "numpy_gcd"
+   function_docs = {
+       "summary": "Calculates the greatest common divisor",
+       "description":
+           "Given 'x' and 'y' find the greatest number that divides\n"
+           "evenly into both x and y."
+   }
+
+   input_types = {
+       "x" : pa.int64(),
+       "y" : pa.int64()
+   }
+
+   output_type = pa.int64()
+
+   def to_np(val):
+       if isinstance(val, pa.Scalar):
+           return val.as_py()
+       else:
+           return np.array(val)
+
+   def gcd_numpy(ctx, x, y):
+       np_x = to_np(x)
+       np_y = to_np(y)
+       return pa.array(np.gcd(np_x, np_y))
+
+   pc.register_scalar_function(gcd_numpy,
+                               function_name,
+                               function_docs,
+                               input_types,
+                               output_type)
+
+
+The implementation of a user-defined function always takes first *context*
+parameter (named ``ctx`` in the example above) which is an instance of
+:class:`pyarrow.compute.ScalarUdfContext`.
+This context exposes several useful attributes, particularly a
+:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+allocations in the context of the user-defined function.
+
+PyArrow UDFs accept input types of both scalar and array. Also it can have
+any combination of these types. It is important that the UDF author ensures
+the UDF can handle such combinations correctly. Also the ability to use UDFs
+with existing data processing libraries is very useful.
+
+Note that there is a helper function `to_np` to handle the conversion

Review Comment:
   I am removing the section after `Note that there is ...`