lidavidm commented on code in PR #12590:
URL: https://github.com/apache/arrow/pull/12590#discussion_r855285790


##########
python/pyarrow/_compute.pyx:
##########
@@ -2251,3 +2338,219 @@ cdef CExpression _bind(Expression filter, Schema schema) except *:
 
     return GetResultValue(filter.unwrap().Bind(
         deref(pyarrow_unwrap_schema(schema).get())))
+
+
+cdef class ScalarUdfContext:
+    """A container to hold user-defined-function related
+    entities. `batch_length` and `MemoryPool` are important
+    entities in defining functions which require these details. 
+
+    Example
+    -------
+
+    ScalarUdfContext is used with the scalar user-defined-functions. 
+    When defining such a function, the first parameter must be a
+    ScalarUdfContext object. This object can be used to hold important
+    information. This can be further enhanced depending on the use 
+    cases of user-defined-functions. 
+
+    >>> def random(context, one, two):
+            return pc.add(one, two, memory_pool=context.memory_pool)
+    """
+
+    def __init__(self):
+        raise TypeError("Do not call {}'s constructor directly"
+                        .format(self.__class__.__name__))
+
+    cdef void init(self, const CScalarUdfContext &c_context):
+        self.c_context = c_context
+
+    @property
+    def batch_length(self):
+        """
+        Returns the length of the batch associated with the
+        user-defined-function. Useful when the batch_length
+        is required to do computations specially when scalars
+        are parameters of the user-defined-function.
+
+        Returns
+        -------
+        batch_length : int
+            The number of batches used when calling 
+            user-defined-function. 
+        """
+        return self.c_context.batch_length
+
+    @property
+    def memory_pool(self):
+        """
+        Returns the MemoryPool associated with the 
+        user-defined-function. An already initialized
+        MemoryPool can be used within the
+        user-defined-function. 
+
+        Returns
+        -------
+        memory_pool : MemoryPool
+            MemoryPool is obtained from the KernelContext
+            and passed to the ScalarUdfContext.
+        """
+        return box_memory_pool(self.c_context.pool)
+
+
+cdef inline CFunctionDoc _make_function_doc(dict func_doc) except *:
+    """
+    Helper function to generate the FunctionDoc
+    This function accepts a dictionary and expect the 
+    summary(str), description(str) and arg_names(List[str]) keys. 
+    """
+    cdef:
+        CFunctionDoc f_doc
+        vector[c_string] c_arg_names
+
+    f_doc.summary = tobytes(func_doc["summary"])
+    f_doc.description = tobytes(func_doc["description"])
+    for arg_name in func_doc["arg_names"]:
+        c_arg_names.push_back(tobytes(arg_name))
+    f_doc.arg_names = c_arg_names
+    # UDFOptions integration:
+    # TODO: https://issues.apache.org/jira/browse/ARROW-16041
+    f_doc.options_class = tobytes("")
+    f_doc.options_required = False
+    return f_doc
+
+cdef _scalar_udf_callback(user_function, const CScalarUdfContext& c_context, inputs):
+    """
+    Helper callback function used to wrap the ScalarUdfContext from Python to 
C++
+    execution.
+    """
+    cdef ScalarUdfContext context = ScalarUdfContext.__new__(ScalarUdfContext)
+    context.init(c_context)
+    return user_function(context, *inputs)
+
+
+def register_scalar_function(func, func_name, function_doc, in_types,
+                             out_type):
+    """
+    Register a user-defined scalar function. 
+
+    A scalar function is a function that executes elementwise
+    operations on arrays or scalars, and therefore whose results
+    generally do not depend on the order of the values in the

Review Comment:
   It's not that results "generally" do not depend on the order. A scalar function _must_ be computed row-by-row with no state.
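
For context, here is a minimal sketch of such a stateless, row-by-row UDF, written against the API as it appears in this diff. The argument order of `register_scalar_function` and the `function_doc` keys are taken from the code above; exposing it as `pc.register_scalar_function`, the `{name: DataType}` form of `in_types`, and invoking the result through `pc.call_function` are assumptions about the rest of the PR, not something shown in this hunk.

```python
import pyarrow as pa
import pyarrow.compute as pc

# Stateless, elementwise UDF: each output row depends only on the input
# values of that row, never on row order or on state kept between calls.
def add_one(ctx, x):
    # Allocate from the pool supplied by the kernel context.
    return pc.add(x, 1, memory_pool=ctx.memory_pool)

# Keys match _make_function_doc() above:
# "summary" (str), "description" (str) and "arg_names" (List[str]).
doc = {
    "summary": "Add one to each element",
    "description": "Elementwise x + 1, computed row by row with no state.",
    "arg_names": ["x"],
}

# Assumed here: in_types is a {argument name: DataType} mapping.
pc.register_scalar_function(add_one, "add_one", doc,
                            {"x": pa.int64()}, pa.int64())

# If registration adds the UDF to the global function registry, it should
# be callable like any built-in kernel.
print(pc.call_function("add_one", [pa.array([1, 2, 3], type=pa.int64())]))
```

If `add_one` instead depended on, say, a running counter or the position of a row within the batch, it would no longer qualify as a scalar function in this sense.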



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
