[
https://issues.apache.org/jira/browse/ARROW-15635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vibhatha Lakmal Abeykoon updated ARROW-15635:
---------------------------------------------
Description:
The objective is to list the tasks required to provide UDF support for the
Apache Arrow streaming execution engine. The first iteration focuses on
Python-based UDFs, i.e. allowing plain Python functions to be used as UDFs.
The UDF integration will be carried out as a series of sub-tasks covering
development and PoCs. Note that this is the first iteration of UDF integration
and its scope is deliberately limited. This ticket covers the following topics:
# PoC for UDF integration: evaluate the existing components in the source tree
and identify the modifications and new building blocks required to integrate
UDFs.
# The scope is limited to C++/Python: users can register a Python function as
a UDF and either use it through an `apply` method on Arrow Tables or invoke it
as a compute function via the arrow::compute API (see the first sketch after
this list). Note that the C++ API already provides a way to register custom
functions via the function registry API; at the moment this is not exposed to
Python.
# Planned features for this ticket are:
## Scalar UDFs: executed per value (per row)
## Vector UDFs: executed per batch (a full or partial array)
## Aggregate UDFs: associated with an aggregation operation
# Integration limitations
## Custom data types that do not have a NumPy or Pandas representation are not
supported (illustrated in the second sketch after this list).
## Complex processing with parallelism inside UDFs is not supported.
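Below is a minimal sketch of what Python-side registration and invocation of a
scalar UDF could look like. The entry point `register_scalar_function`, the UDF
signature (a context object followed by the input arrays), and the illustrative
names (`multiply_by_two`, the doc dictionary) are assumptions about how a
binding over the C++ function registry might be shaped; per this ticket, the
registry is not exposed to Python at the moment.
{code:python}
import pyarrow as pa
import pyarrow.compute as pc

# Assumed UDF shape: a context object (memory pool, batch length) followed by
# the input arrays; the body may call other compute functions.
def multiply_by_two(ctx, x):
    return pc.multiply(x, pa.scalar(2, type=pa.int64()))

# Assumed registration entry point over the C++ function registry.
pc.register_scalar_function(
    multiply_by_two,
    "multiply_by_two",
    {"summary": "Double each value",
     "description": "Multiplies the input column by two."},
    {"x": pa.int64()},   # input name -> type
    pa.int64(),          # output type
)

# Once registered, the UDF would be callable like any built-in compute function.
result = pc.call_function("multiply_by_two", [pa.array([1, 2, 3], type=pa.int64())])
print(result)  # [2, 4, 6]
{code}
Registering the UDF in the same registry as the built-in kernels is what would
let both the arrow::compute endpoint and a Table-level `apply` convenience sit
on top of a single execution path.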
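The first limitation follows from how per-batch UDFs are expected to process
data: the Arrow array is typically round-tripped through NumPy (or Pandas), so
a type without such a conversion cannot take this path. A minimal sketch, with
the hypothetical function `clip_outliers`:
{code:python}
import numpy as np
import pyarrow as pa

def clip_outliers(arr):
    # Works only because int64 arrays convert cleanly to NumPy; a custom
    # extension type without a NumPy/Pandas representation would fail here.
    values = arr.to_numpy(zero_copy_only=False)
    clipped = np.clip(values, 0, 100)
    return pa.array(clipped, type=arr.type)

batch = pa.array([-5, 42, 250], type=pa.int64())
print(clip_outliers(batch))  # values: 0, 42, 100
{code}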
> [C++][Python] UDF Integration
> ------------------------------
>
> Key: ARROW-15635
> URL: https://issues.apache.org/jira/browse/ARROW-15635
> Project: Apache Arrow
> Issue Type: Task
> Components: C++, Python
> Reporter: Vibhatha Lakmal Abeykoon
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)