[
https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497082#comment-17497082
]
Vibhatha Lakmal Abeykoon edited comment on ARROW-15765 at 2/24/22, 1:52 AM:
----------------------------------------------------------------------------
As [~westonpace] explained, we are working on a UDF PoC. At the moment, a
function is registered as follows:
{code:java}
import pyarrow as pa
from pyarrow import compute as pc
from pyarrow.compute import call_function, register_pyfunction
from pyarrow.compute import Arity, InputType

# Documentation metadata for the function
func_doc = {}
func_doc["summary"] = "summary"
func_doc["description"] = "desc"
func_doc["arg_names"] = ["number"]
func_doc["options_class"] = "SomeOptions"
func_doc["options_required"] = False

# Arity, name, and input/output types are stated explicitly
arity = Arity.unary()
func_name = "python_udf"
in_types = [InputType.array(pa.int64())]
out_type = pa.int64()

def simple_function(arrow_array):
    return call_function("add", [arrow_array, 1])

callback = simple_function
register_pyfunction(func_name, arity, func_doc, in_types, out_type, callback)

# Look up and call the registered function
func1 = pc.get_function(func_name)
a1 = pc.call_function(func_name, [pa.array([20])])
{code}
When registering the function, the user has to state the arity and the input
and output types of the UDF explicitly. We can ease this by taking all of that
information from the type hints themselves. This is only to improve usability.
For instance, the user would write the function like this:
{code:java}
def simple_function(arrow_array: pa.Int32Array) -> pa.Int32Array:
    return call_function("add", [arrow_array, 1]) {code}
When registering, the user would only write:
{code:java}
register_pyfunction(func_name, simple_function) {code}
We will extract the docs from comments, or let the user pass them (optional),
and obtain the arity and the input and output types by inspecting the function
signature. Spark already provides that support. When we go this route, we will
extract all the information from the UDF signature. At the moment I am using
the inspect API to extract that information.
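A minimal sketch of the signature inspection described above, using only the standard library. `Int32Array` is a stand-in class for `pa.Int32Array` so that the snippet runs without pyarrow, and `describe_udf` is a hypothetical helper name, not part of pyarrow or the PoC:

```python
import inspect
from typing import get_type_hints


class Int32Array:
    """Stand-in for pa.Int32Array so the sketch runs without pyarrow."""


def simple_function(arrow_array: Int32Array) -> Int32Array:
    return arrow_array


def describe_udf(func):
    """Return (arity, input annotations, output annotation) for func."""
    signature = inspect.signature(func)
    hints = get_type_hints(func)
    # The return hint is stored under the key "return"; every other
    # entry belongs to a parameter. Unannotated parameters yield None.
    out_hint = hints.get("return")
    in_hints = [hints.get(name) for name in signature.parameters]
    return len(signature.parameters), in_hints, out_hint


arity, in_hints, out_hint = describe_udf(simple_function)
```

Here the arity is simply the number of parameters in the signature; a real implementation would also have to decide how to treat defaults, `*args`, and missing annotations.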
The next step is to infer from the type hint `pa.Int32Array` that this is a
`pa.Array` of type `pa.int32()`. This is the objective of this exercise.
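Until such a feature exists, one possible workaround is an explicit lookup table. The sketch below is an assumption for illustration, not the proposed API: the table `ARRAY_CLASS_TO_FACTORY` and the helper `factory_name_for` are hypothetical names, and `Int32Array` is again a stand-in for `pa.Int32Array` so that the snippet runs without pyarrow:

```python
# Hypothetical lookup table: array class name -> name of the pyarrow
# factory function that builds the matching DataType.
ARRAY_CLASS_TO_FACTORY = {
    "Int32Array": "int32",
    "Int64Array": "int64",
    "DoubleArray": "float64",
    "StringArray": "string",
}


def factory_name_for(annotation):
    """Map an array class used as a type hint to a DataType factory name."""
    try:
        return ARRAY_CLASS_TO_FACTORY[annotation.__name__]
    except KeyError:
        raise TypeError(f"unsupported type hint: {annotation!r}") from None


class Int32Array:
    """Stand-in for pa.Int32Array so the sketch runs without pyarrow."""


factory_name = factory_name_for(Int32Array)
# With pyarrow imported as pa, the DataType would then be obtained with
# getattr(pa, factory_name)(), which for this hint is pa.int32().
```

A table like this has to be maintained by hand for every array class, which is the limitation this issue is meant to remove.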
[~apitrou] does this clear things up? Do you need more information on why
we need this feature?
> [Python] Extracting Type information from Python Objects
> --------------------------------------------------------
>
> Key: ARROW-15765
> URL: https://issues.apache.org/jira/browse/ARROW-15765
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Vibhatha Lakmal Abeykoon
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
>
> When creating user-defined functions, or in similar exercises where we want to
> extract the Arrow data types from the type hints, the existing Python API
> has some limitations.
> An example case is as follows:
> {code:java}
> def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array:
>     return pc.call_function("add", [array1, array2])
> {code}
> We want to extract the fact that array1 is a `pa.Array` of `pa.Int64Type`.
> At the moment there is no straightforward way to do this.
> So the idea is to expose this feature to Python.