[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497720#comment-17497720 ]
Weston Pace commented on ARROW-15765:
-------------------------------------
For a concrete use case, consider a user who wants to integrate some kind of
Arrow-native GeoJSON library. They would have extension types for GeoJSON data
types and custom functions that do things like normalize coordinates to a
different reference or format coordinates in a particular way. In this case
the UDFs would take in extension arrays for custom data types, which I think
would bring their own typing considerations.
Another possible example, which comes from the TPCx-BB benchmark, is doing
sentiment analysis on strings (is this user comment positive or negative?).
If we had an Arrow-native natural language processing library, we could hook
in an extract_sentiment operation that takes in strings and returns ? (maybe
doubles?).
As far as I know the type information itself is only used for validation and
casting purposes.
Another dimension to consider is whether a UDF would care if an array is
dictionary encoded or not. We probably want a way to express that too.
> [Python] Extracting Type information from Python Objects
> --------------------------------------------------------
>
> Key: ARROW-15765
> URL: https://issues.apache.org/jira/browse/ARROW-15765
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Vibhatha Lakmal Abeykoon
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
>
> When creating user defined functions or similar exercises where we want to
> extract the Arrow data types from the type hints, the existing Python API
> has some limitations.
> An example case is as follows:
> {code:java}
> import pyarrow as pa
> import pyarrow.compute as pc
>
> def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array:
>     return pc.call_function("add", [array1, array2])
> {code}
> We want to extract the fact that array1 is a `pa.Array` of `pa.Int64Type`.
> At the moment there doesn't exist a straightforward manner to get this done.
> So the idea is to expose this feature to Python.