[ https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029070#comment-17029070 ]
Joris Van den Bossche commented on ARROW-555:
---------------------------------------------
Do we already have a good idea of how we want to approach this?
I think there has been some discussion about implementing custom C++
kernels (similar to the existing kernels in the compute module) versus finding
a way to reuse the scalar kernels that are already implemented for Gandiva.
For reference: Gandiva already has several string functions implemented.
Here is an illustration using the Python interface for the "upper" function:
{code:python}
import pyarrow as pa
from pyarrow import gandiva

table = pa.table({'a': ['a', 'b', 'c']})

# Build the expression tree: upper(a) -> res
builder = gandiva.TreeExprBuilder()
node_a = builder.make_field(table.schema.field("a"))
node_upper = builder.make_function("upper", [node_a], pa.string())
field_result = pa.field('res', pa.string())
expr = builder.make_expression(node_upper, field_result)

# Compile a projector that evaluates the expression on record batches
projector = gandiva.make_projector(table.schema, [expr],
                                   pa.default_memory_pool())
>>> projector.evaluate(table.to_batches()[0])
[<pyarrow.lib.StringArray object at 0x7fc324f71580>
[
"A",
"B",
"C"
]]
{code}
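For comparison, here is a rough sketch of what the custom-kernel route involves, written in plain Python rather than C++ so it is easy to read. It is not Arrow's implementation: the helper names (`encode_strings`, `ascii_upper`, `decode_strings`) are hypothetical, the buffers only imitate the offsets-plus-data layout of a string array, and the transform is ASCII-only on purpose. The point it illustrates is that an ASCII case change never alters byte lengths, so such a kernel can reuse the offsets and make a single pass over the data buffer.

```python
from array import array


def encode_strings(values):
    """Pack Python strings into Arrow-style (offsets, data) buffers (hypothetical helper)."""
    offsets = array("i", [0])
    data = bytearray()
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))
    return offsets, data


def ascii_upper(offsets, data):
    """Uppercase ASCII letters in a single pass over the data buffer.

    Assumption: only ASCII 'a'..'z' are converted. Bytes of multi-byte
    UTF-8 sequences are all >= 0x80, so they are left untouched and the
    offsets remain valid without being recomputed.
    """
    out = bytearray(data)
    for i, byte in enumerate(out):
        if 0x61 <= byte <= 0x7A:  # 'a'..'z'
            out[i] = byte - 0x20
    return offsets, out


def decode_strings(offsets, data):
    """Unpack (offsets, data) buffers back into a list of Python strings."""
    return [
        bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")
        for i in range(len(offsets) - 1)
    ]


offsets, data = encode_strings(["a", "b", "c"])
print(decode_strings(*ascii_upper(offsets, data)))
```

A full Unicode-aware "upper" would be more involved, since some code points change byte length when their case changes and the offsets would then have to be rebuilt.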
> [C++] String algorithm library for StringArray/BinaryArray
> ----------------------------------------------------------
>
> Key: ARROW-555
> URL: https://issues.apache.org/jira/browse/ARROW-555
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Labels: Analytics
>
> This is a parent JIRA for starting a module for processing strings in-memory
> arranged in Arrow format. This will include using the re2 C++ regular
> expression library and other standard string manipulations (such as those
> found on Python's string objects).