Neal Richardson created ARROW-17437:
---------------------------------------
Summary: [R][C++] Scalar UDFs don't actually deal with scalars
Key: ARROW-17437
URL: https://issues.apache.org/jira/browse/ARROW-17437
Project: Apache Arrow
Issue Type: Bug
Components: C++, R
Reporter: Neal Richardson
Noted while testing out UDFs in R. I was wrapping a {{system()}} call in a UDF
to shell out and capture the stdout for each value in the data, but I ended up
getting the same result for every row. After some exploration, I figured out
that the UDF receives a whole vector of values rather than one value at a time,
so unless the R function is properly vectorized, you'll get unexpected results.
Here's an example that illustrates the problem:
{code}
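library(arrow)
library(dplyr)

# paste(..., collapse = ",") returns a single string for the whole input vector
register_scalar_function(
  "test",
  function(context, x) paste(x, collapse = ","),
  utf8(),
  utf8(),
  auto_convert = TRUE
)
Table$create(x = c("a", "b", "c")) |>
  transmute(test(x)) |>
  collect()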
# # A tibble: 3 × 1
# `test(x)`
# <chr>
# 1 a,b,c
# 2 a,b,c
# 3 a,b,c
{code}
Basically, the UDF receives the whole chunk of data at once and evaluates to a
single Scalar, which then gets recycled across all rows.
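For comparison, here is a minimal sketch of a UDF that returns one value per input element. It assumes that, with {{auto_convert = TRUE}}, the function receives the chunk as an R character vector. The names {{describe_one}}, {{describe_each}} and {{describe_each_udf}} are hypothetical, and {{describe_one}} is only a stand-in for the {{system()}} call.

```r
library(arrow)
library(dplyr)

# Hypothetical per-value helper, standing in for the system() call.
describe_one <- function(value) paste0("value=", value)

# The UDF gets the whole chunk as a vector, so apply the helper
# element-wise and return exactly one string per input element.
describe_each <- function(context, x) {
  vapply(x, describe_one, character(1), USE.NAMES = FALSE)
}

register_scalar_function(
  "describe_each_udf",
  describe_each,
  in_type = utf8(),
  out_type = utf8(),
  auto_convert = TRUE
)

result <- Table$create(x = c("a", "b", "c")) |>
  transmute(y = describe_each_udf(x)) |>
  collect()
```

Because the function returns a vector of the same length as its input, nothing should need to be recycled. This works around the behavior described above but does not change it.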