[
https://issues.apache.org/jira/browse/ARROW-17437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580926#comment-17580926
]
Weston Pace commented on ARROW-17437:
-------------------------------------
Is this a documentation problem or are you expecting different behavior?
A "scalar function" is a stateless function that produces exactly one output
for each row. It is one categorization of functions alongside "aggregate
function" (a collection of functions which work together to produce a single
value for all rows), "hash aggregate function" (similar to aggregate, except
the first (maybe last?) column is the group id and it produces one value per
group id), "table function" (produces 0-N values for each row), "window
function" (I don't know the semantics here yet, but it gets to expect a certain
order to the input and is not stateless), etc.
I do not like the term "scalar function" because at no point does a "scalar"
(single-valued item) need to be involved. However, this appears to be a
somewhat established term in the literature:
https://docs.snowflake.com/en/sql-reference/functions.html
https://substrait.io/expressions/scalar_functions/
Each input item might be a scalar or it might be a vector, and the input items
could all be vectors. A UDF should really be ready to handle all cases. For
example, FOO(5, [1, 1, 1]) should yield the same thing as FOO([5, 5, 5], 1),
and the same thing as FOO([5, 5, 5], [1, 1, 1]).
It's actually impossible for all input items to be scalars, so a function
doesn't technically have to handle that case.
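This broadcasting rule can be sketched in plain Python (a sketch of the semantics only, not Arrow's implementation; `foo` and `broadcast_args` are made-up names for illustration):

```python
def broadcast_args(args, length):
    """Recycle any scalar argument into a list of `length` copies."""
    return [a if isinstance(a, list) else [a] * length for a in args]

def foo(x, y):
    # A hypothetical "scalar function": one output per row.
    # At least one argument is a vector, so this max() is well-defined.
    length = max(len(a) for a in (x, y) if isinstance(a, list))
    x, y = broadcast_args([x, y], length)
    return [xi + yi for xi, yi in zip(x, y)]

# All three call shapes yield the same result:
assert foo(5, [1, 1, 1]) == foo([5, 5, 5], 1) == foo([5, 5, 5], [1, 1, 1]) == [6, 6, 6]
```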
I suspect R's UDF integration could simplify things here. For example, it
could do something like (python pseudocode):
{noformat}
output = []
for row_idx in range(context.batch_length):
    row = []
    for arg in args:
        if is_scalar(arg):
            row.append(arg.value)
        else:
            row.append(arg.values[row_idx])
    output.append(user_function(row))
{noformat}
This would make it very easy to write functions that are pure R, but it would
kill performance if the goal was instead to link to some other "R library that
wraps C++ functions that work on Arrow data", so I think both cases will be
needed.
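The row-wise adapter above can be written out as runnable Python (a sketch under assumed types; `ScalarValue`, `ArrayValue`, and `apply_rowwise` are hypothetical stand-ins, not Arrow API):

```python
from dataclasses import dataclass

@dataclass
class ScalarValue:
    value: object

@dataclass
class ArrayValue:
    values: list

def apply_rowwise(user_function, args, batch_length):
    """Call `user_function` once per row, recycling scalar arguments."""
    output = []
    for row_idx in range(batch_length):
        row = [
            arg.value if isinstance(arg, ScalarValue) else arg.values[row_idx]
            for arg in args
        ]
        output.append(user_function(*row))
    return output

# Example: a paste-like concatenation applied per row, not per batch.
result = apply_rowwise(
    lambda x, y: f"{x},{y}",
    [ScalarValue("a"), ArrayValue(["1", "2", "3"])],
    batch_length=3,
)
# result == ["a,1", "a,2", "a,3"]
```

Per-row dispatch like this keeps the UDF author's mental model scalar-only, at the cost of one R call per row.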
> [R][C++] Scalar UDFs don't actually deal with scalars
> -----------------------------------------------------
>
> Key: ARROW-17437
> URL: https://issues.apache.org/jira/browse/ARROW-17437
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Reporter: Neal Richardson
> Priority: Major
>
> Noted while testing out UDFs in R. I was wrapping a {{system()}} call in a
> UDF to shell out and capture the stdout for each value in the data, but I
> ended up getting the same result for all rows. After some exploration, I
> figured out that the problem was that the data going into the UDF is actually
> a vector, so unless the R UDF function is properly vectorized, you'll get
> unexpected data.
> Here's an example that illustrates:
> {code}
> register_scalar_function(
>   "test",
>   function(context, x) paste(x, collapse = ","),
>   utf8(),
>   utf8(),
>   auto_convert = TRUE
> )
>
> Table$create(x = c("a", "b", "c")) |>
>   transmute(test(x)) |>
>   collect()
> # # A tibble: 3 × 1
> # `test(x)`
> # <chr>
> # 1 a,b,c
> # 2 a,b,c
> # 3 a,b,c
> {code}
> Basically, the UDF gets the chunk of data and evaluates to return a Scalar,
> which gets recycled for all rows.