[
https://issues.apache.org/jira/browse/ARROW-17437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580926#comment-17580926
]
Weston Pace commented on ARROW-17437:
-------------------------------------
Is this a documentation problem or are you expecting different behavior?
A "scalar function" is a stateless function that produces exactly one output
for each row. It is one categorization of functions alongside "aggregate
function" (a collection of functions which work together to produce a single
value for all rows), "hash aggregate function" (similar to aggregate, except
the first (maybe last?) column is the group id and it produces one value per
group id), "table function" (produces 0-N values for each row), "window
function" (I don't know the semantics here yet, but it gets to expect a certain
order to the input and is not stateless), etc.
I do not like the term "scalar function" because at no point does a "scalar"
(single-valued item) need to be involved. However, this appears to be a
somewhat established term in the literature:
https://docs.snowflake.com/en/sql-reference/functions.html
https://substrait.io/expressions/scalar_functions/
Each input item might be a scalar or it might be a vector, and the input items
could all be vectors. A UDF should really be ready to handle all cases. For
example, FOO(5, [1, 1, 1]) should yield the same thing as FOO([5, 5, 5], 1),
and the same thing as FOO([5, 5, 5], [1, 1, 1]).
It's actually impossible for all input items to be scalars, so a function
doesn't technically have to handle that case.
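This broadcasting rule can be sketched in plain Python (a sketch of the semantics only, not Arrow's implementation; `foo` and `broadcast_args` are made-up names for illustration):

```python
def broadcast_args(args, length):
    """Recycle any scalar argument into a list of `length` copies."""
    return [a if isinstance(a, list) else [a] * length for a in args]

def foo(x, y):
    # A hypothetical "scalar function": one output per row.
    # At least one argument is a vector, so this max() is well-defined.
    length = max(len(a) for a in (x, y) if isinstance(a, list))
    x, y = broadcast_args([x, y], length)
    return [xi + yi for xi, yi in zip(x, y)]

# All three call shapes yield the same result:
assert foo(5, [1, 1, 1]) == foo([5, 5, 5], 1) == foo([5, 5, 5], [1, 1, 1]) == [6, 6, 6]
```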
I suspect R's UDF integration could simplify things here. For example, it
could do something like (python pseudocode):
{noformat}
output = []
for row_idx in range(context.batch_length):
    row = []
    for arg in args:
        if is_scalar(arg):
            row.append(arg.value)
        else:
            row.append(arg.values[row_idx])
    output.append(user_function(row))
{noformat}
This would make it very easy to write functions that are pure R, but it would
kill performance if the goal was instead to link to some other "R library that
wraps C++ functions that work on Arrow data", so I think both cases will be
needed.
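The row-wise adapter above can be written out as runnable Python (a sketch under assumed types; `ScalarValue`, `ArrayValue`, and `apply_rowwise` are hypothetical stand-ins, not Arrow API):

```python
from dataclasses import dataclass

@dataclass
class ScalarValue:
    value: object

@dataclass
class ArrayValue:
    values: list

def apply_rowwise(user_function, args, batch_length):
    """Call `user_function` once per row, recycling scalar arguments."""
    output = []
    for row_idx in range(batch_length):
        row = [
            arg.value if isinstance(arg, ScalarValue) else arg.values[row_idx]
            for arg in args
        ]
        output.append(user_function(*row))
    return output

# Example: a paste-like concatenation applied per row, not per batch.
result = apply_rowwise(
    lambda x, y: f"{x},{y}",
    [ScalarValue("a"), ArrayValue(["1", "2", "3"])],
    batch_length=3,
)
# result == ["a,1", "a,2", "a,3"]
```

Per-row dispatch like this keeps the UDF author's mental model scalar-only, at the cost of one R call per row.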
> [R][C++] Scalar UDFs don't actually deal with scalars
> -----------------------------------------------------
>
> Key: ARROW-17437
> URL: https://issues.apache.org/jira/browse/ARROW-17437
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Reporter: Neal Richardson
> Priority: Major
>
> Noted while testing out UDFs in R. I was wrapping a {{system()}} call in a
> UDF to shell out and capture the stdout for each value in the data, but I
> ended up getting the same result for all rows. After some exploration, I
> figured out that the problem was that the data going into the UDF is actually
> a vector, so unless the R UDF function is properly vectorized, you'll get
> unexpected data.
> Here's an example that illustrates:
> {code}
> register_scalar_function(
>   "test",
>   function(context, x) paste(x, collapse = ","),
>   utf8(),
>   utf8(),
>   auto_convert = TRUE
> )
>
> Table$create(x = c("a", "b", "c")) |>
>   transmute(test(x)) |>
>   collect()
> # # A tibble: 3 × 1
> # `test(x)`
> # <chr>
> # 1 a,b,c
> # 2 a,b,c
> # 3 a,b,c
> {code}
> Basically, the UDF gets the chunk of data and evaluates to return a Scalar,
> which gets recycled for all rows.