[ 
https://issues.apache.org/jira/browse/ARROW-17446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-17446:
---------------------------------
    Description: 
Currently, if an R expression is not entirely supported by the Arrow compute 
engine, the entire input is pulled into memory for native R to operate on. 
It would be possible to instead add a custom compute function to the 
registry (inside {{R_init_arrow}}, probably) which evaluates any 
subexpressions that couldn't be translated to native Arrow compute expressions.

This would, for example, allow a filter expression including a call to an R 
function {{baz}} to evaluate on a dataset larger than memory, with predicate 
and projection pushdown working as normal for the subexpressions which *are* 
translatable. The resulting expression might look something like this in C++:

{code}
call("and_kleene", {
  call("greater", {field_ref("a"), scalar(1)}),
  call("r_expr", {field_ref("b")},
      /*options=*/RexprOptions{cpp11::function(baz_sexp)}),
});
{code}

In this case, although the "r_expr" function is opaque to compute and datasets, 
we would still recognize that only fields "a" and "b" need to be materialized. 
Furthermore, the first member of the filter's conjunction is {{a > 1}}, which 
*is* translatable and could be used for predicate pushdown, for checking 
against Parquet statistics, etc.
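
To make the projection and pushdown reasoning above concrete, here is a 
minimal, self-contained sketch. It deliberately does not use the Arrow C++ 
API: {{Expr}}, {{CollectFields}}, {{IsNative}}, {{Split}} and 
{{ExampleFilter}} are hypothetical stand-ins invented for illustration. The 
only thing carried over from the description is the assumption that the 
opaque call is named "r_expr".

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a compute expression. This is NOT the Arrow API.
struct Expr {
  enum Kind { kField, kLiteral, kCall } kind;
  std::string name;        // field name, literal text, or function name
  std::vector<Expr> args;  // arguments (only used for calls)
};

Expr Field(std::string n) { return Expr{Expr::kField, std::move(n), {}}; }
Expr Literal(std::string v) { return Expr{Expr::kLiteral, std::move(v), {}}; }
Expr Call(std::string fn, std::vector<Expr> args) {
  return Expr{Expr::kCall, std::move(fn), std::move(args)};
}

// Collect every referenced field, including the arguments of opaque calls:
// the body of "r_expr" is hidden, but its inputs are not.
void CollectFields(const Expr& e, std::set<std::string>* out) {
  if (e.kind == Expr::kField) out->insert(e.name);
  for (const Expr& a : e.args) CollectFields(a, out);
}

// An expression is "native" if no opaque "r_expr" call appears inside it.
bool IsNative(const Expr& e) {
  if (e.kind == Expr::kCall && e.name == "r_expr") return false;
  for (const Expr& a : e.args) {
    if (!IsNative(a)) return false;
  }
  return true;
}

// Flatten nested "and_kleene" calls into the members of the conjunction.
void Flatten(const Expr& e, std::vector<Expr>* members) {
  if (e.kind == Expr::kCall && e.name == "and_kleene") {
    for (const Expr& a : e.args) Flatten(a, members);
  } else {
    members->push_back(e);
  }
}

struct SplitFilter {
  std::vector<Expr> pushdown;  // native members, usable against statistics
  std::vector<Expr> needs_r;   // members that must call back into R
};

// Partition the conjunction into native and R-dependent members.
SplitFilter Split(const Expr& filter) {
  std::vector<Expr> members;
  Flatten(filter, &members);
  SplitFilter out;
  for (Expr& m : members) {
    (IsNative(m) ? out.pushdown : out.needs_r).push_back(std::move(m));
  }
  return out;
}

// The filter from the description: a > 1 AND r_expr(b).
Expr ExampleFilter() {
  return Call("and_kleene", {Call("greater", {Field("a"), Literal("1")}),
                             Call("r_expr", {Field("b")})});
}
```

For {{ExampleFilter()}}, {{CollectFields}} yields the fields "a" and "b", 
and {{Split}} places the "greater" call in {{pushdown}} and the "r_expr" 
call in {{needs_r}}.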

Since R is not multithreaded, the compute function would need to take a global 
lock to ensure only a single thread of R execution. This would also block the 
interpreter, so it's not a high-performance solution... but it *would* block 
the interpreter less than doing everything in pure native R (since at least 
*some* of the work could be offloaded to worker threads and we could take 
advantage of batched input). Still, it seems like a worthwhile option to 
consider.
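
A rough sketch of the locking scheme described above, under stated 
assumptions: {{g_r_mutex}}, {{RCallback}} and {{FilterBatches}} are 
hypothetical names, a {{std::function}} stands in for the R closure, and 
batches are plain vectors of integers. A real implementation would have to 
call into the R interpreter and deal with whatever threading constraints R 
imposes, which this sketch does not attempt.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using Batch = std::vector<int>;
// Stand-in for an R closure such as baz (assumption for illustration).
using RCallback = std::function<bool(int)>;

// Hypothetical global lock: at most one thread may run "R code" at a time.
std::mutex g_r_mutex;

// Evaluate `v > 1 AND r_fn(v)` over each batch on its own worker thread and
// return the number of rows kept per batch.
std::vector<std::size_t> FilterBatches(const std::vector<Batch>& batches,
                                       const RCallback& r_fn) {
  std::vector<std::size_t> kept(batches.size(), 0);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < batches.size(); ++i) {
    workers.emplace_back([&, i] {
      // Native part: runs concurrently on every worker, no lock needed.
      Batch survivors;
      for (int v : batches[i]) {
        if (v > 1) survivors.push_back(v);
      }
      // Opaque part: take the lock once per batch rather than once per row.
      std::size_t n = 0;
      {
        std::lock_guard<std::mutex> lock(g_r_mutex);
        for (int v : survivors) {
          if (r_fn(v)) ++n;
        }
      }
      kept[i] = n;  // each worker writes only to its own slot
    });
  }
  for (std::thread& w : workers) w.join();
  return kept;
}
```

Rows rejected by the native predicate never reach the callback, so time spent 
holding the lock shrinks with the selectivity of the translatable part of the 
filter.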

  was:
Currently, if an R expression is not entirely supported by the Arrow compute 
engine, the entire input is pulled into memory for native R to operate on. 
It would be possible to instead add a custom compute function to the 
registry (inside {{R_init_arrow}}, probably) which evaluates any 
subexpressions that couldn't be translated to native Arrow compute expressions.

This would, for example, allow a filter expression including a call to an R 
function {{baz}} to evaluate on a dataset larger than memory, with predicate 
and projection pushdown working as normal for the subexpressions which *are* 
translatable. The resulting expression might look something like this in C++:

{code}
call("and_kleene", {
  call("greater", {field_ref("a"), scalar(1)}),
  call("r_expr", {field_ref("b")},
      /*options=*/RexprOptions{cpp11::function(baz_sexp)}),
});
{code}

In this case, although the "r_expr" function is opaque to compute and datasets, 
we would still recognize that only fields "a" and "b" need to be materialized. 
Furthermore, the first member of the filter's conjunction is {{a > 1}}, which 
*is* translatable and could be used for predicate pushdown, for checking 
against Parquet statistics, etc.

Since R is not multithreaded, the compute function would need to take a global 
lock to ensure only a single thread of R execution. This would also lock the 
interpreter, so it's not a high-performance solution... but it *would* lock the 
interpreter less than doing everything in pure native R (since at least *some* 
of the work could be offloaded to worker threads). Still, it seems like a 
worthwhile option to consider.


> [R] Allow unrecognized R expressions to be callable as compute::Functions
> -------------------------------------------------------------------------
>
>                 Key: ARROW-17446
>                 URL: https://issues.apache.org/jira/browse/ARROW-17446
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Ben Kietzman
>            Priority: Major
>              Labels: compute


