[
https://issues.apache.org/jira/browse/ARROW-17446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580869#comment-17580869
]
Ben Kietzman commented on ARROW-17446:
--------------------------------------
A difficulty with this approach is currently only scalar functions can be used
in a filter, which would mean that we'd need to fallback to pure r if baz()
depended on batch order in any way. See also ARROW-12632, ongoing work to
support ordering between exec nodes
> [R] Allow unrecognized R expressions to be callable as compute::Functions
> -------------------------------------------------------------------------
>
> Key: ARROW-17446
> URL: https://issues.apache.org/jira/browse/ARROW-17446
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Ben Kietzman
> Priority: Major
> Labels: compute
>
> Currently, if an R expression is not entirely supported by the arrow compute
> engine, the entire input will be pulled into memory for native R to operate
> on. It would be possible to instead provide add a custom compute function to
> the registry (inside {{R_init_arrow}}, probably) which evaluates any sub
> expressions which couldn't be translated to native arrow compute expressions.
> This would for example allow a filter expression including a call to an R
> function {{baz}} to evaluate on a dataset larger than memory and with
> predicate and projection pushdown as normal using the expressions which *are*
> translatable. The resulting expression might look something like this in c++:
> {code}
> call("and_kleene", {
> call("greater", {field_ref("a"), scalar(1)}),
> call("r_expr", {field_ref("b")},
> /*options=*/RexprOptions{cpp11::function(baz_sexp)}),
> });
> {code}
> In this case although the "r_expr" function is opaque to compute and
> datasets, we would still recognize that only fields "a" and "b" need to be
> materialized. Furthermore, the first member of the filter's conjunction is
> {{a > 1}}, which *is* translatable and could be used for predicate pushdown,
> for checking against parquet statistics, etc.
> Since R is not multithreaded, the compute function would need to take a
> global lock to ensure only a single thread of R execution. This would also
> block the interpreter, so it's not a high-performance solution... but it
> *would* block the interpreter less than doing everything in pure native R
> (since at least *some* of the work could be offloaded to worker threads and
> we could take advantage of batched input). Still, it seems like a worthwhile
> option to consider
--
This message was sent by Atlassian Jira
(v8.20.10#820010)