Oliver Reiter created ARROW-17361:
-------------------------------------
Summary: dplyr::summarize fails with division when divisor is a variable
Key: ARROW-17361
URL: https://issues.apache.org/jira/browse/ARROW-17361
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 8.0.0
Reporter: Oliver Reiter
Hello,
I found this odd behaviour when trying to compute an aggregate with
dplyr::summarize: when I use a pre-defined variable as the divisor while
aggregating, the execution fails with 'unsupported expression'. When I
write the value of the variable directly in the aggregation, it works.
See below:
{code:java}
library(dplyr)
library(arrow)

small_dataset <- tibble::tibble(
  ## x = rep(c("a", "b"), each = 5),
  y = rep(1:5, 2)
)

## convert "small_dataset" into a ...dataset
tmpdir <- tempfile()
dir.create(tmpdir)
write_dataset(small_dataset, tmpdir)

## works
open_dataset(tmpdir) %>%
  summarize(value = sum(y) / 10) %>%
  collect()

## fails
scale_factor <- 10
open_dataset(tmpdir) %>%
  summarize(value = sum(y) / scale_factor) %>%
  collect()
#> Error: Error in summarize_eval(names(exprs)[i],
#> exprs[[i]], ctx, length(.data$group_by_vars) > :
#> Expression sum(y)/scale_factor is not an aggregate
#> expression or is not supported in Arrow
#> Call collect() first to pull data into R.
{code}
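Two possible workarounds, as a minimal sketch rather than a confirmed fix. The first assumes that injecting the value with rlang's !! operator turns the expression into the literal form that already works; I have not verified this against every Arrow version. The second follows the hint in the error message: aggregate in Arrow, then divide in R after collect(). The names result_inject, result_collect and total are only illustrative.
{code:java}
library(dplyr)
library(arrow)

## recreate the small dataset from the example above
tmpdir <- tempfile()
dir.create(tmpdir)
write_dataset(tibble::tibble(y = rep(1:5, 2)), tmpdir)

scale_factor <- 10

## option 1 (assumption): inject the value of scale_factor with !!,
## so that the captured expression is sum(y) / 10
result_inject <- open_dataset(tmpdir) %>%
  summarize(value = sum(y) / !!scale_factor) %>%
  collect()

## option 2: compute the aggregate in Arrow, do the division in R
result_collect <- open_dataset(tmpdir) %>%
  summarize(total = sum(y)) %>%
  collect() %>%
  mutate(value = total / scale_factor)
{code}
Option 2 does not depend on how Arrow resolves variables inside summarize, but it moves the final arithmetic out of Arrow, which only matters when the aggregated result is large.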
I was not sure how to name this issue/bug (if it is one), so if there is a
clearer, more descriptive title you're welcome to adjust.
Thanks for your work!
Oliver
{code:java}
> arrow_info()
Arrow package version: 8.0.0
Capabilities:
dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc TRUE
Memory:
Allocator jemalloc
Current 64 bytes
Max 41.25 Kb
Runtime:
SIMD Level avx2
Detected SIMD Level avx2
Build:
C++ Library Version 8.0.0
C++ Compiler GNU
C++ Compiler Version 12.1.0 {code}