[jira] [Created] (ARROW-17361) dplyr::summarize fails with division when divisor is a variable

Oliver Reiter (Jira) Tue, 09 Aug 2022 13:37:05 -0700

Oliver Reiter created ARROW-17361:
-------------------------------------

             Summary: dplyr::summarize fails with division when divisor is a 
variable
                 Key: ARROW-17361
                 URL: https://issues.apache.org/jira/browse/ARROW-17361
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 8.0.0
            Reporter: Oliver Reiter



Hello,

I found this odd behaviour when trying to compute an aggregate with 
dplyr::summarize: when I use a pre-defined variable as the divisor while 
aggregating, execution fails with an 'unsupported expression' error. When I 
use the value of the variable directly in the aggregation, it works.

 

See below:

 
{code:java}
library(dplyr)
library(arrow)

small_dataset <- tibble::tibble(
  ## x = rep(c("a", "b"), each = 5),
  y = rep(1:5, 2)
)

## convert "small_dataset" into a ...dataset
tmpdir <- tempfile()
dir.create(tmpdir)
write_dataset(small_dataset, tmpdir)

## works
open_dataset(tmpdir) %>%
  summarize(value = sum(y) / 10) %>%
  collect()

## fails
scale_factor <- 10
open_dataset(tmpdir) %>%
  summarize(value = sum(y) / scale_factor) %>%
  collect()
#> Error: Error in summarize_eval(names(exprs)[i],
#> exprs[[i]], ctx, length(.data$group_by_vars) > :
#>   Expression sum(y)/scale_factor is not an aggregate
#>   expression or is not supported in Arrow
#> Call collect() first to pull data into R.
   {code}
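A possible workaround, following the hint in the error message to call collect() first: do only the aggregation in Arrow, pull the one-row result into R, and divide by the variable there. This is a minimal sketch, not a fix for the underlying behaviour; the intermediate column name "total" and the "result" variable are made up for illustration.

```r
library(dplyr)
library(arrow)

# Same toy dataset as in the report, written to a temporary directory
small_dataset <- tibble::tibble(y = rep(1:5, 2))
tmpdir <- tempfile()
dir.create(tmpdir)
write_dataset(small_dataset, tmpdir)

scale_factor <- 10

result <- open_dataset(tmpdir) %>%
  summarize(total = sum(y)) %>%             # aggregation runs in Arrow
  collect() %>%                             # pull the one-row result into R
  mutate(value = total / scale_factor) %>%  # division runs in plain dplyr
  select(value)
```

Since the aggregate has already been reduced to a single row before collect(), the extra step in R should add little cost.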
I was not sure how to name this issue/bug (if it is one), so if there is a 
clearer, more descriptive title, please feel free to change it.

 

Thanks for your work!

 

Oliver

 
{code:java}
> arrow_info()
Arrow package version: 8.0.0

Capabilities:
               
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:
                  
Allocator jemalloc
Current   64 bytes
Max       41.25 Kb

Runtime:
                        
SIMD Level          avx2
Detected SIMD Level avx2

Build:
                           
C++ Library Version   8.0.0
C++ Compiler            GNU
C++ Compiler Version 12.1.0 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)