[jira] [Commented] (ARROW-17361) [R] dplyr::summarize fails with division when divisor is a variable

Jira Mon, 15 Aug 2022 07:35:43 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-17361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579737#comment-17579737
 ]


Dragoș Moldovan-Grünfeld commented on ARROW-17361:
--------------------------------------------------

Thanks for reporting this. I can confirm it is an unintended consequence of how 
we evaluate expressions inside {{summarise}}, and we should definitely get this 
to work. In the meantime, I recommend a workaround using {{rlang}}'s injection 
operator ({{!!}}), which inlines the value of {{scale_factor}} into the 
expression before it is evaluated:
{code:r}
open_dataset(tmpdir) %>%
  summarize(value = sum(y) / !!scale_factor) %>%
  collect()
#> # A tibble: 1 × 1
#>   value
#>   <dbl>
#> 1     3
{code}  
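
To illustrate what the injection does, here is a minimal sketch that uses only 
{{rlang}} and no Arrow dataset. It assumes the failure is tied to the 
unresolved symbol {{scale_factor}} in the captured expression; the names 
{{plain}} and {{injected}} are made up for this example and are not part of 
the arrow package:
{code:r}
library(rlang)

scale_factor <- 10

# Without injection, the captured expression still refers to the
# symbol `scale_factor`.
plain <- expr(sum(y) / scale_factor)

# With `!!`, the value is inlined when the expression is captured, so
# the expression holds the constant 10 instead of a variable name.
injected <- expr(sum(y) / !!scale_factor)

print(plain)
print(injected)

# all.vars() lists the symbols each expression still depends on.
print(all.vars(plain))
print(all.vars(injected))
{code}
After injection the expression only depends on the column {{y}}, which is the 
same form as the {{sum(y) / 10}} call that already works.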

> [R] dplyr::summarize fails with division when divisor is a variable
> -------------------------------------------------------------------
>
>                 Key: ARROW-17361
>                 URL: https://issues.apache.org/jira/browse/ARROW-17361
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 8.0.0
>            Reporter: Oliver Reiter
>            Assignee: Dragoș Moldovan-Grünfeld
>            Priority: Minor
>              Labels: aggregation, dplyr
>
> Hello,
> I found this odd behaviour when trying to compute an aggregate with 
> dplyr::summarize: when I use a pre-defined variable as the divisor in a 
> division while aggregating, the execution fails with 'unsupported 
> expression'. When I use the value of the variable directly in the 
> aggregation, it works.
>  
> See below:
>  
> {code:r}
> library(dplyr)
> library(arrow)
> small_dataset <- tibble::tibble(
>   ## x = rep(c("a", "b"), each = 5),
>   y = rep(1:5, 2)
> )
> ## convert "small_dataset" into a ...dataset
> tmpdir <- tempfile()
> dir.create(tmpdir)
> write_dataset(small_dataset, tmpdir)
> ## works
> open_dataset(tmpdir) %>%
>   summarize(value = sum(y) / 10) %>%
>   collect()
> ## fails
> scale_factor <- 10
> open_dataset(tmpdir) %>%
>   summarize(value = sum(y) / scale_factor) %>%
>   collect()
> #> Error: Error in summarize_eval(names(exprs)[i],
> #> exprs[[i]], ctx, length(.data$group_by_vars) > :
> #>   Expression sum(y)/scale_factor is not an aggregate
> #>   expression or is not supported in Arrow
> #> Call collect() first to pull data into R.
>    {code}
> I was not sure how to name this issue/bug (if it is one), so if there is a 
> clearer, more descriptive title you're welcome to adjust.
>  
> Thanks for your work!
>  
> Oliver
>  
> {code:r}
> > arrow_info()
> Arrow package version: 8.0.0
> Capabilities:
>                
> dataset    TRUE
> substrait FALSE
> parquet    TRUE
> json       TRUE
> s3         TRUE
> utf8proc   TRUE
> re2        TRUE
> snappy     TRUE
> gzip       TRUE
> brotli     TRUE
> zstd       TRUE
> lz4        TRUE
> lz4_frame  TRUE
> lzo       FALSE
> bz2        TRUE
> jemalloc   TRUE
> mimalloc   TRUE
> Memory:
>                   
> Allocator jemalloc
> Current   64 bytes
> Max       41.25 Kb
> Runtime:
>                         
> SIMD Level          avx2
> Detected SIMD Level avx2
> Build:
>                            
> C++ Library Version   8.0.0
> C++ Compiler            GNU
> C++ Compiler Version 12.1.0 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
