[ 
https://issues.apache.org/jira/browse/ARROW-15679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook resolved ARROW-15679.
------------------------------
    Fix Version/s: 8.0.0
       Resolution: Fixed

Issue resolved by pull request 12435
[https://github.com/apache/arrow/pull/12435]

> [R] count should return an ungrouped dataframe
> ----------------------------------------------
>
>                 Key: ARROW-15679
>                 URL: https://issues.apache.org/jira/browse/ARROW-15679
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 7.0.0
>            Reporter: Sam Albers
>            Assignee: Sam Albers
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 8.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Unless the input is already grouped, `dplyr::count` returns an ungrouped 
> data.frame. The arrow implementation instead returns a result that is still 
> grouped by some of the variables passed to `count()`:
>  
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> tf1 <- tempfile()
> dir.create(tf1)
> starwars |>
>   write_dataset(tf1)
> # no group ----------------------------------------------------------------
> ## dplyr behaviour
> count_dplyr_no_group <- starwars %>%
>   count(gender, homeworld, species)
> group_vars(count_dplyr_no_group)
> #> character(0)
> ## arrow behaviour
> count_arrow_no_group <- open_dataset(tf1) %>%
>   count(gender, homeworld, species) %>%
>   collect()
> group_vars(count_arrow_no_group)
> #> [1] "gender"    "homeworld"
> {code}
> If I am correct that this behaviour is undesired, I think it can be fixed 
> [here|https://github.com/apache/arrow/blob/5ad5ddcafee8fada9cebb341df638b750c98efb7/r/R/dplyr-count.R#L20-L35]
>  using this patch:
>  
> {code:r}
> count.arrow_dplyr_query <- function(x, ..., wt = NULL, sort = FALSE, name = NULL) {
>   if (!missing(...)) {
>     out <- dplyr::group_by(x, ..., .add = TRUE)
>   } else {
>     out <- x
>   }
>   out <- dplyr::tally(out, wt = {{ wt }}, sort = sort, name = name)
>   gv <- dplyr::group_vars(x)
>   if (rlang::is_empty(gv)) {
>     out <- dplyr::ungroup(out)
>   } else {
>     # Restore original group vars
>     out$group_by_vars <- gv
>   }
>   out
> }
> {code}
>  
> I can submit a PR with some tests if that would be helpful.
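>  
> A minimal sketch of what such tests might check, assuming testthat is 
> available and an arrow build that includes the fix. The test descriptions and 
> the `tf2` path are hypothetical and not taken from the pull request. The idea 
> is simply that arrow's `count()` should report the same grouping variables as 
> dplyr's, for both ungrouped and grouped input:
>  
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> library(testthat)
>
> # Write only the columns used below to a temporary dataset
> # (tf2 is a hypothetical name chosen for this sketch)
> tf2 <- tempfile()
> dir.create(tf2)
> starwars %>%
>   select(gender, homeworld, species) %>%
>   write_dataset(tf2)
>
> test_that("count() on ungrouped input matches dplyr grouping", {
>   expected <- starwars %>%
>     count(gender, homeworld, species)
>   actual <- open_dataset(tf2) %>%
>     count(gender, homeworld, species) %>%
>     collect()
>   # dplyr returns no grouping variables here, so arrow should too
>   expect_identical(group_vars(actual), group_vars(expected))
> })
>
> test_that("count() on grouped input keeps the original groups", {
>   expected <- starwars %>%
>     group_by(gender) %>%
>     count(homeworld)
>   actual <- open_dataset(tf2) %>%
>     group_by(gender) %>%
>     count(homeworld) %>%
>     collect()
>   # dplyr keeps the pre-existing group, so arrow should too
>   expect_identical(group_vars(actual), group_vars(expected))
> })
> {code}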



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
