[GitHub] [arrow] DavZim opened a new issue, #14872: [R] arrow returns wrong variable content when multiple group_by/summarise statements are used

GitBox Wed, 07 Dec 2022 07:55:53 -0800


DavZim opened a new issue, #14872:
URL: https://github.com/apache/arrow/issues/14872


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When collecting a query with multiple `group_by()` + `summarise()` steps, one variable is wrongly assigned the values of another variable. Inserting an `ungroup()` between the steps makes the result correct again.
   
   To reproduce, consider the following. In the first example below, the variable `gender` should contain `F` or `M`, not `Group X`. When `ungroup()` is inserted (second example), `gender` is `F`/`M` again.
   
   ``` r
   library(dplyr)
   library(arrow)
   
   # Create sample dataset
   N <- 1000
   set.seed(123)
   orig_data <- tibble(
     code_group = sample(paste("Group", 1:2), N, replace = TRUE),
     year = sample(2015:2016, N, replace = TRUE),
     gender = sample(c("F", "M"), N, replace = TRUE),
     value = runif(N, 0, 10)
   )
   write_dataset(orig_data, "example")
   
   # Query and replicate the error
   (ds <- open_dataset("example/"))
   #> FileSystemDataset with 1 Parquet file
   #> code_group: string
   #> year: int32
   #> gender: string
   #> value: double
   
   ds |>
     group_by(year, code_group, gender) |>
     summarise(value = sum(value)) |>
     group_by(code_group, gender) |>
     summarise(value = max(value), NN = n()) |>
     collect()
   #> # A tibble: 2 × 4
   #> # Groups:   code_group [2]
   #>   code_group gender  value    NN
   #>   <chr>      <chr>   <dbl> <int>
   #> 1 Group 1    Group 1  724.     4
   #> 2 Group 2    Group 2  661.     4
   ```
   
   **ERROR**: the `gender` variable is replaced by the values of the `code_group` variable.
   
   ``` r
   ds |>
     group_by(year, code_group, gender) |>
     summarise(value = sum(value)) |>
     ungroup() |>                                             #< Added this line...
     group_by(code_group, gender) |>
     summarise(value = max(value), NN = n()) |>
     collect()
   #> # A tibble: 4 × 4
   #> # Groups:   code_group [2]
   #>   code_group gender value    NN
   #>   <chr>      <chr>  <dbl> <int>
   #> 1 Group 1    F       724.     2
   #> 2 Group 2    M       627.     2
   #> 3 Group 1    M       658.     2
   #> 4 Group 2    F       661.     2
   ```
   
   **Note**: after inserting `ungroup()` between the two `group_by()`/`summarise()` calls, `gender` is no longer replaced.
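
   As a cross-check, the same pipeline can be run on the in-memory tibble with plain dplyr, without going through arrow. This is only a sketch of a reference computation: it rebuilds `orig_data` exactly as in the reproduction above, and the name `expected` is introduced here for illustration. The `stopifnot()` call states the property the arrow result should also satisfy, namely that `gender` only ever contains `F` or `M`.

   ``` r
   library(dplyr)

   # Rebuild the sample data exactly as in the reproduction above
   N <- 1000
   set.seed(123)
   orig_data <- tibble(
     code_group = sample(paste("Group", 1:2), N, replace = TRUE),
     year = sample(2015:2016, N, replace = TRUE),
     gender = sample(c("F", "M"), N, replace = TRUE),
     value = runif(N, 0, 10)
   )

   # Same two-step aggregation, evaluated by dplyr on the tibble (no arrow, no ungroup())
   expected <- orig_data |>
     group_by(year, code_group, gender) |>
     summarise(value = sum(value)) |>
     group_by(code_group, gender) |>
     summarise(value = max(value), NN = n())

   # gender must only contain the original levels, never a code_group value
   stopifnot(all(expected$gender %in% c("F", "M")))
   expected
   ```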
   
   
   Quick look at the query (note Node 4 where `"gender": code_group`)
   
   ``` r
   ds |>
     group_by(year, code_group, gender) |>
     summarise(value = sum(value)) |>
     group_by(code_group, gender) |>
     summarise(value = max(value), NN = n()) |> 
     show_query()
   #> ExecPlan with 8 nodes:
   #> 7:SinkNode{}
   #>   6:ProjectNode{projection=[code_group, gender, value, NN]}
   #>     5:GroupByNode{keys=["code_group", "gender"], aggregates=[
   #>      hash_max(value, {skip_nulls=false, min_count=0}),
   #>      hash_sum(NN, {skip_nulls=true, min_count=1}),
   #>     ]}
   #>       4:ProjectNode{projection=[value, "NN": 1, code_group, "gender": code_group]}       #< gender is wrongfully mapped to code_group!
   #>         3:ProjectNode{projection=[year, code_group, gender, value]}
   #>           2:GroupByNode{keys=["year", "code_group", "gender"], aggregates=[
   #>              hash_sum(value, {skip_nulls=false, min_count=0}),
   #>           ]}
   #>             1:ProjectNode{projection=[value, year, code_group, gender]}
   #>               0:SourceNode{}
   ```
   
   Note that this was also asked [here on SO](https://stackoverflow.com/q/74710844/3048453).
   
   ### Component(s)
   
   R


