DavZim opened a new issue, #14872:
URL: https://github.com/apache/arrow/issues/14872
### Describe the bug, including details regarding any error messages, version, and platform.
When collecting a query with multiple `group_by()` + `summarise()` steps, one
variable is wrongly assigned the values of another variable. When an `ungroup()`
is inserted between the steps, the result is correct again.
To reproduce, consider the example below. The variable `gender` should be `F`
or `M`, not `Group X`.
With the `ungroup()` inserted (second part), `gender` is again `F`/`M` and not
`Group X`.
``` r
library(dplyr)
library(arrow)
# Create sample dataset
N <- 1000
set.seed(123)
orig_data <- tibble(
  code_group = sample(paste("Group", 1:2), N, replace = TRUE),
  year = sample(2015:2016, N, replace = TRUE),
  gender = sample(c("F", "M"), N, replace = TRUE),
  value = runif(N, 0, 10)
)
write_dataset(orig_data, "example")
# Query and replicate the error
(ds <- open_dataset("example/"))
#> FileSystemDataset with 1 Parquet file
#> code_group: string
#> year: int32
#> gender: string
#> value: double
ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 2 × 4
#> # Groups: code_group [2]
#> code_group gender value NN
#> <chr> <chr> <dbl> <int>
#> 1 Group 1 Group 1 724. 4
#> 2 Group 2 Group 2 661. 4
```
**ERROR:** the `gender` variable is replaced by the values of the `code_group` variable.
``` r
ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  ungroup() |> #< Added this line...
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 4 × 4
#> # Groups: code_group [2]
#> code_group gender value NN
#> <chr> <chr> <dbl> <int>
#> 1 Group 1 F 724. 2
#> 2 Group 2 M 627. 2
#> 3 Group 1 M 658. 2
#> 4 Group 2 F 661. 2
```
**Note:** after inserting `ungroup()` between the two `group_by()` +
`summarise()` calls, `gender` is no longer replaced.
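For reference, the expected result can be cross-checked by running the same two-step aggregation with plain dplyr on the in-memory tibble, without Arrow. This is only a sketch: it assumes R >= 4.1 (for `|>`), the name `expected` is introduced here for illustration, and `.groups = "drop_last"` merely spells out dplyr's default so that no grouping message is printed.

``` r
library(dplyr)

# Same sample data as in the report, kept in memory (no Arrow involved)
N <- 1000
set.seed(123)
orig_data <- tibble(
  code_group = sample(paste("Group", 1:2), N, replace = TRUE),
  year = sample(2015:2016, N, replace = TRUE),
  gender = sample(c("F", "M"), N, replace = TRUE),
  value = runif(N, 0, 10)
)

# Same pipeline as the failing Arrow query, but evaluated by dplyr itself.
# `expected` is a hypothetical name used only for this comparison.
expected <- orig_data |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value), .groups = "drop_last") |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n(), .groups = "drop_last")

# gender must only ever contain the original levels
stopifnot(all(expected$gender %in% c("F", "M")))
expected
```

Comparing `expected` with the collected Arrow result should make the mismatch in `gender` visible without inspecting the query plan.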
A quick look at the query plan (note node 4, where `"gender": code_group`):
``` r
ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  show_query()
#> ExecPlan with 8 nodes:
#> 7:SinkNode{}
#> 6:ProjectNode{projection=[code_group, gender, value, NN]}
#> 5:GroupByNode{keys=["code_group", "gender"], aggregates=[
#> hash_max(value, {skip_nulls=false, min_count=0}),
#> hash_sum(NN, {skip_nulls=true, min_count=1}),
#> ]}
#> 4:ProjectNode{projection=[value, "NN": 1, code_group, "gender": code_group]} #< gender is wrongfully mapped to code_group!
#> 3:ProjectNode{projection=[year, code_group, gender, value]}
#> 2:GroupByNode{keys=["year", "code_group", "gender"], aggregates=[
#> hash_sum(value, {skip_nulls=false, min_count=0}),
#> ]}
#> 1:ProjectNode{projection=[value, year, code_group, gender]}
#> 0:SourceNode{}
```
Note that this was also asked [here on SO](https://stackoverflow.com/q/74710844/3048453).
### Component(s)
R