[
https://issues.apache.org/jira/browse/ARROW-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562721#comment-17562721
]
Weston Pace commented on ARROW-16700:
-------------------------------------
{quote}tl;dr it looks like min/max itself is broken and only aggregates only
the last partition it sees (in each thread); every Consume call casually
ignores and overrides this->state. Extraordinary claims demand extraordinary
evidence, especially since I only started looking at Acero in-depth and
installed R about five hours ago, so here's my complete analysis for
posterity...{quote}
Yes, min/max is broken in the way you describe. This is captured in
ARROW-16904. [~octalene] is working on updating the unit tests so that we can
reproduce this issue correctly. Sorry, I hadn't realized this JIRA also
involved a min/max and I hadn't realized the project workaround you mentioned.
However, that is excellent analysis.
You are correct that a project node would fix this issue. However, a project
node isn't normally inserted immediately after a scan node. In fact, what is
happening here, is that R always inserts a project node immediately _before_ an
aggregate node. Either way, that is a pretty significant workaround, as the
project & filter nodes are already satisfied by the guarantee.
Still, I'd like to leave this issue in place for the moment, though maybe it
doesn't need to be a blocker. Future Substrait queries would, in theory, be
able to create plans without the preceding project node. At some point the
scan node will not emit these columns unless they are asked for so I don't
think a project to satisfy the column-dropping emit will always be necessary
either.
> [C++] [R] [Datasets] aggregates on partitioning columns
> -------------------------------------------------------
>
> Key: ARROW-16700
> URL: https://issues.apache.org/jira/browse/ARROW-16700
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Reporter: Jonathan Keane
> Assignee: Jeroen van Straten
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 9.0.0, 8.0.1
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> When summarizing a whole dataset (without group_by) with an aggregate, and
> summarizing a partitioned column, arrow returns wrong data:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> df <- expand.grid(
> some_nulls = c(0L, 1L, 2L),
> year = 2010:2023,
> month = 1:12,
> day = 1:30
> )
> path <- tempfile()
> dir.create(path)
> write_dataset(df, path, partitioning = c("year", "month"))
> ds <- open_dataset(path)
> # with arrow the mins/maxes are off for partitioning columns
> ds %>%
> summarise(n = n(), min_year = min(year), min_month = min(month), min_day =
> min(day), max_year = max(year), max_month = max(month), max_day = max(day))
> %>%
> collect()
> #> # A tibble: 1 × 7
> #> n min_year min_month min_day max_year max_month max_day
> #> <int> <int> <int> <int> <int> <int> <int>
> #> 1 15120 2023 1 1 2023 12 30
> # comapred to what we get with dplyr
> df %>%
> summarise(n = n(), min_year = min(year), min_month = min(month), min_day =
> min(day), max_year = max(year), max_month = max(month), max_day = max(day))
> %>%
> collect()
> #> n min_year min_month min_day max_year max_month max_day
> #> 1 15120 2010 1 1 2023 12 30
> # even min alone is off:
> ds %>%
> summarise(min_year = min(year)) %>%
> collect()
> #> # A tibble: 1 × 1
> #> min_year
> #> <int>
> #> 1 2016
>
> # but non-partitioning columns are fine:
> ds %>%
> summarise(min_day = min(day)) %>%
> collect()
> #> # A tibble: 1 × 1
> #> min_day
> #> <int>
> #> 1 1
>
>
> # But with a group_by, this seems ok
> ds %>%
> group_by(some_nulls) %>%
> summarise(min_year = min(year)) %>%
> collect()
> #> # A tibble: 3 × 2
> #> some_nulls min_year
> #> <int> <int>
> #> 1 0 2010
> #> 2 1 2010
> #> 3 2 2010
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)