[jira] [Commented] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns

Weston Pace (Jira) Tue, 05 Jul 2022 08:29:09 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562721#comment-17562721
 ]


Weston Pace commented on ARROW-16700:
-------------------------------------

{quote}tl;dr it looks like min/max itself is broken and only aggregates only 
the last partition it sees (in each thread); every Consume call casually 
ignores and overrides this->state. Extraordinary claims demand extraordinary 
evidence, especially since I only started looking at Acero in-depth and 
installed R about five hours ago, so here's my complete analysis for 
posterity...{quote}

Yes, min/max is broken in the way you describe.  This is captured in 
ARROW-16904.  [~octalene] is working on updating the unit tests so that we can 
reproduce this issue correctly.  Sorry, I hadn't realized this JIRA also 
involved a min/max and I hadn't realized the project workaround you mentioned.  
However, that is excellent analysis.

You are correct that a project node would fix this issue.  However, a project 
node isn't normally inserted immediately after a scan node.  In fact, what is 
happening here, is that R always inserts a project node immediately _before_ an 
aggregate node.  Either way, that is a pretty significant workaround, as the 
project & filter nodes are already satisfied by the guarantee.

Still, I'd like to leave this issue in place for the moment, though maybe it 
doesn't need to be a blocker.  Future Substrait queries would, in theory, be 
able to create plans without the preceding project node.  At some point the 
scan node will not emit these columns unless they are asked for so I don't 
think a project to satisfy the column-dropping emit will always be necessary 
either.

> [C++] [R] [Datasets] aggregates on partitioning columns
> -------------------------------------------------------
>
>                 Key: ARROW-16700
>                 URL: https://issues.apache.org/jira/browse/ARROW-16700
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>            Reporter: Jonathan Keane
>            Assignee: Jeroen van Straten
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 9.0.0, 8.0.1
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When summarizing a whole dataset (without group_by) with an aggregate, and 
> summarizing a partitioned column, arrow returns wrong data:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> df <- expand.grid(
>   some_nulls = c(0L, 1L, 2L),
>   year = 2010:2023,
>   month = 1:12,
>   day = 1:30
> )
> path <- tempfile()
> dir.create(path)
> write_dataset(df, path, partitioning = c("year", "month"))
> ds <- open_dataset(path)
> # with arrow the mins/maxes are off for partitioning columns
> ds %>%
>   summarise(n = n(), min_year = min(year), min_month = min(month), min_day = 
> min(day), max_year = max(year), max_month = max(month), max_day = max(day)) 
> %>% 
>   collect()
> #> # A tibble: 1 × 7
> #>       n min_year min_month min_day max_year max_month max_day
> #>   <int>    <int>     <int>   <int>    <int>     <int>   <int>
> #> 1 15120     2023         1       1     2023        12      30
> # comapred to what we get with dplyr
> df %>%
>   summarise(n = n(), min_year = min(year), min_month = min(month), min_day = 
> min(day), max_year = max(year), max_month = max(month), max_day = max(day)) 
> %>% 
>   collect()
> #>       n min_year min_month min_day max_year max_month max_day
> #> 1 15120     2010         1       1     2023        12      30
> # even min alone is off:
> ds %>%
>   summarise(min_year = min(year)) %>% 
>   collect()
> #> # A tibble: 1 × 1
> #>   min_year
> #>      <int>
> #> 1     2016
>   
> # but non-partitioning columns are fine:
> ds %>%
>   summarise(min_day = min(day)) %>% 
>   collect()
> #> # A tibble: 1 × 1
> #>   min_day
> #>     <int>
> #> 1       1
>   
>   
> # But with a group_by, this seems ok
> ds %>%
>   group_by(some_nulls) %>%
>   summarise(min_year = min(year)) %>% 
>   collect()
> #> # A tibble: 3 × 2
> #>   some_nulls min_year
> #>        <int>    <int>
> #> 1          0     2010
> #> 2          1     2010
> #> 3          2     2010
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns

Reply via email to