[jira] [Commented] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns

Weston Pace (Jira) Tue, 05 Jul 2022 10:06:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562762#comment-17562762
 ]


Weston Pace commented on ARROW-16700:
-------------------------------------

> So I guess that basically means that my PR is linked to the wrong issue? I'm 
> not sure how to neatly resolve that

I think you just need to update the title of the PR.  I'm not sure if that 
removes the linkage to this issue but that's probably ok.  It should add 
linkage to the correct PR.

> For the guarantee issue that remains, I'm inclined to place the blame on 
> Scanner rather than all the nodes other than filter and project, and to just 
> insert the code for a trivial projection into Scanner to leverage the 
> existing SimplifyWithGuarantee implementation

I agree the best fix is on the scanner and this change should happen within the 
scan node.

> I'm assuming that an expression that only selects an existing field will just 
> result in a pointer copy, and that evaluating a literal expression to a 
> scalar is also cheap (at least if the literal isn't massive). What do you 
> think?

I think you are correct here.  It should be cheap.

> I'm not 100% sure on that one because of the tag fields that Scanner normally 
> adds, which Substrait wouldn't know about. It feels fragile to leave them in 
> because I imagine they would affect column indices after a join if not 
> treated carefully. But I get your point.

I'm not entirely sure I understand your point on the join.  The partition 
columns I think can just be normal columns as far as Substrait is considered.  
Are you maybe thinking of the __fragment_index, __filename, etc. columns?

> [C++] [R] [Datasets] aggregates on partitioning columns
> -------------------------------------------------------
>
>                 Key: ARROW-16700
>                 URL: https://issues.apache.org/jira/browse/ARROW-16700
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>            Reporter: Jonathan Keane
>            Assignee: Jeroen van Straten
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 9.0.0, 8.0.1
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When summarizing a whole dataset (without group_by) with an aggregate, and 
> summarizing a partitioned column, arrow returns wrong data:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> df <- expand.grid(
>   some_nulls = c(0L, 1L, 2L),
>   year = 2010:2023,
>   month = 1:12,
>   day = 1:30
> )
> path <- tempfile()
> dir.create(path)
> write_dataset(df, path, partitioning = c("year", "month"))
> ds <- open_dataset(path)
> # with arrow the mins/maxes are off for partitioning columns
> ds %>%
>   summarise(n = n(), min_year = min(year), min_month = min(month), min_day = 
> min(day), max_year = max(year), max_month = max(month), max_day = max(day)) 
> %>% 
>   collect()
> #> # A tibble: 1 × 7
> #>       n min_year min_month min_day max_year max_month max_day
> #>   <int>    <int>     <int>   <int>    <int>     <int>   <int>
> #> 1 15120     2023         1       1     2023        12      30
> # comapred to what we get with dplyr
> df %>%
>   summarise(n = n(), min_year = min(year), min_month = min(month), min_day = 
> min(day), max_year = max(year), max_month = max(month), max_day = max(day)) 
> %>% 
>   collect()
> #>       n min_year min_month min_day max_year max_month max_day
> #> 1 15120     2010         1       1     2023        12      30
> # even min alone is off:
> ds %>%
>   summarise(min_year = min(year)) %>% 
>   collect()
> #> # A tibble: 1 × 1
> #>   min_year
> #>      <int>
> #> 1     2016
>   
> # but non-partitioning columns are fine:
> ds %>%
>   summarise(min_day = min(day)) %>% 
>   collect()
> #> # A tibble: 1 × 1
> #>   min_day
> #>     <int>
> #> 1       1
>   
>   
> # But with a group_by, this seems ok
> ds %>%
>   group_by(some_nulls) %>%
>   summarise(min_year = min(year)) %>% 
>   collect()
> #> # A tibble: 3 × 2
> #>   some_nulls min_year
> #>        <int>    <int>
> #> 1          0     2010
> #> 2          1     2010
> #> 3          2     2010
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns

Reply via email to