[ https://issues.apache.org/jira/browse/ARROW-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561218#comment-17561218 ]

Weston Pace commented on ARROW-16700:
-------------------------------------

So the root cause here is how partition columns are handed to the exec plan.  
In the datasets API a partitioning doesn't necessarily yield a single value for 
a batch, but rather an expression.

For example, the folder `/foo=7/chunk-0.parquet` will yield the expression `foo 
== 7` instead of the value `foo:7`.  This allows for rather novel partitioning 
schemes like `/temp=med/chunk-0.parquet`, which could attach the expression `25 
< temp < 75`.
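
A minimal R sketch of that mapping, using the same `write_dataset()`/`open_dataset()` API as the reproduction quoted below (the `foo`/`bar` columns and the exact file names are hypothetical):

{code:r}
library(arrow, warn.conflicts = FALSE)

df <- data.frame(foo = c(7L, 7L, 8L), bar = 1:3)
path <- tempfile()
dir.create(path)
write_dataset(df, path, partitioning = "foo")

# the value of `foo` lives only in the directory names
list.files(path, recursive = TRUE)
#> e.g. "foo=7/part-0.parquet" "foo=8/part-0.parquet"

# when the dataset is scanned, each batch gets the partition expression
# (`foo == 7` or `foo == 8`) for the directory it was read from
ds <- open_dataset(path)
{code}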

Unfortunately, this cleverness cannot really be utilized by the execution 
engine.  For example, how does one resolve the query `SELECT * FROM 
measurements WHERE temp == 50` when all that is known for a batch is `25 < temp < 
75`?

At the moment, these expressions are attached to the incoming batch as a 
"guarantee".  In addition, we add a new column to the batch (e.g. `foo` or 
`temp`).  However, no value is ever set on this column, so it ends up 
implicitly filled with `NULL`.  This guarantee is then used by project and 
filter nodes to simplify expressions.
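
Continuing the sketch above, it is easy to see why this placeholder column exists: the partition column is never written into the parquet files themselves, so the scan has to add it back, and today it adds it back empty (the `foo` dataset and file layout are the hypothetical ones from the previous sketch):

{code:r}
# the physical file contains no `foo` column at all; its value is only in the path
one_file <- list.files(path, recursive = TRUE, full.names = TRUE)[1]
names(read_parquet(one_file))
#> "bar"   (no `foo`)

# the scan node re-adds `foo` as an all-null column and attaches the
# guarantee expression `foo == 7`; downstream nodes must rely on the guarantee
{code}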

This is why the query `SELECT foo FROM dataset WHERE foo == 7` works.  The 
expression `foo == 7` gets simplified by the guarantee (`foo == 7`) to `true`, 
so the row is included (even though the value stored in the row is null).  The 
projection `field_ref('foo')` gets simplified to `scalar(7)`, so the 
underlying array (which is full of nulls) is never looked at.
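
In dplyr terms (against the `ds` from the reproduction quoted below, with arrow and dplyr loaded) this is the case that still works, because both the filter and the projection are rewritten from the guarantee before the null-filled column is ever consulted:

{code:r}
# filter and projection on a partition column are simplified by the guarantee,
# so the expected values come back even though the scanned column is all null
ds %>%
  filter(year == 2010) %>%
  select(year, month, day) %>%
  collect() %>%
  head()
{code}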

However, aggregate nodes do not simplify their target expressions against the 
guarantee.  Instead, they see the raw underlying values (all null), and so the 
aggregate node produces the wrong result.

One quick fix would be to get rid of the guarantee concept entirely (a rough 
sketch of the intended behaviour follows this list).  When we have an incoming 
partition expression we should:

 * If the expression is an equality, we replace it with a scalar.
 * If the expression is not an equality, we raise an error.
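
As a rough illustration of what "replace it with a scalar" would mean for the reproduction below (a sketch of the intended behaviour only, not the actual C++ change): a batch scanned from `year=2010/month=1/` would carry real values in its partition columns instead of nulls, at which point aggregates need no special handling:

{code:r}
# hypothetical batch read from year=2010/month=1/part-0.parquet
batch <- data.frame(some_nulls = c(0L, 1L, 2L), day = c(1L, 1L, 1L))

# today: the scan adds `year` as an all-null column plus a guarantee
batch$year <- NA_integer_
min(batch$year, na.rm = TRUE)   # warns and returns Inf; no values to aggregate

# with the proposed fix: the equality `year == 2010` is turned into a scalar
# that is materialized in the column, so plain aggregation just works
batch$year  <- 2010L
batch$month <- 1L
min(batch$year)                 # 2010, as expected
{code}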

> [C++] [R] [Datasets] aggregates on partitioning columns
> -------------------------------------------------------
>
>                 Key: ARROW-16700
>                 URL: https://issues.apache.org/jira/browse/ARROW-16700
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>            Reporter: Jonathan Keane
>            Priority: Blocker
>             Fix For: 9.0.0, 8.0.1
>
>
> When summarizing a whole dataset (without group_by) with an aggregate that 
> involves a partitioning column, arrow returns wrong data:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> df <- expand.grid(
>   some_nulls = c(0L, 1L, 2L),
>   year = 2010:2023,
>   month = 1:12,
>   day = 1:30
> )
> path <- tempfile()
> dir.create(path)
> write_dataset(df, path, partitioning = c("year", "month"))
> ds <- open_dataset(path)
> # with arrow the mins/maxes are off for partitioning columns
> ds %>%
>   summarise(n = n(), min_year = min(year), min_month = min(month),
>             min_day = min(day), max_year = max(year),
>             max_month = max(month), max_day = max(day)) %>%
>   collect()
> #> # A tibble: 1 × 7
> #>       n min_year min_month min_day max_year max_month max_day
> #>   <int>    <int>     <int>   <int>    <int>     <int>   <int>
> #> 1 15120     2023         1       1     2023        12      30
> # compared to what we get with dplyr
> df %>%
>   summarise(n = n(), min_year = min(year), min_month = min(month),
>             min_day = min(day), max_year = max(year),
>             max_month = max(month), max_day = max(day)) %>%
>   collect()
> #>       n min_year min_month min_day max_year max_month max_day
> #> 1 15120     2010         1       1     2023        12      30
> # even min alone is off:
> ds %>%
>   summarise(min_year = min(year)) %>% 
>   collect()
> #> # A tibble: 1 × 1
> #>   min_year
> #>      <int>
> #> 1     2016
>   
> # but non-partitioning columns are fine:
> ds %>%
>   summarise(min_day = min(day)) %>% 
>   collect()
> #> # A tibble: 1 × 1
> #>   min_day
> #>     <int>
> #> 1       1
>   
>   
> # But with a group_by, this seems ok
> ds %>%
>   group_by(some_nulls) %>%
>   summarise(min_year = min(year)) %>% 
>   collect()
> #> # A tibble: 3 × 2
> #>   some_nulls min_year
> #>        <int>    <int>
> #> 1          0     2010
> #> 2          1     2010
> #> 3          2     2010
> {code}


