[ https://issues.apache.org/jira/browse/ARROW-13618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ian Cook updated ARROW-13618:
-----------------------------
Description:
ARROW-13344 enabled the dplyr verb {{summarise()}} to use the Arrow engine but
kept this off by default, controlled by the {{arrow.debug}} option.
Before this can be turned on by default, we should ensure that the following
are all implemented:
* a sufficient set of hash aggregate kernels and R aggregate function mappings
to them, covering the vast majority of the aggregate functions that dplyr users
call in {{summarise()}} (add any additional required ones to ARROW-13339)
* support for a sufficient set of data types in aggregates
* support for a sufficient set of data types in grouping columns
* handling of {{NA}} and {{NaN}} values in aggregates and the {{na.rm}} option
consistent with base R and dplyr (ARROW-13497 and possibly other issues)
* handling of {{NA}} and {{NaN}} values in grouping columns consistent with
dplyr
* handling of empty or invalid input to {{summarise()}} (ARROW-13543)
* many new tests to confirm equivalent results from a variety of
{{group_by() %>% summarise()}} queries on data frames and on Arrow data
* resolution of various related bugs
was:
ARROW-13344 enabled the dplyr verb {{summarise()}} to use the Arrow engine but
kept this off by default, controlled by the {{arrow.debug}} option.
Before this can be turned on by default, we should ensure that the following
are all implemented:
* a sufficient set of hash aggregate kernels and R aggregate function mappings
to them, covering the vast majority of the aggregate functions that dplyr users
call in {{summarise()}} (add any additional required ones to ARROW-13339)
* support for a sufficient set of data types in aggregates
* support for a sufficient set of data types in grouping columns
* handling of {{NA}} and {{NaN}} values in aggregates and the {{na.rm}} option
consistent with base R and dplyr (ARROW-13497 and possibly other issues)
* handling of {{NA}} and {{NaN}} values in grouping columns consistent with
dplyr
* handling of empty or invalid input to {{summarise()}} (ARROW-13543)
* many new tests to confirm equivalent results from a variety of
{{group_by() %>% summarise()}} queries on data frames and on Arrow data
> [R] Use Arrow engine for summarize() by default
> -------------------------------------------------
>
> Key: ARROW-13618
> URL: https://issues.apache.org/jira/browse/ARROW-13618
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Ian Cook
> Assignee: Ian Cook
> Priority: Major
> Fix For: 6.0.0