[jira] [Updated] (ARROW-13618) [R] Use Arrow engine for summarize() by default

Ian Cook (Jira) Thu, 12 Aug 2021 10:22:06 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-13618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ian Cook updated ARROW-13618:
-----------------------------
    Description: 
ARROW-13344 enabled the dplyr verb {{summarise()}} to use the Arrow engine but 
kept this off by default, controlled by the {{arrow.debug}} option.

Before this can be turned on by default, we should ensure that the following 
are all implemented:
 * a sufficient set of hash aggregate kernels and R aggregate function mappings 
to them, covering the vast majority of all aggregate functions that dplyr users 
call in {{summarise()}} (add any additional required ones to ARROW-13339)
 * support for a sufficient set of data types in aggregates
 * support for a sufficient set of data types in grouping columns
 * handling of {{NA}} and {{NaN}} values in aggregates and the {{na.rm}} option 
consistent with base R and dplyr (ARROW-13497 and possibly other issues)
 * handling of {{NA}} and {{NaN}} values in grouping columns consistent with 
dplyr
 * handling empty or bad input to {{summarise()}} (ARROW-13543)
 * many new tests to confirm equivalent results from a variety of {{group_by() 
%>% summarise()}} queries on data frames and on Arrow data

  was:
ARROW-13344 enabled the dplyr verb {{summarise()}} to use the Arrow engine but 
kept this off by default, controlled by the {{arrow.debug}} option.

Before this can be turned on by default, we should ensure that the following 
are all implemented:
 * a sufficient set of hash aggregate kernels and R aggregate function mappings 
to them, covering the vast majority of all aggregate functions that dplyr users 
call in {{summarise()}} (add any additional required ones to ARROW-13339)
 * support for a sufficient set of data types in aggregates\{{}}
 * support for a sufficient set of data types in grouping columns
 * handling of {{NA}} and {{NaN}} values in aggregates and the {{na.rm}} option 
consistent with base R and dplyr (ARROW-13497 and possibly other issues)
 * handling of {{NA}} and {{NaN}} values in grouping columns consistent with 
dplyr
 * handling empty or bad input to {{summarise()}} (ARROW-13543)
 * many new tests to confirm equivalent results from a variety of {{group_by() 
%>% summarise()}} queries on data frames and on Arrow data


> [R] Use Arrow engine for summarize() by default  
> -------------------------------------------------
>
>                 Key: ARROW-13618
>                 URL: https://issues.apache.org/jira/browse/ARROW-13618
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Ian Cook
>            Assignee: Ian Cook
>            Priority: Major
>             Fix For: 6.0.0
>
>
> ARROW-13344 enabled the dplyr verb {{summarise()}} to use the Arrow engine 
> but kept this off by default, controlled by the {{arrow.debug}} option.
> Before this can be turned on by default, we should ensure that the following 
> are all implemented:
>  * a sufficient set of hash aggregate kernels and R aggregate function 
> mappings to them, covering the vast majority of all aggregate functions that 
> dplyr users call in {{summarise()}} (add any additional required ones to 
> ARROW-13339)
>  * support for a sufficient set of data types in aggregates
>  * support for a sufficient set of data types in grouping columns
>  * handling of {{NA}} and {{NaN}} values in aggregates and the {{na.rm}} 
> option consistent with base R and dplyr (ARROW-13497 and possibly other 
> issues)
>  * handling of {{NA}} and {{NaN}} values in grouping columns consistent with 
> dplyr
>  * handling empty or bad input to {{summarise()}} (ARROW-13543)
>  * many new tests to confirm equivalent results from a variety of 
> {{group_by() %>% summarise()}} queries on data frames and on Arrow data
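
The last item in the list above could be exercised with a comparison along these lines. This is a minimal sketch, not the project's actual test code: the data frame `df`, the column names `g` and `x`, and the chosen aggregates are hypothetical, and it assumes the dplyr and arrow packages are installed. On arrow versions where the Arrow engine for summarise() is still off by default, the option mentioned in the description may need to be enabled first.

```r
library(dplyr)

# Hypothetical example data: NA and NaN in the aggregated column,
# and an NA in the grouping column.
df <- data.frame(
  g = c("a", "a", "b", "b", NA),
  x = c(1, NA, 3, NaN, 5),
  stringsAsFactors = FALSE
)

# Reference result: dplyr on an in-memory data frame.
# In base R, na.rm = TRUE drops both NA and NaN before aggregating.
expected <- df %>%
  group_by(g) %>%
  summarise(total = sum(x, na.rm = TRUE), avg = mean(x, na.rm = TRUE))

# Same query on Arrow data, run only if the arrow package is available.
if (requireNamespace("arrow", quietly = TRUE)) {
  actual <- arrow::Table$create(df) %>%
    group_by(g) %>%
    summarise(total = sum(x, na.rm = TRUE), avg = mean(x, na.rm = TRUE)) %>%
    collect() %>%
    arrange(g)  # sort explicitly; do not rely on Arrow's group order

  # Prints TRUE if the results match, otherwise a description of the differences.
  print(all.equal(as.data.frame(expected), as.data.frame(actual)))
}
```

Any differences reported by `all.equal()`, for example in how `NaN` is treated when `na.rm = TRUE`, are the kind of mismatch the checklist is meant to catch before the default is changed.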



--
This message was sent by Atlassian Jira
(v8.3.4#803005)