[jira] [Updated] (ARROW-13434) [R] group_by() with an unnamed expression

Jonathan Keane (Jira) Thu, 22 Jul 2021 07:06:21 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jonathan Keane updated ARROW-13434:
-----------------------------------
    Description: 
With dplyr, when we group_by with an unnamed expression, a column holding the 
result of the expression is added to the data frame.

{code}
> example_data %>% 
+   group_by(int < 4) %>% collect()
# A tibble: 10 x 8
# Groups:   int < 4 [3]
     int   dbl  dbl2 lgl   false chr   fct   `int < 4`
   <int> <dbl> <dbl> <lgl> <lgl> <chr> <fct> <lgl>    
 1     1   1.1     5 TRUE  FALSE a     a     TRUE     
 2     2   2.1     5 NA    FALSE b     b     TRUE     
 3     3   3.1     5 TRUE  FALSE c     c     TRUE     
 4    NA   4.1     5 FALSE FALSE d     d     NA       
 5     5   5.1     5 TRUE  FALSE e     NA    FALSE    
 6     6   6.1     5 NA    FALSE NA    NA    FALSE    
 7     7   7.1     5 NA    FALSE g     g     FALSE    
 8     8   8.1     5 FALSE FALSE h     h     FALSE    
 9     9  NA       5 FALSE FALSE i     i     FALSE    
10    10  10.1     5 NA    FALSE j     j     FALSE    
{code}

Arrow doesn't do this, however, because we (currently) only add columns when 
the expression is named.

{code}
> Table$create(example_data) %>% 
+   group_by(int < 4) %>% collect()
 Error: Invalid: No match for FieldRef.Name(int < 4) in int: int32
dbl: double
dbl2: double
lgl: bool
false: bool
chr: string
fct: dictionary<values=string, indices=int8, ordered=0> 
{code}
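
The note above, that columns are only added when the expression is named, suggests that naming the expression inside group_by() avoids the error. A minimal sketch, not taken from the issue: the data frame below is a small hypothetical stand-in for example_data, the column name int_lt_4 is illustrative, and it assumes an arrow release whose dplyr backend accepts named expressions in group_by().

```r
library(dplyr)
library(arrow)

# Hypothetical stand-in for example_data; only the `int` column matters here.
example_data <- data.frame(
  int = c(1L, 2L, 3L, NA, 5L, 6L, 7L, 8L, 9L, 10L),
  dbl = c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1, NA, 10.1)
)

# Give the grouping expression an explicit name so a column can be added for it.
# `int_lt_4` is an illustrative name, not one used in the issue.
named_result <- Table$create(example_data) %>%
  group_by(int_lt_4 = int < 4) %>%
  collect()
```

If this behaves as the description implies, named_result carries an extra logical column called int_lt_4 rather than `int < 4`.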

This isn't a big deal right now since grouped aggregations aren't (quite) here 
yet, but once we start having support for that, we will have people using 
examples like this. 

  was:
With dplyr, when we group_by with an expression, a column holding the result 
of the expression is added to the data frame.

{code}
> example_data %>% 
+   group_by(int < 4) %>% collect()
# A tibble: 10 x 8
# Groups:   int < 4 [3]
     int   dbl  dbl2 lgl   false chr   fct   `int < 4`
   <int> <dbl> <dbl> <lgl> <lgl> <chr> <fct> <lgl>    
 1     1   1.1     5 TRUE  FALSE a     a     TRUE     
 2     2   2.1     5 NA    FALSE b     b     TRUE     
 3     3   3.1     5 TRUE  FALSE c     c     TRUE     
 4    NA   4.1     5 FALSE FALSE d     d     NA       
 5     5   5.1     5 TRUE  FALSE e     NA    FALSE    
 6     6   6.1     5 NA    FALSE NA    NA    FALSE    
 7     7   7.1     5 NA    FALSE g     g     FALSE    
 8     8   8.1     5 FALSE FALSE h     h     FALSE    
 9     9  NA       5 FALSE FALSE i     i     FALSE    
10    10  10.1     5 NA    FALSE j     j     FALSE    
{code}

Arrow doesn't do this, however:

{code}
> Table$create(example_data) %>% 
+   group_by(int < 4) %>% collect()
 Error: Invalid: No match for FieldRef.Name(int < 4) in int: int32
dbl: double
dbl2: double
lgl: bool
false: bool
chr: string
fct: dictionary<values=string, indices=int8, ordered=0> 
{code}

This isn't a big deal right now since grouped aggregations aren't (quite) here 
yet, but once we start having support for that, we will have people using 
examples like this. This might actually be something we need/want to do in C++ 
instead of in the R client.

The workaround is relatively simple: add the expression in a mutate, then 
group_by that.
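
The workaround described above can be sketched roughly as follows. This is an illustration under assumptions, not code from the issue: the data frame is a small hypothetical stand-in for example_data, and the column name int_lt_4 is made up for the example.

```r
library(dplyr)
library(arrow)

# Hypothetical stand-in for example_data; only the `int` column matters here.
example_data <- data.frame(
  int = c(1L, 2L, 3L, NA, 5L, 6L, 7L, 8L, 9L, 10L),
  dbl = c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1, NA, 10.1)
)

# Step 1: materialize the expression as a named column with mutate().
# Step 2: group by that column instead of by the bare expression.
# `int_lt_4` is an illustrative name, not one used in the issue.
result <- Table$create(example_data) %>%
  mutate(int_lt_4 = int < 4) %>%
  group_by(int_lt_4) %>%
  collect()
```

The grouping column ends up named int_lt_4 rather than `int < 4`, so downstream code that relies on the dplyr-generated name would need adjusting.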


> [R] group_by() with an unnamed expression
> -----------------------------------------
>
>                 Key: ARROW-13434
>                 URL: https://issues.apache.org/jira/browse/ARROW-13434
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Jonathan Keane
>            Priority: Major
>
> With dplyr, when we group_by with an unnamed expression, a column holding 
> the result of the expression is added to the data frame.
> {code}
> > example_data %>% 
> +   group_by(int < 4) %>% collect()
> # A tibble: 10 x 8
> # Groups:   int < 4 [3]
>      int   dbl  dbl2 lgl   false chr   fct   `int < 4`
>    <int> <dbl> <dbl> <lgl> <lgl> <chr> <fct> <lgl>    
>  1     1   1.1     5 TRUE  FALSE a     a     TRUE     
>  2     2   2.1     5 NA    FALSE b     b     TRUE     
>  3     3   3.1     5 TRUE  FALSE c     c     TRUE     
>  4    NA   4.1     5 FALSE FALSE d     d     NA       
>  5     5   5.1     5 TRUE  FALSE e     NA    FALSE    
>  6     6   6.1     5 NA    FALSE NA    NA    FALSE    
>  7     7   7.1     5 NA    FALSE g     g     FALSE    
>  8     8   8.1     5 FALSE FALSE h     h     FALSE    
>  9     9  NA       5 FALSE FALSE i     i     FALSE    
> 10    10  10.1     5 NA    FALSE j     j     FALSE    
> {code}
> Arrow doesn't do this, however, because we (currently) only add columns 
> when the expression is named.
> {code}
> > Table$create(example_data) %>% 
> +   group_by(int < 4) %>% collect()
>  Error: Invalid: No match for FieldRef.Name(int < 4) in int: int32
> dbl: double
> dbl2: double
> lgl: bool
> false: bool
> chr: string
> fct: dictionary<values=string, indices=int8, ordered=0> 
> {code}
> This isn't a big deal right now since grouped aggregations aren't (quite) 
> here yet, but once we start having support for that, we will have people 
> using examples like this. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
