Jonathan Keane created ARROW-13434:
--------------------------------------

             Summary: [R] group_by() with an expression
                 Key: ARROW-13434
                 URL: https://issues.apache.org/jira/browse/ARROW-13434
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Jonathan Keane


With dplyr, calling group_by() with an expression adds a column to the 
data frame that holds the result of that expression.

{code}
> example_data %>% 
+   group_by(int < 4) %>% collect()
# A tibble: 10 x 8
# Groups:   int < 4 [3]
     int   dbl  dbl2 lgl   false chr   fct   `int < 4`
   <int> <dbl> <dbl> <lgl> <lgl> <chr> <fct> <lgl>    
 1     1   1.1     5 TRUE  FALSE a     a     TRUE     
 2     2   2.1     5 NA    FALSE b     b     TRUE     
 3     3   3.1     5 TRUE  FALSE c     c     TRUE     
 4    NA   4.1     5 FALSE FALSE d     d     NA       
 5     5   5.1     5 TRUE  FALSE e     NA    FALSE    
 6     6   6.1     5 NA    FALSE NA    NA    FALSE    
 7     7   7.1     5 NA    FALSE g     g     FALSE    
 8     8   8.1     5 FALSE FALSE h     h     FALSE    
 9     9  NA       5 FALSE FALSE i     i     FALSE    
10    10  10.1     5 NA    FALSE j     j     FALSE    
{code}

Arrow doesn't do this, however. The same call fails because the expression is 
looked up as if it were a field name:

{code}
> Table$create(example_data) %>% 
+   group_by(int < 4) %>% collect()
 Error: Invalid: No match for FieldRef.Name(int < 4) in int: int32
dbl: double
dbl2: double
lgl: bool
false: bool
chr: string
fct: dictionary<values=string, indices=int8, ordered=0> 
{code}

This isn't a big deal right now, since grouped aggregations aren't (quite) here 
yet. Once they are supported, though, people will run examples like this. This 
might actually be something we need or want to do in C++ rather than in the R 
client.

The workaround is relatively simple: compute the expression as a new column in 
mutate(), then group_by() that column.
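
A minimal sketch of that workaround, under a few assumptions: the small tibble below is a hypothetical stand-in for example_data (only two of its columns), and int_lt_4 is an illustrative column name, not the `int < 4` name that dplyr generates on its own. It also assumes an arrow version in which mutate() is supported on a Table.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for example_data, reduced to two columns.
example_data <- tibble(
  int = c(1:3, NA, 5:10),
  dbl = c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1, NA, 10.1)
)

# Materialize the expression as a named column first,
# then group by that column instead of by the expression.
result <- Table$create(example_data) %>%
  mutate(int_lt_4 = int < 4) %>%
  group_by(int_lt_4) %>%
  collect()
```

Unlike the dplyr result above, the grouping column here carries whatever name is given in mutate(), so downstream code has to refer to it by that name.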



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
