Jonathan Keane created ARROW-13434:
--------------------------------------
Summary: [R] group_by() with an expression
Key: ARROW-13434
URL: https://issues.apache.org/jira/browse/ARROW-13434
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Jonathan Keane
With dplyr, when we group_by() with an expression, a column holding the
result of that expression is added to the data frame.
{code}
> example_data %>%
+ group_by(int < 4) %>% collect()
# A tibble: 10 x 8
# Groups: int < 4 [3]
     int   dbl  dbl2 lgl   false chr   fct   `int < 4`
   <int> <dbl> <dbl> <lgl> <lgl> <chr> <fct> <lgl>
 1     1   1.1     5 TRUE  FALSE a     a     TRUE
 2     2   2.1     5 NA    FALSE b     b     TRUE
 3     3   3.1     5 TRUE  FALSE c     c     TRUE
 4    NA   4.1     5 FALSE FALSE d     d     NA
 5     5   5.1     5 TRUE  FALSE e     NA    FALSE
 6     6   6.1     5 NA    FALSE NA    NA    FALSE
 7     7   7.1     5 NA    FALSE g     g     FALSE
 8     8   8.1     5 FALSE FALSE h     h     FALSE
 9     9  NA       5 FALSE FALSE i     i     FALSE
10    10  10.1     5 NA    FALSE j     j     FALSE
{code}
Arrow doesn't do this, however:
{code}
> Table$create(example_data) %>%
+ group_by(int < 4) %>% collect()
Error: Invalid: No match for FieldRef.Name(int < 4) in int: int32
dbl: double
dbl2: double
lgl: bool
false: bool
chr: string
fct: dictionary<values=string, indices=int8, ordered=0>
{code}
This isn't a big deal right now since grouped aggregations aren't (quite) here
yet, but once we support them, people will use examples like this. This might
actually be something we need/want to do in C++ instead of in the R client.
The workaround is relatively simple: add the expression as a new column with
mutate(), then group_by() that column.
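A minimal sketch of that workaround, assuming the arrow and dplyr packages are
installed and that mutate() on a Table supports the `<` comparison. The
example_data here is a cut-down stand-in rebuilt from the dplyr output earlier
in this issue, and the column name int_lt_4 is made up for illustration:

```r
library(arrow)
library(dplyr)

# Cut-down stand-in for example_data; only the `int` column matters here.
example_data <- tibble::tibble(
  int = c(1:3, NA, 5:10),
  dbl = c(1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1, NA, 10.1)
)

# Materialise the expression as a named column first, then group by that
# column. `int_lt_4` is a hypothetical name chosen for this sketch.
result <- Table$create(example_data) %>%
  mutate(int_lt_4 = int < 4) %>%
  group_by(int_lt_4) %>%
  collect()
```

Note that the grouping column is named int_lt_4 here rather than `int < 4` as
in the dplyr output.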