paleolimbot edited a comment on pull request #11690:
URL: https://github.com/apache/arrow/pull/11690#issuecomment-977888123
Ok...summary of the changes:
- This now uses `arrow_not_supported()` for `check.rows` and `row.names` in the `data.frame()` translation
- Added tests for using literals and existing data frames
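
For reference, the reprex below also exercises `check.names` and `fix.empty.names`. A minimal base-R sketch of how `data.frame()` names its columns under those arguments, with no Arrow involved (the variable names here are illustrative only, not from the PR):

```r
# Base R only; `a` stands in for the column used in the reprex.
a <- 1:2

# check.names = TRUE (the default) de-duplicates names via make.names()
dedup_names <- names(data.frame(a, a))

# check.names = FALSE keeps the duplicated names as they are
dup_names <- names(data.frame(a, a, check.names = FALSE))

# fix.empty.names = FALSE leaves an unnamed argument with an empty name
empty_names <- names(data.frame(a, fix.empty.names = FALSE))

print(dedup_names)
print(dup_names)
print(empty_names)
```

These are the names the Arrow translation would presumably need to reproduce for the nested column.
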
<details>
``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
df <- record_batch(a = 1:2)
# "normal"
df %>% mutate(df_col = tibble(a2 = a)) %>% collect()
#> # A tibble: 2 × 2
#> a df_col$a2
#> <int> <int>
#> 1 1 1
#> 2 2 2
df %>% mutate(df_col = data.frame(a2 = a)) %>% collect()
#> # A tibble: 2 × 2
#> a df_col$a2
#> <int> <int>
#> 1 1 1
#> 2 2 2
# scalars and existing data frames
df %>% mutate(df_col = tibble(a2 = "nested value")) %>% collect()
#> # A tibble: 2 × 2
#> a df_col$a2
#> <int> <chr>
#> 1 1 nested value
#> 2 2 nested value
one_row_df <- tibble(a2 = "nested value")
df %>% mutate(df_col = one_row_df) %>% collect()
#> # A tibble: 2 × 2
#> a df_col$a2
#> <int> <chr>
#> 1 1 nested value
#> 2 2 nested value
# this is surprising behaviour (to me) of Scalar$create(c("nested value", "nested value2"))
df %>%
  mutate(df_col = tibble(a2 = c("nested value", "nested value2"))) %>%
  collect()
#> # A tibble: 2 × 2
#> a df_col$a2
#> <int> <list<character>>
#> 1 1 [2]
#> 2 2 [2]
# opened https://issues.apache.org/jira/browse/ARROW-14828 to fix this
two_row_df <- tibble(a2 = c("nested value", "nested value2"))
df %>% mutate(df_col = two_row_df) %>% collect()
#> # A tibble: 2 × 2
#> a df_col$a2
#> <int> <chr>
#> 1 1 nested value
#> 2 2 nested value
# duplicated cols
df %>% mutate(df_col = tibble(a, a)) %>% collect()
#> Warning: Expression tibble(a, a) not supported in Arrow; pulling data into R
#> Error: Problem with `mutate()` column `df_col`.
#> ℹ `df_col = tibble(a, a)`.
#> x Column name `a` must not be duplicated.
#> Use .name_repair to specify repair.
df %>% mutate(df_col = data.frame(a, a)) %>% collect()
#> # A tibble: 2 × 2
#> a df_col$a $a.1
#> <int> <int> <int>
#> 1 1 1 1
#> 2 2 2 2
df %>% mutate(df_col = data.frame(a, a, check.names = FALSE)) %>% collect()
#> # A tibble: 2 × 2
#> a df_col$a $a
#> <int> <int> <int>
#> 1 1 1 1
#> 2 2 2 2
# empty names
df %>%
  mutate(df_col = data.frame(a, check.names = TRUE, fix.empty.names = TRUE)) %>%
  collect()
#> # A tibble: 2 × 2
#> a df_col$a
#> <int> <int>
#> 1 1 1
#> 2 2 2
df %>%
  mutate(df_col = data.frame(a, check.names = TRUE, fix.empty.names = FALSE)) %>%
  collect()
#> # A tibble: 2 × 2
#> a df_col$``
#> <int> <int>
#> 1 1 1
#> 2 2 2
df %>%
  mutate(df_col = data.frame(a, check.names = FALSE, fix.empty.names = TRUE)) %>%
  collect()
#> # A tibble: 2 × 2
#> a df_col$a
#> <int> <int>
#> 1 1 1
#> 2 2 2
df %>%
  mutate(df_col = data.frame(a, check.names = FALSE, fix.empty.names = FALSE)) %>%
  collect()
#> # A tibble: 2 × 2
#> a df_col$``
#> <int> <int>
#> 1 1 1
#> 2 2 2
# arrow_not_supported
df %>% mutate(df_col = tibble(a, .rows = 1L)) %>% collect()
#> Warning: In tibble(a, .rows = 1L), .rows not supported in Arrow; pulling data
#> into R
#> # A tibble: 2 × 2
#> a df_col$a
#> <int> <int>
#> 1 1 1
#> 2 2 2
df %>% mutate(df_col = tibble(a, .name_repair = "universal")) %>% collect()
#> Warning: In tibble(a, .name_repair = "universal"), .name_repair not supported in
#> Arrow; pulling data into R
#> # A tibble: 2 × 2
#> a df_col$a
#> <int> <int>
#> 1 1 1
#> 2 2 2
df %>% mutate(df_col = data.frame(a, check.rows = TRUE)) %>% collect()
#> Warning: In data.frame(a, check.rows = TRUE), check.rows not supported in Arrow;
#> pulling data into R
#> # A tibble: 2 × 2
#> a df_col$a
#> <int> <int>
#> 1 1 1
#> 2 2 2
df %>% mutate(df_col = data.frame(a, row.names = TRUE)) %>% collect()
#> Warning: In data.frame(a, row.names = TRUE), row.names not supported in Arrow;
#> pulling data into R
#> # A tibble: 2 × 2
#> a df_col
#> <int> <named list>
#> 1 1 <NULL>
#> 2 2 <NULL>
```
<sup>Created on 2021-11-24 by the [reprex
package](https://reprex.tidyverse.org) (v2.0.1)</sup>
</details>
I *didn't* add tests for
`mutate(df_col = tibble(a2 = c("nested value", "nested value2")))` or
`mutate(df_col = two_row_df)` because both give results that surprised me and
that don't match what dplyr would give you. I think they should be fixed and
tested at the `Scalar$create()` level rather than here, but I'm happy to add
more tests here given some guidance on the desired behaviour. I opened
ARROW-14828 for `Scalar$create()` on a two-row data.frame and ARROW-14855 for
general handling of non-size-one values passed to `build_expr()`.
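
For comparison, a minimal sketch of what plain dplyr does with the two-row cases, using an ordinary tibble rather than a RecordBatch (the names `inline` and `prebuilt` are illustrative only):

```r
library(dplyr, warn.conflicts = FALSE)

# An in-memory tibble, so this shows dplyr's own behaviour without Arrow
df <- tibble(a = 1:2)
two_row_df <- tibble(a2 = c("nested value", "nested value2"))

# Inline tibble() call: one nested value per row
inline <- df %>%
  mutate(df_col = tibble(a2 = c("nested value", "nested value2")))

# Pre-built two-row data frame: should give the same nested column
prebuilt <- df %>% mutate(df_col = two_row_df)

print(inline$df_col$a2)
print(prebuilt$df_col$a2)
```

In both cases dplyr pairs each row with its own nested value, so this is presumably the behaviour the two JIRA issues would aim for.
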