nealrichardson commented on code in PR #14371:
URL: https://github.com/apache/arrow/pull/14371#discussion_r992302239
##########
r/R/arrow-package.R:
##########
@@ -27,7 +27,7 @@
#' @importFrom rlang is_list call2 is_empty as_function as_label arg_match
is_symbol is_call call_args
#' @importFrom rlang quo_set_env quo_get_env is_formula quo_is_call f_rhs
parse_expr f_env new_quosure
#' @importFrom rlang new_quosures expr_text
-#' @importFrom tidyselect vars_pull vars_rename vars_select eval_select
+#' @importFrom tidyselect vars_pull vars_select eval_select eval_rename
Review Comment:
Do we need to update the other usages of vars_select in the package too?
##########
r/R/util.R:
##########
@@ -251,3 +251,25 @@ augment_io_error_msg <- function(e, call, schema = NULL,
format = NULL) {
handle_augmented_field_misuse(msg, call)
abort(msg, call = call)
}
+
+simulate_data_frame <- function(schema) {
+
Review Comment:
Add a comment explaining why we need this function.
And do we need to export this? I think I saw some discussion between
@paleolimbot and @krlmlr about needing this function in DBI or adbc.
##########
r/R/util.R:
##########
@@ -251,3 +251,25 @@ augment_io_error_msg <- function(e, call, schema = NULL,
format = NULL) {
handle_augmented_field_misuse(msg, call)
abort(msg, call = call)
}
+
+simulate_data_frame <- function(schema) {
Review Comment:
Since this is going to be called every time someone does
`select/rename/relocate`, I'd like for this function to be cheaper. In other
PRs I've been noticing the overhead of creating R6 objects, which generally is
not terrible (~150 microseconds on my machine) but it adds up. And here, we're
creating lots of objects we're throwing away: for each column, we create a
Field, then a DataType from that, then in concat_arrays, we create a null
DataType, an Array with that, and a new Array that is cast to the correct
DataType. That adds up to around 1ms per column, every time this function is
called. That's enough to get noticed.
Can we move this to C++? Should be a simple enough switch statement to map
Arrow type ids to the corresponding R length-0 vector.
##########
r/R/dplyr-select.R:
##########
@@ -115,18 +90,39 @@ relocate.arrow_dplyr_query <- function(.data, ..., .before
= NULL, .after = NULL
}
relocate.Dataset <- relocate.ArrowTabular <- relocate.RecordBatchReader <-
relocate.arrow_dplyr_query
-check_select_helpers <- function(exprs) {
- # Throw an error if unsupported tidyselect selection helpers in `exprs`
- exprs <- lapply(exprs, function(x) if (is_quosure(x)) quo_get_expr(x) else x)
- unsup_select_helpers <- "where"
- funs_in_exprs <- unlist(lapply(exprs, all_funs))
- unsup_funs <- funs_in_exprs[funs_in_exprs %in% unsup_select_helpers]
- if (length(unsup_funs)) {
- stop(
- "Unsupported selection ",
- ngettext(length(unsup_funs), "helper: ", "helpers: "),
- oxford_paste(paste0(unsup_funs, "()"), quote = FALSE),
- call. = FALSE
- )
+column_select <- function(.data, select_expression, op = c("select",
"rename")) {
+ op <- match.arg(op)
+
+ .data <- as_adq(.data)
+ sim_df <- simulate_data_frame(implicit_schema(.data))
+ old_names <- names(sim_df)
+
+ if (op == "select") {
+ out <- eval_select(expr(c(!!!select_expression)), sim_df)
+ # select only columns from `out`
+ subset <- out
+ } else if (op == "rename") {
+ out <- eval_rename(expr(c(!!!select_expression)), sim_df)
+ # select all columns as only renaming
+ subset <- set_names(seq_along(old_names), old_names)
+ names(subset)[out] <- names(out)
+ }
+
+ .data$selected_columns <- set_names(.data$selected_columns[subset],
names(subset))
+
+ # check if names have updated
+ new_names <- old_names
+ new_names[out] <- names(out)
+ names_compared <- set_names(old_names, new_names)
+ renamed <- names_compared[names(names_compared) != names_compared]
Review Comment:
Same thing but I think the intent is clearer:
```suggestion
renamed <- names_compared[old_names != new_names]
```
##########
r/data-raw/docgen.R:
##########
@@ -127,10 +127,7 @@ docs <- arrow:::.cache$docs
# Add some functions
# across() is handled by manipulating the quosures, not by nse_funcs
-docs[["dplyr::across"]] <- c(
- # TODO(ARROW-17384): implement where
- "Use of `where()` selection helper not yet supported"
-)
+docs[["dplyr::across"]] <- character(0)
Review Comment:
🎉
##########
r/R/util.R:
##########
@@ -251,3 +251,25 @@ augment_io_error_msg <- function(e, call, schema = NULL,
format = NULL) {
handle_augmented_field_misuse(msg, call)
abort(msg, call = call)
}
+
+simulate_data_frame <- function(schema) {
+
+ arrays <- lapply(
+ schema$fields,
+ function(field) tryCatch(
+ concat_arrays(type = field$type),
+ error = function(...) concat_arrays(type = NULL)
Review Comment:
Why would this error ever occur?
##########
r/R/util.R:
##########
@@ -251,3 +251,25 @@ augment_io_error_msg <- function(e, call, schema = NULL,
format = NULL) {
handle_augmented_field_misuse(msg, call)
abort(msg, call = call)
}
+
+simulate_data_frame <- function(schema) {
+
+ arrays <- lapply(
+ schema$fields,
+ function(field) tryCatch(
+ concat_arrays(type = field$type),
+ error = function(...) concat_arrays(type = NULL)
+ )
+ )
+
+ vectors <- lapply(
+ arrays,
+ function(array) tryCatch(
+ as.vector(array),
+ error = function(...) vctrs::unspecified()
+ )
+ )
+
+ names(vectors) <- names(schema)
+ tibble::new_tibble(vectors, nrow = 0)
Review Comment:
```suggestion
vctrs::new_data_frame(vectors, nrow = 0)
```
and we should import that not :: it
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]