nealrichardson commented on a change in pull request #9745: URL: https://github.com/apache/arrow/pull/9745#discussion_r597793008
########## File path: r/R/array.R ########## @@ -291,3 +299,11 @@ is.Array <- function(x, type = NULL) { } is_it } + +#' @export +sort.Array <- function(x, decreasing = FALSE, na.last = TRUE, ...) { Review comment: Is there a reason you can't make this `sort.ArrowDatum`? ########## File path: r/R/array.R ########## @@ -291,3 +299,11 @@ is.Array <- function(x, type = NULL) { } is_it } + +#' @export +sort.Array <- function(x, decreasing = FALSE, na.last = TRUE, ...) { + if (!identical(na.last, TRUE)) { + stop("Arrow only supports sort() with na.last = TRUE", call. = FALSE) Review comment: Do we need a C++ JIRA for this? Is there an R workaround that would allow supporting this? ########## File path: r/R/dplyr.R ########## @@ -390,6 +409,25 @@ collect.arrow_dplyr_query <- function(x, as_data_frame = TRUE, ...) { tab <- RecordBatch$create(!!!cols) } } + # Arrange rows + # TODO: support sorting by expressions, not just field names Review comment: So you can sort on an expression, but that expression won't just `mutate()`? Hmm. ########## File path: r/R/dplyr.R ########## @@ -631,17 +669,98 @@ abandon_ship <- function(call, .data, msg = NULL) { eval.parent(call, 2) } -arrange.arrow_dplyr_query <- function(.data, ...) { +arrange.arrow_dplyr_query <- function(.data, ..., .by_group = FALSE) { + # TODO: handle .by_group argument + if (!isFALSE(.by_group)) { + stop(".by_group argument not supported for Arrow objects", call. = FALSE) + } + + call <- match.call() + exprs <- quos(...) + + if (length(exprs) == 0) { + # Nothing to do + return(.data) + } + .data <- arrow_dplyr_query(.data) + + # Restrict the cases we support for now if (query_on_dataset(.data)) { not_implemented_for_dataset("arrange()") } - # TODO(ARROW-11703) move this to Arrow - call <- match.call() - abandon_ship(call, .data) + is_grouped <- length(dplyr::group_vars(.data)) > 0 Review comment: does this matter if .by_group is FALSE? ########## File path: r/R/dplyr.R ########## @@ -390,6 +409,25 @@ collect.arrow_dplyr_query <- function(x, as_data_frame = TRUE, ...) { tab <- RecordBatch$create(!!!cols) } } + # Arrange rows + # TODO: support sorting by expressions, not just field names + # TODO: support sorting by names and expressions that are valid *before* the Review comment: This is solvable if you use the same machinery that we do for filter and mutate: the expressions they build always map all the way down to the underlying data. ########## File path: r/R/dplyr.R ########## @@ -46,7 +46,12 @@ arrow_dplyr_query <- function(.data) { # drop_empty_groups is a logical value indicating whether to drop # groups formed by factor levels that don't appear in the data. It # should be non-null only when the data is grouped. - drop_empty_groups = NULL + drop_empty_groups = NULL, + # arrange_vars will be a list of expressions + arrange_vars = list(), Review comment: Why isn't this a character vector like `group_by_vars`? ########## File path: r/R/array.R ########## @@ -131,6 +133,12 @@ Array <- R6Class("Array", assert_is(i, "Array") call_function("filter", self, i, options = list(keep_na = keep_na)) }, + SortIndices = function(descending = FALSE) { Review comment: IMO `SortIndices` feels a little odd as an R6 method. An alternative approach you could take would be like `match_arrow` in compute.R: create an `order_arrow` function that behaves like `order`, just to encapsulate any logic of how to format the inputs etc. Then you can call that in `sort` and `arrange`. ########## File path: r/R/dplyr.R ########## @@ -631,17 +669,98 @@ abandon_ship <- function(call, .data, msg = NULL) { eval.parent(call, 2) } -arrange.arrow_dplyr_query <- function(.data, ...) { +arrange.arrow_dplyr_query <- function(.data, ..., .by_group = FALSE) { + # TODO: handle .by_group argument + if (!isFALSE(.by_group)) { + stop(".by_group argument not supported for Arrow objects", call. = FALSE) + } + + call <- match.call() + exprs <- quos(...) + + if (length(exprs) == 0) { + # Nothing to do + return(.data) + } + .data <- arrow_dplyr_query(.data) + + # Restrict the cases we support for now if (query_on_dataset(.data)) { not_implemented_for_dataset("arrange()") } - # TODO(ARROW-11703) move this to Arrow - call <- match.call() - abandon_ship(call, .data) + is_grouped <- length(dplyr::group_vars(.data)) > 0 + if (is_grouped) { + return(abandon_ship(call, .data, 'arrange() on grouped data not supported in Arrow')) + } + is_dataset <- query_on_dataset(.data) + if (is_dataset) { + return(abandon_ship(call, .data)) + } + + # find and remove any dplyr::desc() and tidy-eval + # the arrange expressions inside an Arrow data_mask + sorts <- vector("list", length(exprs)) + descs <- logical(0) + mask <- arrow_mask(.data) + for (i in seq_along(exprs)) { + x <- find_and_remove_desc(exprs[[i]]) + exprs[[i]] <- x[["quos"]] + sorts[[i]] <- arrow_eval(exprs[[i]], mask) Review comment: Ah so you are using `arrow_eval`, so this should handle expressions and later mutating just fine (though of course you should trust the tests and not take my word for it). ########## File path: r/tests/testthat/helper-data.R ########## @@ -134,3 +134,17 @@ example_with_logical_factors <- tibble::tibble( "hey buddy" ) ) + +# the values in each column of this tibble are in ascending order in the C locale. +# there are some ties, but sorting by any two columns will give a deterministic order. +example_data_for_sorting <- tibble::tibble( + int = c(-.Machine$integer.max, -101L, -100L, 0L, 0L, 1L, 100L, 1000L, .Machine$integer.max, NA_integer_), + dbl = c(-Inf, -.Machine$double.xmax, -.Machine$double.xmin, 0, .Machine$double.xmin, pi, .Machine$double.xmax, Inf, NaN, NA_real_), + # R string collation varies by locale, while libarrow always uses the C locale for string collation + # (in other words: string values in libarrow are ordered lexicographically as bytestrings) + # to make R sort functions use the C locale, run Sys.setlocale("LC_COLLATE", "C") Review comment: Doesn't testthat set that? ########## File path: r/R/dplyr.R ########## @@ -46,7 +46,12 @@ arrow_dplyr_query <- function(.data) { # drop_empty_groups is a logical value indicating whether to drop # groups formed by factor levels that don't appear in the data. It # should be non-null only when the data is grouped. - drop_empty_groups = NULL + drop_empty_groups = NULL, + # arrange_vars will be a list of expressions + arrange_vars = list(), + # arrange_desc will be a logical vector indicating the sort order for each + # expression in arrange_vars (FALSE for ascending, TRUE for descending) + arrange_desc = logical() Review comment: Alternatively, `arrange_vars` could be a logical vector with names being the column names and values being the `descending` logical value. ########## File path: r/R/expression.R ########## @@ -181,6 +181,20 @@ find_array_refs <- function(x) { unlist(out) } +get_field_name <- function(x, msg = NULL) { Review comment: How does this relate to `get_field_names()` in dplyr.R? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org