[GitHub] [arrow] nealrichardson commented on a change in pull request #9745: ARROW-11703: [R] Implement dplyr::arrange() [WIP]

GitBox Fri, 19 Mar 2021 09:14:32 -0700


nealrichardson commented on a change in pull request #9745:
URL: https://github.com/apache/arrow/pull/9745#discussion_r597793008




##########
File path: r/R/array.R
##########
@@ -291,3 +299,11 @@ is.Array <- function(x, type = NULL) {
   }
   is_it
 }
+
+#' @export
+sort.Array <- function(x, decreasing = FALSE, na.last = TRUE, ...) {

Review comment:
       Is there a reason you can't make this `sort.ArrowDatum`? 

##########
File path: r/R/array.R
##########
@@ -291,3 +299,11 @@ is.Array <- function(x, type = NULL) {
   }
   is_it
 }
+
+#' @export
+sort.Array <- function(x, decreasing = FALSE, na.last = TRUE, ...) {
+  if (!identical(na.last, TRUE)) {
+    stop("Arrow only supports sort() with na.last = TRUE", call. = FALSE)

Review comment:
       Do we need a C++ JIRA for this?
   
   Is there an R workaround that would allow supporting this?

##########
File path: r/R/dplyr.R
##########
@@ -390,6 +409,25 @@ collect.arrow_dplyr_query <- function(x, as_data_frame = 
TRUE, ...) {
         tab <- RecordBatch$create(!!!cols)
       }
     }
+    # Arrange rows
+    # TODO: support sorting by expressions, not just field names

Review comment:
       So you can sort on an expression, but that expression won't just 
`mutate()`? Hmm.

##########
File path: r/R/dplyr.R
##########
@@ -631,17 +669,98 @@ abandon_ship <- function(call, .data, msg = NULL) {
   eval.parent(call, 2)
 }
 
-arrange.arrow_dplyr_query <- function(.data, ...) {
+arrange.arrow_dplyr_query <- function(.data, ..., .by_group = FALSE) {
+  # TODO: handle .by_group argument
+  if (!isFALSE(.by_group)) {
+    stop(".by_group argument not supported for Arrow objects", call. = FALSE)
+  }
+
+  call <- match.call()
+  exprs <- quos(...)
+
+  if (length(exprs) == 0) {
+    # Nothing to do
+    return(.data)
+  }
+
   .data <- arrow_dplyr_query(.data)
+
+  # Restrict the cases we support for now
   if (query_on_dataset(.data)) {
     not_implemented_for_dataset("arrange()")
   }
-  # TODO(ARROW-11703) move this to Arrow
-  call <- match.call()
-  abandon_ship(call, .data)
+  is_grouped <- length(dplyr::group_vars(.data)) > 0

Review comment:
       does this matter if .by_group is FALSE?

##########
File path: r/R/dplyr.R
##########
@@ -390,6 +409,25 @@ collect.arrow_dplyr_query <- function(x, as_data_frame = 
TRUE, ...) {
         tab <- RecordBatch$create(!!!cols)
       }
     }
+    # Arrange rows
+    # TODO: support sorting by expressions, not just field names
+    # TODO: support sorting by names and expressions that are valid *before* 
the

Review comment:
       This is solvable if you use the same machinery that we do for filter and 
mutate: the expressions they build always map all the way down to the 
underlying data. 
   

##########
File path: r/R/dplyr.R
##########
@@ -46,7 +46,12 @@ arrow_dplyr_query <- function(.data) {
       # drop_empty_groups is a logical value indicating whether to drop
       # groups formed by factor levels that don't appear in the data. It
       # should be non-null only when the data is grouped.
-      drop_empty_groups = NULL
+      drop_empty_groups = NULL,
+      # arrange_vars will be a list of expressions
+      arrange_vars = list(),

Review comment:
       Why isn't this a character vector like `group_by_vars`?

##########
File path: r/R/array.R
##########
@@ -131,6 +133,12 @@ Array <- R6Class("Array",
       assert_is(i, "Array")
       call_function("filter", self, i, options = list(keep_na = keep_na))
     },
+    SortIndices = function(descending = FALSE) {

Review comment:
       IMO `SortIndices` feels a little odd as an R6 method. An alternative 
approach you could take would be like `match_arrow` in compute.R: create an 
`order_arrow` function that behaves like `order`, just to encapsulate any logic 
of how to format the inputs etc. Then you can call that in `sort` and `arrange`.

##########
File path: r/R/dplyr.R
##########
@@ -631,17 +669,98 @@ abandon_ship <- function(call, .data, msg = NULL) {
   eval.parent(call, 2)
 }
 
-arrange.arrow_dplyr_query <- function(.data, ...) {
+arrange.arrow_dplyr_query <- function(.data, ..., .by_group = FALSE) {
+  # TODO: handle .by_group argument
+  if (!isFALSE(.by_group)) {
+    stop(".by_group argument not supported for Arrow objects", call. = FALSE)
+  }
+
+  call <- match.call()
+  exprs <- quos(...)
+
+  if (length(exprs) == 0) {
+    # Nothing to do
+    return(.data)
+  }
+
   .data <- arrow_dplyr_query(.data)
+
+  # Restrict the cases we support for now
   if (query_on_dataset(.data)) {
     not_implemented_for_dataset("arrange()")
   }
-  # TODO(ARROW-11703) move this to Arrow
-  call <- match.call()
-  abandon_ship(call, .data)
+  is_grouped <- length(dplyr::group_vars(.data)) > 0
+  if (is_grouped) {
+    return(abandon_ship(call, .data, 'arrange() on grouped data not supported 
in Arrow'))
+  }
+  is_dataset <- query_on_dataset(.data)
+  if (is_dataset) {
+    return(abandon_ship(call, .data))
+  }
+
+  # find and remove any dplyr::desc() and tidy-eval
+  # the arrange expressions inside an Arrow data_mask
+  sorts <- vector("list", length(exprs))
+  descs <- logical(0)
+  mask <- arrow_mask(.data)
+  for (i in seq_along(exprs)) {
+    x <- find_and_remove_desc(exprs[[i]])
+    exprs[[i]] <- x[["quos"]]
+    sorts[[i]] <- arrow_eval(exprs[[i]], mask)

Review comment:
       Ah so you are using `arrow_eval`, so this should handle expressions and 
later mutating just fine (though of course you should trust the tests and not 
take my word for it).

##########
File path: r/tests/testthat/helper-data.R
##########
@@ -134,3 +134,17 @@ example_with_logical_factors <- tibble::tibble(
     "hey buddy"
   )
 )
+
+# the values in each column of this tibble are in ascending order in the C 
locale.
+# there are some ties, but sorting by any two columns will give a 
deterministic order.
+example_data_for_sorting <- tibble::tibble(
+  int = c(-.Machine$integer.max, -101L, -100L, 0L, 0L, 1L, 100L, 1000L, 
.Machine$integer.max, NA_integer_),
+  dbl = c(-Inf, -.Machine$double.xmax, -.Machine$double.xmin, 0, 
.Machine$double.xmin, pi, .Machine$double.xmax, Inf, NaN, NA_real_),
+  # R string collation varies by locale, while libarrow always uses the C 
locale for string collation
+  # (in other words: string values in libarrow are ordered lexicographically 
as bytestrings)
+  # to make R sort functions use the C locale, run Sys.setlocale("LC_COLLATE", 
"C")

Review comment:
       Doesn't testthat set that?

##########
File path: r/R/dplyr.R
##########
@@ -46,7 +46,12 @@ arrow_dplyr_query <- function(.data) {
       # drop_empty_groups is a logical value indicating whether to drop
       # groups formed by factor levels that don't appear in the data. It
       # should be non-null only when the data is grouped.
-      drop_empty_groups = NULL
+      drop_empty_groups = NULL,
+      # arrange_vars will be a list of expressions
+      arrange_vars = list(),
+      # arrange_desc will be a logical vector indicating the sort order for 
each
+      # expression in arrange_vars (FALSE for ascending, TRUE for descending)
+      arrange_desc = logical()

Review comment:
       Alternatively, `arrange_vars` could be a logical vector with names being 
the column names and values being the `descending` logical value.

##########
File path: r/R/expression.R
##########
@@ -181,6 +181,20 @@ find_array_refs <- function(x) {
   unlist(out)
 }
 
+get_field_name <-  function(x, msg = NULL) {

Review comment:
       How does this relate to `get_field_names()` in dplyr.R?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #9745: ARROW-11703: [R] Implement dplyr::arrange() [WIP]

Reply via email to