ianmcook commented on a change in pull request #10992:
URL: https://github.com/apache/arrow/pull/10992#discussion_r695831416
##########
File path: r/R/query-engine.R
##########
@@ -42,11 +55,73 @@ ExecPlan <- R6Class("ExecPlan",
}
# ScanNode needs the filter to do predicate pushdown and skip partitions,
# and it needs to know which fields to materialize (and which are
unnecessary)
- ExecNode_Scan(self, dataset, filter, colnames)
+ ExecNode_Scan(self, dataset, filter, colnames %||% character(0))
+ },
+ Build = function(.data) {
+ # This method takes an arrow_dplyr_query and chains together the
+ # ExecNodes that they produce. It does not evaluate them--that is Run().
+ group_vars <- dplyr::group_vars(.data)
+ grouped <- length(group_vars) > 0
+
+ # Collect the target names first because we have to add back the group
vars
+ target_names <- names(.data)
+ .data <- ensure_group_vars(.data)
+ .data <- ensure_arrange_vars(.data) # this sets .data$temp_columns
+
+ node <- self$Scan(.data)
+ # ARROW-13498: Even though Scan takes the filter, apparently we have to
do it again
+ if (inherits(.data$filtered_rows, "Expression")) {
+ node <- node$Filter(.data$filtered_rows)
+ }
+ # If any columns are derived we need to Project (otherwise this may be
no-op)
+ node <- node$Project(c(.data$selected_columns, .data$temp_columns))
+
+ if (length(.data$aggregations)) {
+ if (grouped) {
+ # We need to prefix all of the aggregation function names with
"hash_"
+ .data$aggregations <- lapply(.data$aggregations, function(x) {
+ x[["fun"]] <- paste0("hash_", x[["fun"]])
+ x
+ })
+ }
+
+ node <- node$Aggregate(
+ options = .data$aggregations,
+ target_names = target_names,
+ out_field_names = names(.data$aggregations),
+ key_names = group_vars
+ )
+
+ if (grouped) {
+ # The result will have result columns first then the grouping cols.
+ # dplyr orders group cols first, so adapt the result to meet that
expectation.
Review comment:
For context: some databases will sort by the group columns if the result
set is small or if the grouping columns have low cardinality, giving users an
expectation that results will always be sorted by the group columns. But then
when the result is huge or the grouping columns have high cardinality, they
will return unsorted results (because sorting would have a large cost). This is
an insidious behavior that causes users a lot of bafflement and grief. There is
some benefit in just ripping off the band aid immediately and never returning
sorted results, to avoid giving users the impression that they should expect
that.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]