dragosmg commented on code in PR #13541:
URL: https://github.com/apache/arrow/pull/13541#discussion_r925392721


##########
r/R/dplyr.R:
##########
@@ -219,6 +219,31 @@ tail.arrow_dplyr_query <- function(x, n = 6L, ...) {
   x
 }
 
+#' Show the details of an Arrow Execution Plan
+#'
+#' This is a function which gives more details about the Execution Plan (`ExecPlan`)
+#' of an `arrow_dplyr_query` object. It is similar to `dplyr::explain()`.
+#'
+#' @param x an `arrow_dplyr_query` to print the `ExecPlan` for.
+#'
+#' @return The argument, invisibly.
+#' @export
+#'
+#' @examplesIf arrow_with_dataset() & requireNamespace("dplyr", quietly = TRUE)
+#' library(dplyr)
+#' mtcars %>%
+#'   arrow_table() %>%
+#'   filter(mpg > 20) %>%
+#'   mutate(x = gear/carb) %>%
+#'   show_exec_plan()
+show_exec_plan <- function(x) {
+  adq <- as_adq(x)
+  plan <- ExecPlan$create()
+  final_node <- plan$Build(x)
+  cat(plan$ToString())

Review Comment:
   I think I understand what is going on here. We effectively have 2 ExecPlans 
because the `order_by_sink` option (involved in `arrange()`) is a _pipeline 
breaker_: it fully materialises the input into memory, which splits the 
ExecPlan into 2 parts. The first ExecPlan starts with a `TableSourceNode`, 
includes a `FilterNode` and most likely ends with an `OrderBySinkNode`. The 
second ExecPlan starts with a `SourceNode` and ends with the `SinkNode` 
corresponding to `head()`. The print method only captures the second (final) 
ExecPlan. Maybe we should think about how to handle situations in which 
multiple ExecPlans are involved (i.e. whenever an operation is a pipeline 
breaker).
   Is my understanding correct? @nealrichardson @jonkeane @paleolimbot
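
   For reference, a minimal sketch of the kind of query I have in mind. It 
assumes the `show_exec_plan()` from this PR is available in the installed 
`arrow` build; the query itself is only an illustration, and the nodes that 
actually get printed may differ from the ones I describe above.

```r
library(arrow)
library(dplyr)

# Illustrative query: arrange() comes before head(), so the sort is the
# step suspected of acting as a pipeline breaker.
query <- mtcars %>%
  arrow_table() %>%
  filter(mpg > 20) %>%
  arrange(desc(mpg)) %>%
  head(3)

# show_exec_plan() is the function added in this PR. Capture what it prints
# so the text can be checked for which nodes appear (and which do not).
plan_text <- capture.output(show_exec_plan(query))
cat(plan_text, sep = "\n")

# Collecting the query runs the whole pipeline, including any part of it
# that the printed plan might not show.
result <- collect(query)
```

   Comparing the printed plan with the steps in the query should make it easy 
to see whether the filter and sort show up in the output or only the final 
part does.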



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
