[GitHub] [arrow] nealrichardson commented on a diff in pull request #13541: ARROW-15016: [R] `show_exec_plan` for an `arrow_dplyr_query`

GitBox Thu, 21 Jul 2022 13:43:29 -0700


nealrichardson commented on code in PR #13541:
URL: https://github.com/apache/arrow/pull/13541#discussion_r926715529



##########
r/tests/testthat/test-dplyr-query.R:
##########
@@ -433,3 +433,343 @@ test_that("query_can_stream()", {
       query_can_stream()
   )
 })
+
+test_that("show_exec_plan(), show_query() and explain()", {
+  # minimal test - this fails if we don't coerce the input to `show_exec_plan()`
+  # to be an `arrow_dplyr_query`
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
+      "ProjectNode.*",             # output columns
+      "TableSourceNode"            # entry point
+    )
+  )
+
+  # minimal test - show_query()
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      show_query(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
+      "ProjectNode.*",             # output new columns
+      "TableSourceNode"            # entry point
+    )
+  )
+
+  # minimal test - explain()
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      explain(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
+      "ProjectNode.*",             # output columns
+      "TableSourceNode"            # entry point
+    )
+  )
+
+  # arrow_table and mutate
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      filter(dbl > 2, chr != "e") %>%
+      select(chr, int, lgl) %>%
+      mutate(int_plus_ten = int + 10) %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",           # boiler plate for ExecPlan
+      "chr, int, lgl, \"int_plus_ten\".*",   # selected columns
+      "FilterNode.*",                        # filter node
+      "(dbl > 2).*",                         # filter expressions
+      "chr != \"e\".*",
+      "TableSourceNode"                      # entry point
+    )
+  )
+
+  # arrow_table and mutate - show_query()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      filter(dbl > 2, chr != "e") %>%
+      select(chr, int, lgl) %>%
+      mutate(int_plus_ten = int + 10) %>%
+      show_query(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",           # boiler plate for ExecPlan
+      "chr, int, lgl, \"int_plus_ten\".*",   # selected columns
+      "FilterNode.*",                        # filter node
+      "(dbl > 2).*",                         # filter expressions
+      "chr != \"e\".*",
+      "TableSourceNode"                      # entry point
+    )
+  )
+
+  # arrow_table and mutate - explain()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      filter(dbl > 2, chr != "e") %>%
+      select(chr, int, lgl) %>%
+      mutate(int_plus_ten = int + 10) %>%
+      explain(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",           # boiler plate for ExecPlan
+      "chr, int, lgl, \"int_plus_ten\".*",   # selected columns
+      "FilterNode.*",                        # filter node
+      "(dbl > 2).*",                         # filter expressions
+      "chr != \"e\".*",
+      "TableSourceNode"                      # entry point
+    )
+  )
+
+  # record_batch and mutate
+  expect_output(
+    tbl %>%
+      record_batch() %>%
+      filter(dbl > 2, chr != "e") %>%
+      select(chr, int, lgl) %>%
+      mutate(int_plus_ten = int + 10) %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",           # boiler plate for ExecPlan
+      "chr, int, lgl, \"int_plus_ten\".*",   # selected columns
+      "(dbl > 2).*",                         # the filter expressions
+      "chr != \"e\".*",
+      "TableSourceNode"                      # the entry point"
+    )
+  )
+
+  # test with group_by and summarise
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      group_by(lgl) %>%
+      summarise(avg = mean(dbl, na.rm = TRUE)) %>%
+      ungroup() %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",            # boiler plate for ExecPlan
+      "ProjectNode.*",                        # output columns
+      "GroupByNode.*",                        # the group_by statement
+      "keys=.*lgl.*",                         # the key for the aggregations
+      "aggregates=.*hash_mean.*avg.*",        # the aggregations
+      "ProjectNode.*",                        # the input columns
+      "TableSourceNode"                       # the entry point
+    )
+  )
+
+  # test with group_by and summarise - show_query()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      group_by(lgl) %>%
+      summarise(avg = mean(dbl, na.rm = TRUE)) %>%
+      ungroup() %>%

Review Comment:
   Is `ungroup()` necessary in all of these tests?



##########
r/src/compute-exec.cpp:
##########
@@ -125,6 +138,16 @@ std::shared_ptr<arrow::Schema> ExecNode_output_schema(
   return node->output_schema();
 }
 
+// [[arrow::export]]
+std::string ExecPlan_BuildAndShow(const std::shared_ptr<compute::ExecPlan>& plan,
+                                  const std::shared_ptr<compute::ExecNode>& final_node,
+                                  cpp11::list sort_options, cpp11::strings metadata,
+                                  int64_t head = -1) {
+  auto prepared_plan = ExecPlan_prepare(plan, final_node, sort_options, metadata, head);
+  arrow::StopIfNotOk(prepared_plan.first->StartProducing());

Review Comment:
   IIUC this starts evaluating the ExecPlan, which we don't want to do. 
   
   ```suggestion
   ```
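
A minimal sketch of the distinction this comment draws, assuming a toy `ToyPlan` type (hypothetical, standard library only, not Arrow's `compute::ExecPlan`). It only illustrates that rendering a plan can read the node list without ever starting execution; the names `ToyPlan`, `AddNode`, and `BuildAndShow` are invented for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a query plan. This is NOT Arrow's compute::ExecPlan.
class ToyPlan {
 public:
  void AddNode(std::string label) { nodes_.push_back(std::move(label)); }

  // Rendering only reads the node list; it never starts execution.
  std::string ToString() const {
    std::ostringstream out;
    out << "ToyPlan with " << nodes_.size() << " nodes:\n";
    // Print sink-first so the source node comes last.
    for (auto it = nodes_.rbegin(); it != nodes_.rend(); ++it) {
      out << *it << "\n";
    }
    return out.str();
  }

  // Execution is a separate, explicit step.
  void StartProducing() { started_ = true; }
  bool started() const { return started_; }

 private:
  std::vector<std::string> nodes_;
  bool started_ = false;
};

// Hypothetical analogue of a "build and show" helper: it attaches the sink
// node and renders the plan, and deliberately does not call StartProducing().
std::string BuildAndShow(ToyPlan& plan) {
  plan.AddNode("SinkNode");
  return plan.ToString();
}
```

In this toy model, calling `BuildAndShow()` leaves `started()` false, which is the property the suggestion above is after: the plan is printed but never evaluated.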



##########
r/tests/testthat/test-dplyr-query.R:
##########
@@ -433,3 +433,343 @@ test_that("query_can_stream()", {
       query_can_stream()
   )
 })
+
+test_that("show_exec_plan(), show_query() and explain()", {
+  # minimal test - this fails if we don't coerce the input to `show_exec_plan()`
+  # to be an `arrow_dplyr_query`
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
+      "ProjectNode.*",             # output columns
+      "TableSourceNode"            # entry point
+    )
+  )
+
+  # minimal test - show_query()
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      show_query(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
+      "ProjectNode.*",             # output new columns
+      "TableSourceNode"            # entry point
+    )
+  )
+
+  # minimal test - explain()
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      explain(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
+      "ProjectNode.*",             # output columns
+      "TableSourceNode"            # entry point
+    )
+  )
+
+  # arrow_table and mutate
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      filter(dbl > 2, chr != "e") %>%
+      select(chr, int, lgl) %>%
+      mutate(int_plus_ten = int + 10) %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",           # boiler plate for ExecPlan
+      "chr, int, lgl, \"int_plus_ten\".*",   # selected columns
+      "FilterNode.*",                        # filter node
+      "(dbl > 2).*",                         # filter expressions
+      "chr != \"e\".*",
+      "TableSourceNode"                      # entry point
+    )
+  )
+
+  # arrow_table and mutate - show_query()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      filter(dbl > 2, chr != "e") %>%
+      select(chr, int, lgl) %>%
+      mutate(int_plus_ten = int + 10) %>%
+      show_query(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",           # boiler plate for ExecPlan
+      "chr, int, lgl, \"int_plus_ten\".*",   # selected columns
+      "FilterNode.*",                        # filter node
+      "(dbl > 2).*",                         # filter expressions
+      "chr != \"e\".*",
+      "TableSourceNode"                      # entry point
+    )
+  )
+
+  # arrow_table and mutate - explain()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      filter(dbl > 2, chr != "e") %>%
+      select(chr, int, lgl) %>%
+      mutate(int_plus_ten = int + 10) %>%
+      explain(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",           # boiler plate for ExecPlan
+      "chr, int, lgl, \"int_plus_ten\".*",   # selected columns
+      "FilterNode.*",                        # filter node
+      "(dbl > 2).*",                         # filter expressions
+      "chr != \"e\".*",
+      "TableSourceNode"                      # entry point
+    )
+  )
+
+  # record_batch and mutate
+  expect_output(
+    tbl %>%
+      record_batch() %>%
+      filter(dbl > 2, chr != "e") %>%
+      select(chr, int, lgl) %>%
+      mutate(int_plus_ten = int + 10) %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",           # boiler plate for ExecPlan
+      "chr, int, lgl, \"int_plus_ten\".*",   # selected columns
+      "(dbl > 2).*",                         # the filter expressions
+      "chr != \"e\".*",
+      "TableSourceNode"                      # the entry point"
+    )
+  )
+
+  # test with group_by and summarise
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      group_by(lgl) %>%
+      summarise(avg = mean(dbl, na.rm = TRUE)) %>%
+      ungroup() %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",            # boiler plate for ExecPlan
+      "ProjectNode.*",                        # output columns
+      "GroupByNode.*",                        # the group_by statement
+      "keys=.*lgl.*",                         # the key for the aggregations
+      "aggregates=.*hash_mean.*avg.*",        # the aggregations
+      "ProjectNode.*",                        # the input columns
+      "TableSourceNode"                       # the entry point
+    )
+  )
+
+  # test with group_by and summarise - show_query()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      group_by(lgl) %>%
+      summarise(avg = mean(dbl, na.rm = TRUE)) %>%
+      ungroup() %>%
+      show_query(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",            # boiler plate for ExecPlan
+      "ProjectNode.*",                        # output columns
+      "GroupByNode.*",                        # the group_by statement
+      "keys=.*lgl.*",                         # the key for the aggregations
+      "aggregates=.*hash_mean.*avg.*",        # the aggregations
+      "ProjectNode.*",                        # the input columns
+      "TableSourceNode"                       # the entry point
+    )
+  )
+
+  # test with group_by and summarise - explain()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      group_by(lgl) %>%
+      summarise(avg = mean(dbl, na.rm = TRUE)) %>%
+      ungroup() %>%
+      explain(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",            # boiler plate for ExecPlan
+      "ProjectNode.*",                        # output columns
+      "GroupByNode.*",                        # group_by statement
+      "keys=.*lgl.*",                         # key for the aggregations
+      "aggregates=.*hash_mean.*avg.*",        # aggregations
+      "ProjectNode.*",                        # input columns
+      "TableSourceNode"                       # entry point
+    )
+  )
+
+  # test with join
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      left_join(
+        example_data %>%
+          arrow_table() %>%
+          mutate(doubled_dbl = dbl * 2) %>%
+          select(int, doubled_dbl),
+        by = "int"
+      ) %>%
+      select(int, verses, doubled_dbl) %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",              # boiler plate for ExecPlan
+      "ProjectNode.*",                          # output columns
+      "HashJoinNode.*",                         # the join
+      "ProjectNode.*",                          # input columns for the second table
+      "\"doubled_dbl\"\\: multiply_checked\\(dbl, 2\\).*", # mutate
+      "TableSourceNode.*",                      # second table
+      "ProjectNode.*",                          # input columns for the first table
+      "TableSourceNode"                         # first table
+    )
+  )
+
+  # test with join - show_query()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      left_join(
+        example_data %>%
+          arrow_table() %>%
+          mutate(doubled_dbl = dbl * 2) %>%
+          select(int, doubled_dbl),
+        by = "int"
+      ) %>%
+      select(int, verses, doubled_dbl) %>%
+      show_query(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",              # boiler plate for ExecPlan
+      "ProjectNode.*",                          # output columns
+      "HashJoinNode.*",                         # join
+      "ProjectNode.*",                          # input columns for the second table
+      "\"doubled_dbl\"\\: multiply_checked\\(dbl, 2\\).*", # the mutate
+      "TableSourceNode.*",                      # second table
+      "ProjectNode.*",                          # input columns for the first table
+      "TableSourceNode"                         # first table
+    )
+  )
+
+  # test with join - explain()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      left_join(
+        example_data %>%
+          arrow_table() %>%
+          mutate(doubled_dbl = dbl * 2) %>%
+          select(int, doubled_dbl),
+        by = "int"
+      ) %>%
+      select(int, verses, doubled_dbl) %>%
+      explain(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",              # boiler plate for ExecPlan
+      "ProjectNode.*",                          # output columns
+      "HashJoinNode.*",                         # join
+      "ProjectNode.*",                          # input columns for the second table
+      "\"doubled_dbl\"\\: multiply_checked\\(dbl, 2\\).*", # mutate
+      "TableSourceNode.*",                      # second table
+      "ProjectNode.*",                          # input columns for the first table
+      "TableSourceNode"                         # first table
+    )
+  )
+
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      filter(mpg > 20) %>%
+      arrange(desc(wt)) %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",   # boiler plate for ExecPlan
+      "OrderBySinkNode.*wt.*DESC.*", # arrange goes via the OrderBy sink node
+      "ProjectNode.*",               # output columns
+      "FilterNode.*",                # filter node
+      "TableSourceNode.*"            # entry point
+    )
+  )
+
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      filter(mpg > 20) %>%
+      arrange(desc(wt)) %>%
+      show_query(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",    # boiler plate for ExecPlan
+      "OrderBySinkNode.*wt.*DESC.*",  # arrange goes via the OrderBy sink node
+      "ProjectNode.*",                # output columns
+      "FilterNode.*",                 # filter node
+      "TableSourceNode.*"             # entry point
+    )
+  )
+
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      filter(mpg > 20) %>%
+      arrange(desc(wt)) %>%
+      explain(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",   # boiler plate for ExecPlan
+      "OrderBySinkNode.*wt.*DESC.*", # arrange goes via the OrderBy sink node
+      "ProjectNode.*",               # output columns
+      "FilterNode.*",                # filter node
+      "TableSourceNode.*"            # entry point
+    )
+  )
+
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      filter(mpg > 20) %>%
+      arrange(desc(wt)) %>%
+      head(3) %>%
+      show_exec_plan(),
+    # for some reason the FilterNode disappears when head/tail are involved +

Review Comment:
   These all sound worthy of future investigation and cleanup; please make a Jira.



##########
r/R/dplyr.R:
##########
@@ -219,6 +219,31 @@ tail.arrow_dplyr_query <- function(x, n = 6L, ...) {
   x
 }
 
+#' Show the details of an Arrow Execution Plan
+#'
+#' This is a function which gives more details about the Execution Plan (`ExecPlan`)
+#' of an `arrow_dplyr_query` object. It is similar to `dplyr::explain()`.
+#'
+#' @param x an `arrow_dplyr_query` to print the `ExecPlan` for.
+#'
+#' @return The argument, invisibly.
+#' @export
+#'
+#' @examplesIf arrow_with_dataset() & requireNamespace("dplyr", quietly = TRUE)
+#' library(dplyr)
+#' mtcars %>%
+#'   arrow_table() %>%
+#'   filter(mpg > 20) %>%
+#'   mutate(x = gear/carb) %>%
+#'   show_exec_plan()
+show_exec_plan <- function(x) {

Review Comment:
   My recommendation was the latter, but I don't object to also having a 
standalone show_exec_plan()



##########
r/R/query-engine.R:
##########
@@ -191,7 +192,7 @@ ExecPlan <- R6Class("ExecPlan",
       }
       node
     },
-    Run = function(node) {
+    Run = function(node, explain = FALSE) {

Review Comment:
   IIUC this is a really bad idea: you're evaluating the whole query just to 
print it. 



##########
r/R/dplyr.R:
##########
@@ -219,6 +219,45 @@ tail.arrow_dplyr_query <- function(x, n = 6L, ...) {
   x
 }
 
+#' Show the details of an Arrow Execution Plan
+#'
+#' This is a function which gives more details about the logical query plan
+#' that will be executed when evaluating an `arrow_dplyr_query` object.
+#' It calls the C++ `ExecPlan` object's print method.
+#' Functionally, it is similar to `dplyr::explain()`.

Review Comment:
   Probably worth documenting that this is used in (or is used as) the 
`dplyr::explain()` and `dplyr::show_query()` methods.



##########
r/R/query-engine.R:
##########
@@ -259,9 +260,39 @@ ExecPlan <- R6Class("ExecPlan",
         ...
       )
     },
+    # SinkNodes (involved in arrange and/or head/tail operations) are created in
+    # ExecPlan_run and are not captured by the regular print method. We take a
+    # similar approach to expose them before calling the print method.
+    BuildAndShow = function(node) {
+      assert_is(node, "ExecNode")
+
+      # Sorting and head/tail (if sorted) are handled in the SinkNode,
+      # created in ExecPlan_run
+      sorting <- node$extras$sort %||% list()
+      select_k <- node$extras$head %||% -1L
+      has_sorting <- length(sorting) > 0
+      if (has_sorting) {
+        if (!is.null(node$extras$tail)) {
+          # Reverse the sort order and take the top K, then after we'll reverse
+          # the resulting rows so that it is ordered as expected
+          sorting$orders <- !sorting$orders
+          select_k <- node$extras$tail
+        }
+        sorting$orders <- as.integer(sorting$orders)
+      }

Review Comment:
   This should really be factored out, not because we use it in a bunch of places, but because it's delicate logic that we don't want to screw up by touching it in one place and forgetting that it's copied elsewhere. For now, can you leave a code comment up in $Run() that points out that this code is also copied down in $BuildAndShow, so whenever we touch this in the future, we know to look in both places?
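
As a rough illustration of what factoring this out could look like, here is a standard-library-only sketch that mirrors the quoted R logic in one helper. All type and function names (`NodeExtras`, `SinkOptions`, `ResolveSinkOptions`) are hypothetical, and it assumes that a `true` order flag means "descending" and that `-1` stands for "not set"; the R package may encode these differently.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirror of node$extras; -1 means "not set" for head and tail.
struct NodeExtras {
  std::vector<std::string> sort_names;  // empty means "no sorting"
  std::vector<bool> sort_descending;    // one flag per sort key (assumed meaning)
  int64_t head = -1;
  int64_t tail = -1;
};

// What the sink node would be configured with.
struct SinkOptions {
  std::vector<std::string> names;
  std::vector<int> orders;  // 0 = ascending, 1 = descending (assumed encoding)
  int64_t select_k = -1;    // -1 means "no limit"
};

// Single home for the delicate sort/head/tail logic, so that both a
// Run-like and a BuildAndShow-like caller share it instead of copying it.
SinkOptions ResolveSinkOptions(const NodeExtras& extras) {
  SinkOptions out;
  out.select_k = extras.head;
  if (extras.sort_names.empty()) return out;  // no sorting requested

  const bool is_tail = extras.tail >= 0;
  // For tail: reverse the sort order and take the top K; the caller is
  // expected to reverse the resulting rows afterwards.
  if (is_tail) out.select_k = extras.tail;

  out.names = extras.sort_names;
  for (bool descending : extras.sort_descending) {
    const bool effective = is_tail ? !descending : descending;
    out.orders.push_back(effective ? 1 : 0);
  }
  return out;
}
```

With a single helper like this, a change to the tail handling happens in one place, which is the risk the comment above is pointing at.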



##########
r/tests/testthat/test-dplyr-query.R:
##########
@@ -433,3 +433,343 @@ test_that("query_can_stream()", {
       query_can_stream()
   )
 })
+
+test_that("show_exec_plan(), show_query() and explain()", {
+  # minimal test - this fails if we don't coerce the input to `show_exec_plan()`
+  # to be an `arrow_dplyr_query`
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
+      "ProjectNode.*",             # output columns
+      "TableSourceNode"            # entry point
+    )
+  )
+
+  # minimal test - show_query()
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      show_query(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
+      "ProjectNode.*",             # output new columns
+      "TableSourceNode"            # entry point
+    )
+  )
+
+  # minimal test - explain()
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      explain(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
+      "ProjectNode.*",             # output columns
+      "TableSourceNode"            # entry point
+    )
+  )
+
+  # arrow_table and mutate
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      filter(dbl > 2, chr != "e") %>%
+      select(chr, int, lgl) %>%
+      mutate(int_plus_ten = int + 10) %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",           # boiler plate for ExecPlan
+      "chr, int, lgl, \"int_plus_ten\".*",   # selected columns
+      "FilterNode.*",                        # filter node
+      "(dbl > 2).*",                         # filter expressions
+      "chr != \"e\".*",
+      "TableSourceNode"                      # entry point
+    )
+  )
+
+  # arrow_table and mutate - show_query()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      filter(dbl > 2, chr != "e") %>%
+      select(chr, int, lgl) %>%
+      mutate(int_plus_ten = int + 10) %>%
+      show_query(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",           # boiler plate for ExecPlan
+      "chr, int, lgl, \"int_plus_ten\".*",   # selected columns
+      "FilterNode.*",                        # filter node
+      "(dbl > 2).*",                         # filter expressions
+      "chr != \"e\".*",
+      "TableSourceNode"                      # entry point
+    )
+  )
+
+  # arrow_table and mutate - explain()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      filter(dbl > 2, chr != "e") %>%
+      select(chr, int, lgl) %>%
+      mutate(int_plus_ten = int + 10) %>%
+      explain(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",           # boiler plate for ExecPlan
+      "chr, int, lgl, \"int_plus_ten\".*",   # selected columns
+      "FilterNode.*",                        # filter node
+      "(dbl > 2).*",                         # filter expressions
+      "chr != \"e\".*",
+      "TableSourceNode"                      # entry point
+    )
+  )
+
+  # record_batch and mutate
+  expect_output(
+    tbl %>%
+      record_batch() %>%
+      filter(dbl > 2, chr != "e") %>%
+      select(chr, int, lgl) %>%
+      mutate(int_plus_ten = int + 10) %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",           # boiler plate for ExecPlan
+      "chr, int, lgl, \"int_plus_ten\".*",   # selected columns
+      "(dbl > 2).*",                         # the filter expressions
+      "chr != \"e\".*",
+      "TableSourceNode"                      # the entry point"
+    )
+  )
+
+  # test with group_by and summarise
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      group_by(lgl) %>%
+      summarise(avg = mean(dbl, na.rm = TRUE)) %>%
+      ungroup() %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",            # boiler plate for ExecPlan
+      "ProjectNode.*",                        # output columns
+      "GroupByNode.*",                        # the group_by statement
+      "keys=.*lgl.*",                         # the key for the aggregations
+      "aggregates=.*hash_mean.*avg.*",        # the aggregations
+      "ProjectNode.*",                        # the input columns
+      "TableSourceNode"                       # the entry point
+    )
+  )
+
+  # test with group_by and summarise - show_query()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      group_by(lgl) %>%
+      summarise(avg = mean(dbl, na.rm = TRUE)) %>%
+      ungroup() %>%
+      show_query(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",            # boiler plate for ExecPlan
+      "ProjectNode.*",                        # output columns
+      "GroupByNode.*",                        # the group_by statement
+      "keys=.*lgl.*",                         # the key for the aggregations
+      "aggregates=.*hash_mean.*avg.*",        # the aggregations
+      "ProjectNode.*",                        # the input columns
+      "TableSourceNode"                       # the entry point
+    )
+  )
+
+  # test with group_by and summarise - explain()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      group_by(lgl) %>%
+      summarise(avg = mean(dbl, na.rm = TRUE)) %>%
+      ungroup() %>%
+      explain(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",            # boiler plate for ExecPlan
+      "ProjectNode.*",                        # output columns
+      "GroupByNode.*",                        # group_by statement
+      "keys=.*lgl.*",                         # key for the aggregations
+      "aggregates=.*hash_mean.*avg.*",        # aggregations
+      "ProjectNode.*",                        # input columns
+      "TableSourceNode"                       # entry point
+    )
+  )
+
+  # test with join
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      left_join(
+        example_data %>%
+          arrow_table() %>%
+          mutate(doubled_dbl = dbl * 2) %>%
+          select(int, doubled_dbl),
+        by = "int"
+      ) %>%
+      select(int, verses, doubled_dbl) %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",              # boiler plate for ExecPlan
+      "ProjectNode.*",                          # output columns
+      "HashJoinNode.*",                         # the join
+      "ProjectNode.*",                          # input columns for the second table
+      "\"doubled_dbl\"\\: multiply_checked\\(dbl, 2\\).*", # mutate
+      "TableSourceNode.*",                      # second table
+      "ProjectNode.*",                          # input columns for the first table
+      "TableSourceNode"                         # first table
+    )
+  )
+
+  # test with join - show_query()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      left_join(
+        example_data %>%
+          arrow_table() %>%
+          mutate(doubled_dbl = dbl * 2) %>%
+          select(int, doubled_dbl),
+        by = "int"
+      ) %>%
+      select(int, verses, doubled_dbl) %>%
+      show_query(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",              # boiler plate for ExecPlan
+      "ProjectNode.*",                          # output columns
+      "HashJoinNode.*",                         # join
+      "ProjectNode.*",                          # input columns for the second table
+      "\"doubled_dbl\"\\: multiply_checked\\(dbl, 2\\).*", # the mutate
+      "TableSourceNode.*",                      # second table
+      "ProjectNode.*",                          # input columns for the first table
+      "TableSourceNode"                         # first table
+    )
+  )
+
+  # test with join - explain()
+  expect_output(
+    tbl %>%
+      arrow_table() %>%
+      left_join(
+        example_data %>%
+          arrow_table() %>%
+          mutate(doubled_dbl = dbl * 2) %>%
+          select(int, doubled_dbl),
+        by = "int"
+      ) %>%
+      select(int, verses, doubled_dbl) %>%
+      explain(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",              # boiler plate for ExecPlan
+      "ProjectNode.*",                          # output columns
+      "HashJoinNode.*",                         # join
+      "ProjectNode.*",                          # input columns for the second table
+      "\"doubled_dbl\"\\: multiply_checked\\(dbl, 2\\).*", # mutate
+      "TableSourceNode.*",                      # second table
+      "ProjectNode.*",                          # input columns for the first table
+      "TableSourceNode"                         # first table
+    )
+  )
+
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      filter(mpg > 20) %>%
+      arrange(desc(wt)) %>%
+      show_exec_plan(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",   # boiler plate for ExecPlan
+      "OrderBySinkNode.*wt.*DESC.*", # arrange goes via the OrderBy sink node
+      "ProjectNode.*",               # output columns
+      "FilterNode.*",                # filter node
+      "TableSourceNode.*"            # entry point
+    )
+  )
+
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      filter(mpg > 20) %>%
+      arrange(desc(wt)) %>%
+      show_query(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",    # boiler plate for ExecPlan
+      "OrderBySinkNode.*wt.*DESC.*",  # arrange goes via the OrderBy sink node
+      "ProjectNode.*",                # output columns
+      "FilterNode.*",                 # filter node
+      "TableSourceNode.*"             # entry point
+    )
+  )
+
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      filter(mpg > 20) %>%
+      arrange(desc(wt)) %>%
+      explain(),
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*",   # boiler plate for ExecPlan
+      "OrderBySinkNode.*wt.*DESC.*", # arrange goes via the OrderBy sink node
+      "ProjectNode.*",               # output columns
+      "FilterNode.*",                # filter node
+      "TableSourceNode.*"            # entry point
+    )
+  )
+
+  expect_output(
+    mtcars %>%
+      arrow_table() %>%
+      filter(mpg > 20) %>%
+      arrange(desc(wt)) %>%
+      head(3) %>%
+      show_exec_plan(),
+    # for some reason the FilterNode disappears when head/tail are involved +
+    # we do not have additional information regarding the SinkNode +
+    # the entry point is now a SourceNode and not a TableSourceNode
+    regexp = paste0(
+      "ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
+      "SinkNode.*",                #
+      "ProjectNode.*",             # output columns
+      "SourceNode.*"               # entry point
+    )
+  )
+
+  expect_output(

Review Comment:
   Why are all of the tests copied 3 times? Seems like we only need one version 
that shows that show_exec_plan, show_query, and explain all do the same thing. 
Later, if/when explain does something different, we can add the tests that show 
that.
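
One way to picture the "single version" idea is a table-driven check that runs each alias against the same input and compares it with the reference output. The sketch below is a toy in standard C++ rather than testthat; `ShowExecPlan`, `ShowQuery`, `Explain`, and `AliasesAgree` are hypothetical stand-ins, not the R functions under review.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for the three entry points; all names are hypothetical.
std::string ShowExecPlan(const std::string& query) {
  return "ExecPlan for: " + query;
}
std::string ShowQuery(const std::string& query) { return ShowExecPlan(query); }
std::string Explain(const std::string& query) { return ShowExecPlan(query); }

// Returns true when every alias prints exactly what ShowExecPlan prints.
bool AliasesAgree(const std::string& query) {
  using Printer = std::function<std::string(const std::string&)>;
  const std::vector<std::pair<std::string, Printer>> aliases = {
      {"show_query", ShowQuery}, {"explain", Explain}};
  const std::string expected = ShowExecPlan(query);
  for (const auto& alias : aliases) {
    if (alias.second(query) != expected) return false;
  }
  return true;
}
```

The detailed plan-shape assertions would then be written once against the reference function, with one agreement check covering the aliases. If `explain` later diverges, it drops out of the alias table and gets its own tests.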



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
