paleolimbot commented on code in PR #13541: URL: https://github.com/apache/arrow/pull/13541#discussion_r923720163
########## r/tests/testthat/test-dplyr-query.R: ########## @@ -433,3 +433,343 @@ test_that("query_can_stream()", { query_can_stream() ) }) + +test_that("show_exec_plan(), show_query() and explain()", { + # minimal test - this fails if we don't coerce the input to `show_exec_plan()` + # to be an `arrow_dplyr_query` + expect_output( Review Comment: Can you use `expect_snapshot()` for these? Or is the output not the same across platforms? ########## r/tests/testthat/test-dataset-dplyr.R: ########## @@ -340,3 +340,82 @@ test_that("dplyr method not implemented messages", { fixed = TRUE ) }) + +test_that("show_exec_plan(), show_query() and explain() with datasets", { + ds <- open_dataset(dataset_dir, partitioning = schema(part = uint8())) Review Comment: See below - I think you should try `expect_snapshot()` to make these less verbose. I also think that you don't need this many tests - just enough to cover the code you've added. If you extract `ExecPlan_prepare()`, which I think you should, then you only need one or two tests per verb (in my opinion). To me as a reviewer, the extra tests imply extra code paths which may be misleading. ########## r/R/dplyr.R: ########## @@ -219,6 +219,45 @@ tail.arrow_dplyr_query <- function(x, n = 6L, ...) { x } +#' Show the details of an Arrow Execution Plan +#' +#' This is a function which gives more details about the logical query plan +#' that will be executed when evaluating an `arrow_dplyr_query` object. +#' It calls the C++ `ExecPlan` object's print method. +#' Functionally, it is similar to `dplyr::explain()`. +#' +#' @param x an `arrow_dplyr_query` to print the `ExecPlan` for. +#' +#' @return `x`, invisibly. +#' @export +#' +#' @examplesIf arrow_with_dataset() & requireNamespace("dplyr", quietly = TRUE) Review Comment: ```suggestion #' @examplesIf arrow_with_dataset() && requireNamespace("dplyr", quietly = TRUE) ``` ########## r/R/query-engine.R: ########## @@ -259,9 +260,40 @@ ExecPlan <- R6Class("ExecPlan", ... ) }, + ToString = function() ExecPlan_ToString(self), Review Comment: Is there any circumstance in which `ToString()` should return something that isn't `ToStringWithSink()`? I would personally expect `ToString()` to always return the exec plan that will be executed, in other words what you have above: ```r adq <- as_adq(x) plan <- ExecPlan$create() final_node <- plan$Build(adq) plan$ToStringWithSink(final_node) ``` You could also remove this function since I don't think the output gets used anywhere (unless I'm missing some tests). ########## r/R/query-engine.R: ########## @@ -259,9 +260,40 @@ ExecPlan <- R6Class("ExecPlan", ... ) }, + ToString = function() ExecPlan_ToString(self), + # SinkNodes (involved in arrange and/or head/tail operations) are created in + # ExecPlan_run and are not captured by the regulat print method. We take a Review Comment: ```suggestion # ExecPlan_run and are not captured by the regular print method. We take a ``` ########## r/src/compute-exec.cpp: ########## @@ -125,6 +125,46 @@ std::shared_ptr<arrow::Schema> ExecNode_output_schema( return node->output_schema(); } +// [[arrow::export]] +std::string ExecPlan_ToString(const std::shared_ptr<compute::ExecPlan>& plan) { + return plan->ToString(); +} Review Comment: See above comment - I think you should remove this since the output would be misleading (alternatively, provide a way to access this value in `show_exec_plan()` and document why it is useful). ########## r/src/compute-exec.cpp: ########## @@ -125,6 +125,46 @@ std::shared_ptr<arrow::Schema> ExecNode_output_schema( return node->output_schema(); } +// [[arrow::export]] +std::string ExecPlan_ToString(const std::shared_ptr<compute::ExecPlan>& plan) { + return plan->ToString(); +} + +// [[arrow::export]] +std::string ExecPlan_ToStringWithSink( + const std::shared_ptr<compute::ExecPlan>& plan, + const std::shared_ptr<compute::ExecNode>& final_node, cpp11::list sort_options, + int64_t head = -1) { Review Comment: There's some duplication here that I'm worried somebody might forget when updating ExecPlan_run. One strategy is to add a comment to ExecPlan_run (and here) noting that the two implementations must be kept in sync. Probably better is to copy part of what I started in ARROW-16444 ( https://github.com/paleolimbot/arrow/blob/r-udf/r/src/compute-exec.cpp#L60-L91 ) that extracts the common functionality into one function. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org