[GitHub] [arrow] wjones127 commented on a diff in pull request #13563: ARROW-16776: [R] dplyr::glimpse method for arrow table and datasets

GitBox Mon, 11 Jul 2022 09:13:38 -0700


wjones127 commented on code in PR #13563:
URL: https://github.com/apache/arrow/pull/13563#discussion_r918118858



##########
r/R/dplyr.R:
##########
@@ -276,13 +278,48 @@ source_data <- function(x) {
   }
 }
 
-is_collapsed <- function(x) inherits(x$.data, "arrow_dplyr_query")
+all_sources <- function(x) {
+  if (is.null(x)) {
+    x
+  } else if (!inherits(x, "arrow_dplyr_query")) {
+    list(x)
+  } else {
+    c(
+      all_sources(x$.data),
+      all_sources(x$join$right_data),
+      all_sources(x$union_all$right_data)
+    )
+  }
+}
 
-has_aggregation <- function(x) {
-  # TODO: update with joins (check right side data too)
-  !is.null(x$aggregations) || (is_collapsed(x) && has_aggregation(x$.data))
+query_can_stream <- function(x) {

Review Comment:
   > We could build an ExecPlan, but it wouldn't tell us anything about how it would perform, would it?
   
   I'm not super close to the ExecPlan code, but I thought an ExecPlan was composed 
of a graph of nodes that could be traversed and analyzed, just like our 
`arrow_dplyr_query` structure. Am I wrong about that?
   
   > I'm trying to detect cases where I can just take head() of the data without having to scan an entire dataset.
   
   I was just thinking that having such a method on `ExecPlan` would be useful 
in general. 
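
   To illustrate the kind of traversal being discussed, here is a minimal sketch in plain R. It does not use the Arrow `ExecPlan` API: `plan_node()`, `blocking_kinds`, and `plan_can_stream()` are hypothetical names, and the set of node kinds treated as blocking is an assumption made only for the example.

```r
# Hypothetical stand-in for a plan node: a kind plus zero or more inputs.
# This is NOT the Arrow ExecPlan API, only an illustration of a graph walk.
plan_node <- function(kind, inputs = list()) {
  list(kind = kind, inputs = inputs)
}

# Assumption: these node kinds must consume all of their input before
# they can emit any rows, so a plan containing them cannot stream.
blocking_kinds <- c("aggregate", "order_by")

# Returns TRUE if no node reachable from `node` is blocking, in which
# case head() could in principle stop without scanning everything.
plan_can_stream <- function(node) {
  if (node$kind %in% blocking_kinds) {
    return(FALSE)
  }
  # Recurse into every input; one blocking input blocks the whole plan.
  for (input in node$inputs) {
    if (!plan_can_stream(input)) {
      return(FALSE)
    }
  }
  TRUE
}

src <- plan_node("scan")
filtered <- plan_node("filter", list(src))
aggregated <- plan_node("aggregate", list(filtered))
unioned <- plan_node("union", list(filtered, aggregated))

plan_can_stream(filtered)
plan_can_stream(unioned)
```

   Under these assumptions, the filter-only plan is reported as streamable, while the union is not, because one of its inputs contains an aggregation. A real check would have to decide per node type what counts as blocking.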



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
