westonpace commented on code in PR #13563:
URL: https://github.com/apache/arrow/pull/13563#discussion_r918160386
##########
r/R/dplyr.R:
##########
@@ -276,13 +278,48 @@ source_data <- function(x) {
}
}
-is_collapsed <- function(x) inherits(x$.data, "arrow_dplyr_query")
+all_sources <- function(x) {
+ if (is.null(x)) {
+ x
+ } else if (!inherits(x, "arrow_dplyr_query")) {
+ list(x)
+ } else {
+ c(
+ all_sources(x$.data),
+ all_sources(x$join$right_data),
+ all_sources(x$union_all$right_data)
+ )
+ }
+}
-has_aggregation <- function(x) {
- # TODO: update with joins (check right side data too)
- !is.null(x$aggregations) || (is_collapsed(x) && has_aggregation(x$.data))
+query_can_stream <- function(x) {
Review Comment:
> I was just thinking that having such a method on ExecPlan would be useful
in general.
Possibly. We'd probably want to define it more formally. SQL has `LIMIT X`
and Substrait's equivalent is
[`FetchRel`](https://substrait.io/relations/logical_relations/#fetch-operation).
Neither of these are exactly what is being detected here. For example, it is
legal to have `SELECT SUM(x) FROM table LIMIT 1` but it wouldn't actually limit
any data being read.
We could define it as "single pipeline queries" but a pipeline breaker
doesn't necessarily mean a query is non-streaming (for example, hash-join is
sometimes permitted as "streaming" in this example but it is always a pipeline
breaker).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]