nealrichardson commented on issue #43627: URL: https://github.com/apache/arrow/issues/43627#issuecomment-2284259559
I think I've figured it out. The query plans look the same, but they aren't: the printed `0:SourceNode` doesn't show its options, and that conceals the difference. (I thought I wrote an issue years ago about improving those print methods, but I can't find it now.) The old path called `dplyr::select()` to subset the columns before the aggregation, and that projection gets pushed down into the SourceNode. The `1:ProjectNode` projection, which selects the columns for the `2:GroupByNode` aggregation, does not get pushed down. So without that `select()`, we read a bunch of data into memory from the Parquet files and immediately throw it away. I'll make a PR to add the `select()` back, and I'll fix the earlier comment that was misleading and led me to believe the `select()` wasn't doing anything of value.
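
For illustration only, here is a minimal sketch of the two query shapes described above, written against the arrow R package with dplyr. The dataset, the column names (`g`, `x`, `unused`) and the temporary path are made up for this example and are not taken from the issue. Whether the second form really scans extra columns may depend on the arrow version and on the code path involved, so treat this as a way to compare the plans yourself, not as a reproduction.

```r
library(arrow)
library(dplyr)

# Hypothetical toy dataset written to a temporary directory (not from the issue).
path <- tempfile()
df <- data.frame(
  g = c("a", "b", "a", "b"),
  x = c(1, 2, 3, 4),
  unused = c("w", "x", "y", "z")  # stands in for columns the aggregation never needs
)
write_dataset(df, path, format = "parquet")

# Explicit select() before the aggregation: only `g` and `x` are requested.
with_select <- open_dataset(path) |>
  select(g, x) |>
  group_by(g) |>
  summarize(total = sum(x))

# The same aggregation without the explicit select().
without_select <- open_dataset(path) |>
  group_by(g) |>
  summarize(total = sum(x))

# Print both plans for comparison. Per the comment above, the source node's
# options are not printed, so the two plans can look alike.
show_exec_plan(with_select)
show_exec_plan(without_select)

# Materialize the result; both forms should return the same aggregated values.
result <- with_select |> collect() |> arrange(g)
result_no_select <- without_select |> collect() |> arrange(g)
```

Both pipelines are expected to give the same answer. The difference the comment describes lies in how many columns the scan reads, which the printed plan does not reveal.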
