nealrichardson commented on issue #43627:
URL: https://github.com/apache/arrow/issues/43627#issuecomment-2284259559

   I think I've figured it out. The query plans look the same, but they're 
not: the `0:SourceNode` doesn't show what options it has, and that conceals 
what's going on. (I thought I wrote an issue years ago about improving those 
print methods, but I can't find it now.) The old path did a `dplyr::select()` 
to subset the columns before doing the aggregation, and that projection gets 
pushed down into the SourceNode. But the `1:ProjectNode` projection, which 
selects the columns for the `2:GroupByNode` aggregation, doesn't get pushed 
down. So without that `select()`, we read a bunch of data from the Parquet 
files into memory and then immediately throw it away.
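   
   To illustrate the difference, here is a minimal sketch of the two paths. 
It is not the code from this issue: the `mtcars` dataset, the temporary path, 
and the column names are stand-ins chosen for the example, and the comments 
about what the scan reads restate the behavior described above rather than 
something the snippet measures.

```r
library(arrow)
library(dplyr)

# Assumption: a small stand-in dataset written to a temporary directory.
path <- tempfile()
write_dataset(mtcars, path, format = "parquet")

ds <- open_dataset(path)

# Explicit select() before the aggregation: per the explanation above,
# this projection is pushed down into the SourceNode, so only `cyl` and
# `mpg` should be read from the Parquet files.
with_select <- ds %>%
  select(cyl, mpg) %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg)) %>%
  collect()

# No select(): the ProjectNode still picks `cyl` and `mpg` for the
# aggregation, but (as described above) that projection is not pushed
# down, so the scan is not narrowed to those two columns.
without_select <- ds %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg)) %>%
  collect()
```

Both queries return the same aggregated values; the expected difference is 
only in how many columns the scan materializes.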
   
   I'll make a PR to add back that `select()`, and I'll also fix the earlier 
code comment, which was misleading and led me to believe that the `select()` 
wasn't doing anything of value.

