Re: [I] [R] filtering a dataset down to a single partition (dplyr interface) prior to collection still accesses other partitions, leading to slow reads [arrow]

via GitHub Wed, 11 Dec 2024 09:38:42 -0800


amoeba commented on issue #44725:
URL: https://github.com/apache/arrow/issues/44725#issuecomment-2536661060


   Hi @debrouwere, thanks for the issue.
   
   To my knowledge, `open_dataset` is evaluated eagerly, so it should take the 
same amount of time whether or not it's part of a dplyr pipeline. I should 
double-check this, but if it's true, it might be an area for improvement. What 
do you get if you run your code snippets against the same Dataset instance? For example:
   
   ```r
   library(arrow)
   library(dplyr)
   
   # Open the dataset once and reuse this instance for every timing
   ds <- open_dataset('build/pisa.rx')
   
   ds |>
     filter(country == 'Belgium', cycle == 2022) |>
     select(starts_with('w_')) |>
     collect()
   
   # And run other tests here to calculate timings
   ```
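   
   If it helps, here is a rough sketch of how the two steps could be timed 
separately, so that dataset discovery doesn't get counted as part of the 
filtered read. The toy data frame, the temporary directory, and the `score` 
column are placeholders I made up so the snippet runs on its own; swap in 
your real path. I'm also assuming hive-style partitioning on `country` and 
`cycle`, which may not match your layout.
   
   ```r
   library(arrow)
   library(dplyr)
   
   # Hypothetical stand-in data; the real dataset in this issue is 'build/pisa.rx'
   df <- data.frame(
     country = rep(c("Belgium", "France"), each = 4),
     cycle   = rep(c(2018L, 2022L), times = 4),
     w_1     = runif(8),
     w_2     = runif(8),
     score   = rnorm(8)
   )
   
   # Write a small partitioned dataset to a temporary directory (assumed layout)
   path <- tempfile("pisa_demo_")
   write_dataset(df, path, partitioning = c("country", "cycle"))
   
   # Step 1: time opening the dataset on its own
   t_open <- system.time(ds <- open_dataset(path))
   
   # Step 2: time the filtered read against the already-open Dataset
   t_query <- system.time(
     res <- ds |>
       filter(country == "Belgium", cycle == 2022) |>
       select(starts_with("w_")) |>
       collect()
   )
   
   print(t_open[["elapsed"]])
   print(t_query[["elapsed"]])
   ```
   
   On a toy dataset both numbers will be tiny, so this only shows the shape 
of the comparison. On your data, a large `t_open` next to a small `t_query` 
would suggest the time is going into opening the dataset rather than into 
reading the filtered partition.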
   
   Also, I didn't really understand your last comment about data.frames vs. 
tibbles and performance. Could you explain that a bit more?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

