Re: [I] [R] filtering a dataset down to a single partition (dplyr interface) prior to collection still accesses other partitions, leading to slow reads [arrow]

via GitHub Thu, 28 Nov 2024 12:56:02 -0800


debrouwere commented on issue #44725:
URL: https://github.com/apache/arrow/issues/44725#issuecomment-2506748135


   Okay, so it turns out that performance is much better across the board (and 
there is no more "invalid metadata$r") with a data.frame than with a tibble, 
and there is no longer much of a difference in performance between reading a 
partition directly and filtering on it before collecting. (For writing, it 
doesn't matter if it reads the data into a tibble.) That solves my immediate 
problem, but I'll leave this issue open because I imagine Arrow/Parquet is 
supposed to work with tibbles too?
   
   Also, for completeness, I did upgrade to Arrow 18 and that didn't help 
either.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
