debrouwere commented on issue #44725:
URL: https://github.com/apache/arrow/issues/44725#issuecomment-2538848374

   I wonder if you meant lazily evaluated, in the sense that no data is read 
until `collect` is called?
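   
   For concreteness, something like this is what I have in mind (the dataset 
path and the `value` column are made up for illustration):
   
   ```r
   library(arrow)
   library(dplyr)
   
   # Nothing is read from disk yet: open_dataset() only inspects the files'
   # schema, and filter() just adds a step to a lazy query.
   # (`path/to/dataset` and `value` are made-up placeholders.)
   q <- open_dataset("path/to/dataset") |> filter(value > 0)
   
   # The data is actually read and materialized only when collect() is called.
   result <- collect(q)
   ```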
   
   With regard to tibbles and performance, my bad, I can see now that my 
comment was unclear. In R, I can take any tibble and write it to disk as a 
hive-style Parquet dataset with `write_dataset`. Writing a tibble to Parquet 
doesn't produce any errors, but I have noticed that reading the resulting 
dataset back is very slow and raises `invalid metadata$r` errors, which is 
what my original bug report was about. However, if I instead persist the 
dataset with `my_tibble |> as.data.frame() |> write_dataset(...)`, subsequent 
reads are much faster and don't raise `invalid metadata$r` errors. My guess is 
therefore that there is something suboptimal about how data from tibbles is 
converted to Parquet (a sketch of the two write paths follows below). To be 
completely clear, I am talking about read performance, not about whether 
analysis with a data.frame or a tibble is faster in R, which wouldn't really 
have anything to do with Parquet.
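   
   Here is a minimal sketch of the two write paths I mean (the tibble, the 
output paths, and the partition column are made up for illustration):
   
   ```r
   library(arrow)
   
   # A made-up tibble with a column to partition on.
   tb <- tibble::tibble(
     group = rep(c("a", "b"), each = 50),
     value = rnorm(100)
   )
   
   # Path 1: write the tibble directly. In my case, subsequent reads of a
   # dataset written this way were very slow and raised `invalid metadata$r`.
   tb |> write_dataset("dataset_from_tibble", partitioning = "group")
   
   # Path 2: strip it down to a plain data.frame first. Subsequent reads of
   # this dataset were fast and error-free.
   tb |> as.data.frame() |> write_dataset("dataset_from_df", partitioning = "group")
   ```
   
   If I understand correctly, the arrow package stores R-level attributes in 
the Parquet schema metadata under an `r` key, which is presumably the 
`metadata$r` the error message refers to, and a tibble carries more attributes 
than a plain data.frame.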

