cboettig commented on issue #33312:
URL: https://github.com/apache/arrow/issues/33312#issuecomment-1432485314

   @westonpace Thanks! Yeah, the timing I see is similar to the time it takes to list 
the contents of the bucket recursively (`s3$ls(recursive=TRUE)`, as you noted in 
https://github.com/apache/arrow/issues/34145), so that listing step, rather than 
schema unification, probably explains the additional overhead between the examples 
above.  I'll keep an eye on whatever you come up with in 
https://github.com/apache/arrow/issues/34213.  
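   
   For reference, this is roughly how I've been comparing the two timings; the bucket name and MINIO endpoint below are placeholders rather than our actual setup:
   
   ```r
   library(arrow)
   
   # Hypothetical public bucket on our MINIO host (placeholder values)
   s3 <- s3_bucket(
     "example-bucket",
     endpoint_override = "https://minio.example.org",
     anonymous = TRUE
   )
   
   # Time the recursive listing on its own...
   system.time(files <- s3$ls(recursive = TRUE))
   
   # ...and compare with open_dataset() on the bucket root, which has to
   # perform the same recursive walk before it can build the dataset.
   system.time(ds <- open_dataset(s3))
   ```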
   
   As you noted there, performance is much better when we can work in the same 
'datacenter' (i.e. have our MINIO host on a VM in the same datacenter as the 
compute), but we want to support our typical end user, who will usually be on a 
laptop and requesting only a small subset of the partitions.  In some cases we can 
write wrapper functions that call open_dataset() directly on the desired partition 
rather than on the dataset root; it feels hacky, but maybe that is indeed the best 
strategy(?)  It's fast, but not nearly as ergonomic as letting arrow + 
dplyr::filter() select those paths from the dataset root.  A rough sketch of that 
wrapper approach is below.  
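   
   Here the wrapper name, the partition column, and the hive-style `year=<value>` layout are just assumptions for illustration:
   
   ```r
   library(arrow)
   library(dplyr)
   
   # Hypothetical wrapper: instead of open_dataset() on the dataset root
   # followed by dplyr::filter(year == ...), open only the partition
   # directory the user asked for.
   open_partition <- function(bucket, year) {
     s3 <- s3_bucket(bucket,
                     endpoint_override = "https://minio.example.org",
                     anonymous = TRUE)
     # Note: because we open the hive-partitioned subdirectory directly,
     # the `year` column is implied by the path and is not part of the
     # schema; re-add it with mutate() if downstream code expects it.
     open_dataset(s3$path(paste0("year=", year))) |>
       mutate(year = year)
   }
   
   # Fast, but less ergonomic than filtering from the dataset root:
   ds <- open_partition("example-bucket", 2022)
   ds |> collect()
   ```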

