suremarc commented on PR #7620:
URL: https://github.com/apache/arrow-datafusion/pull/7620#issuecomment-1750983578

   > @alamb @suremarc Sorry for the delay.
   > 
   > > making a Caching ObjectStore implementation?
   > 
   > As far as I have checked, remote store clients such as `hdfs` do not 
cache results on the client side. IMO doing this in `ObjectStore` may make it 
difficult to define the scope of the work (ensuring correctness); in DataFusion 
we can support it at the session level so that it does not affect others 🤔
   
   I agree it would be more difficult, having implemented some caching of my 
own recently. A full implementation would require dealing with 
invalidation/consistency, particularly for update operations, as well as 
supporting nontrivial APIs like `list_with_delimiter`. It would be nice if we 
could do it in `ObjectStore`, but it's unclear to me whether it can be done 
correctly.
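
To illustrate the invalidation problem, here is a minimal sketch of a listing cache in front of a toy in-memory store. `ToyStore` and `CachingStore` are hypothetical names made up for this example; they are not the `object_store` crate API. Writes that go through the wrapper invalidate the affected cached listings, but writes made by any other client are never seen, which is the consistency gap I mean.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy in-memory stand-in for a remote object store (hypothetical, not the
/// `object_store` crate). Counts how many listings reach the "remote".
#[derive(Default)]
struct ToyStore {
    objects: BTreeSet<String>,
    remote_lists: usize,
}

impl ToyStore {
    fn put(&mut self, path: &str) {
        self.objects.insert(path.to_string());
    }

    fn list(&mut self, prefix: &str) -> Vec<String> {
        self.remote_lists += 1;
        self.objects
            .iter()
            .filter(|p| p.starts_with(prefix))
            .cloned()
            .collect()
    }
}

/// Wrapper that caches `list` results per prefix.
#[derive(Default)]
struct CachingStore {
    inner: ToyStore,
    list_cache: HashMap<String, Vec<String>>,
}

impl CachingStore {
    fn list(&mut self, prefix: &str) -> Vec<String> {
        if let Some(hit) = self.list_cache.get(prefix) {
            return hit.clone();
        }
        let fresh = self.inner.list(prefix);
        self.list_cache.insert(prefix.to_string(), fresh.clone());
        fresh
    }

    /// A write through this wrapper drops every cached listing whose prefix
    /// covers the new path. Writes made by other clients are NOT detected.
    fn put(&mut self, path: &str) {
        self.inner.put(path);
        self.list_cache
            .retain(|prefix, _| !path.starts_with(prefix.as_str()));
    }
}
```

In this model a second `list("t/")` is served from the cache, a `put` through the wrapper forces the next listing to go back to the store, and an object added directly to `inner` stays invisible until something else invalidates that prefix.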
   
   On that note, I'd like to point out that this cache implementation won't 
avoid hitting object storage if your table has partition columns, because in 
that case DataFusion will instead call `list_with_delimiter` recursively, 
starting from the top-level path. It would still be an improvement, but I am 
not sure if this will speed up query planning as much as you want. Just wanted 
to make sure you're aware of this, as I had previously thought that simply 
caching `list_all_files` would avoid hitting object storage altogether.
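
A rough model of what I mean, as a self-contained sketch rather than DataFusion's actual code: `CountingStore` and `ListingCache` are hypothetical types, and the method names only mirror `list_all_files`, `list_with_delimiter`, and partition listing for readability. The flat listing is cached, while the partition walk goes level by level through `list_with_delimiter` and never consults that cache.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy store that counts every request reaching the "remote" (hypothetical).
#[derive(Default)]
struct CountingStore {
    objects: BTreeSet<String>,
    remote_calls: usize,
}

impl CountingStore {
    /// Flat listing of everything under `prefix`.
    fn list(&mut self, prefix: &str) -> Vec<String> {
        self.remote_calls += 1;
        self.objects
            .iter()
            .filter(|p| p.starts_with(prefix))
            .cloned()
            .collect()
    }

    /// One level only: (sub-"directories", files directly under `prefix`).
    fn list_with_delimiter(&mut self, prefix: &str) -> (Vec<String>, Vec<String>) {
        self.remote_calls += 1;
        let mut dirs = BTreeSet::new();
        let mut files = Vec::new();
        for path in self.objects.iter().filter(|p| p.starts_with(prefix)) {
            let rest = &path[prefix.len()..];
            match rest.find('/') {
                Some(i) => {
                    dirs.insert(format!("{}{}", prefix, &rest[..=i]));
                }
                None => files.push(path.clone()),
            }
        }
        (dirs.into_iter().collect(), files)
    }
}

#[derive(Default)]
struct ListingCache {
    store: CountingStore,
    flat: HashMap<String, Vec<String>>,
}

impl ListingCache {
    /// Cached flat listing (the unpartitioned path in this model).
    fn list_all_files(&mut self, prefix: &str) -> Vec<String> {
        if let Some(hit) = self.flat.get(prefix) {
            return hit.clone();
        }
        let fresh = self.store.list(prefix);
        self.flat.insert(prefix.to_string(), fresh.clone());
        fresh
    }

    /// Partitioned path in this model: recurse with `list_with_delimiter`
    /// from the top-level prefix. The flat cache is never consulted.
    fn list_partitions(&mut self, prefix: &str) -> Vec<String> {
        let (dirs, mut files) = self.store.list_with_delimiter(prefix);
        for dir in dirs {
            files.extend(self.list_partitions(&dir));
        }
        files
    }
}
```

With two partitions such as `t/year=2022/` and `t/year=2023/`, repeated `list_all_files("t/")` calls cost one remote request in total, whereas every `list_partitions("t/")` call costs three (one per directory level visited), cached or not.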


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
