suremarc commented on PR #7620: URL: https://github.com/apache/arrow-datafusion/pull/7620#issuecomment-1750983578
> @alamb @suremarc Sorry for the delay.
>
> > making a Caching ObjectStore implementation?
>
> As i have check the some remote store client like `hdfs` there are no cache result in client side, IMO doing this is `ObjectStore` may be difficult define the work scope (ensure correctness), in datafusion we can support in in session level to not effect others 🤔

I agree it would be more difficult, having implemented some caching of my own recently. A full implementation would require dealing with invalidation/consistency, particularly for update operations, as well as supporting nontrivial APIs like `list_with_delimiter`. It would be nice if we could do it there but it's unclear to me if it can be done perfectly.

On that note, I'd like to point out that this cache implementation won't avoid hitting object storage if your table has partition columns, because in that case DataFusion will instead call `list_with_delimiter` recursively, starting from the top-level path. It would still be an improvement, but I am not sure if this will speed up query planning as much as you want.

Just wanted to make sure you're aware of this, as I previously had thought that simply caching `list_all_files` would avoid hitting object storage altogether.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
