tustvold commented on PR #15018: URL: https://github.com/apache/datafusion/pull/15018#issuecomment-2708822164
So I've not had time to look in huge detail, and would echo Andrew's concerns around AsyncWrite and friends; ObjectStore is intentionally not formulated in terms of them.

However, taking a step back, the proposed abstraction doesn't appear to operate at a higher level than ObjectStore or OpenDAL; it is still focused on shuffling lists of objects/files and passing opaque bytes around. It is unclear to me that introducing a third abstraction at this same level really unlocks many new use-cases / optimisation opportunities. That seems a tough sell to me, given the potential disruption.

The original ticket stated

> The benefit is that users can implement innovative features like datafusion-storage-cudf or datafusion-storage-io_uring without being constrained by the current I/O abstraction of object-store or OpenDAL.

However, it is unclear to me why either of these would not fit the ObjectStore interface, and therefore how datafusion-storage would need to differ in order to accommodate them. Ultimately either of these is going to need some bridge through to DF, as I doubt a `!Send` DataFusion is on the cards anytime soon, and this could easily be mapped to ObjectStore.

> This would allow them to maintain useful features such as context management and add additional requirements to the trait while letting datafusion-storage-object-store and datafusion-storage-opendal handle the extra work.

Is there perhaps some functionality that OpenDAL provides that can't be exposed through [`object_store_opendal`](https://opendal.apache.org/docs/object-store-opendal/object_store_opendal/), and that a new datafusion-storage abstraction would allow exposing? Could we add this to the ObjectStore interface?

> With the growth of DF, we have to continuously add more features to object_store, making it increasingly difficult to compose, as described in https://github.com/apache/arrow-rs/issues/7171.

The challenge here, and what I was getting at in the ticket, is that much of the functionality was being implemented at the ObjectStore boundary, where it might be better served being implemented either higher or lower in the stack. We've addressed the latter by introducing HttpClient in object_store. I see datafusion_storage as an opportunity to address the former.

To put this concretely, say I want to implement caching of parquet files: I don't want to be caching raw bytes and byte ranges. Instead I want to be able to cache the metadata separately, and then perhaps have some internal data structure for quickly identifying row groups, etc... Similarly for CSV files, I might want the ability to cache file schemas, or meta information about number of rows, etc...

This is the level that historically hasn't really existed in a coherent form in DF. There have existed things like AsyncFileReaderFactory, etc... but they're kind of ad-hoc, relatively hard to use, and not part of a coherent global design.

It is possible/probable that the recent DataSource work is this abstraction already, but I haven't followed it closely enough to weigh in on this with any authority.
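
As a rough illustration of the kind of format-aware caching described above, here is a minimal sketch. Everything in it is hypothetical: `RowGroupInfo`, `FileMeta`, `MetaCache`, `get_or_load` and `row_groups_matching` are made-up names, not existing DataFusion, parquet or object_store APIs, and the single `i64` min/max pair per row group stands in for real column statistics. The point is only that the cache holds decoded metadata keyed by path, rather than raw bytes and byte ranges.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical, heavily simplified per-row-group statistics.
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupInfo {
    pub num_rows: u64,
    pub min: i64,
    pub max: i64,
}

/// Hypothetical stand-in for decoded file-level metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct FileMeta {
    pub row_groups: Vec<RowGroupInfo>,
}

impl FileMeta {
    /// Indices of row groups whose [min, max] range could contain `value`.
    pub fn row_groups_matching(&self, value: i64) -> Vec<usize> {
        self.row_groups
            .iter()
            .enumerate()
            .filter(|(_, rg)| rg.min <= value && value <= rg.max)
            .map(|(i, _)| i)
            .collect()
    }
}

/// Cache of decoded metadata keyed by object path (assumed design).
#[derive(Default)]
pub struct MetaCache {
    entries: RwLock<HashMap<String, Arc<FileMeta>>>,
}

impl MetaCache {
    /// Return the cached metadata for `path`, calling `load` only on a miss.
    /// `load` stands in for whatever fetches and decodes the file footer.
    pub fn get_or_load<F>(&self, path: &str, load: F) -> Arc<FileMeta>
    where
        F: FnOnce() -> FileMeta,
    {
        if let Some(hit) = self.entries.read().unwrap().get(path) {
            return Arc::clone(hit);
        }
        // Miss: decode outside the lock, then insert. If another thread
        // inserted first, keep its entry and drop ours.
        let meta = Arc::new(load());
        let mut entries = self.entries.write().unwrap();
        let cached = Arc::clone(entries.entry(path.to_string()).or_insert(meta));
        cached
    }
}
```

A reader built on something like this would consult `row_groups_matching` before issuing any byte-range requests, which is a decision a byte-level cache sitting at the ObjectStore boundary cannot make.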