tustvold commented on PR #15018:
URL: https://github.com/apache/datafusion/pull/15018#issuecomment-2708822164

   So I've not had time to look in huge detail, and would echo Andrew's 
concerns around AsyncWrite and friends; ObjectStore is intentionally not 
formulated in terms of them.
   
   However, taking a step back, the proposed abstraction doesn't appear to 
operate at a higher level than ObjectStore or OpenDAL; it is still focused on 
shuffling lists of objects/files and passing opaque bytes around. It is unclear 
to me that introducing a third abstraction at this same level really unlocks 
many new use cases or optimisation opportunities. That seems a tough sell to 
me, given the potential disruption.
   
   The original ticket stated
   
   > The benefit is that users can implement innovative features like 
datafusion-storage-cudf or datafusion-storage-io_uring without being 
constrained by the current I/O abstraction of object-store or OpenDAL.
   
   However, it is unclear to me why either of these would not fit the 
ObjectStore interface, and therefore how datafusion-storage would need to 
differ in order to accommodate them. Ultimately either of these is going to 
need some bridge through to DF, as I doubt a `!Send` DataFusion is on the cards 
anytime soon, and such a bridge could just as easily be mapped to ObjectStore.
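
   As a rough illustration of the kind of bridge meant here, the sketch below 
keeps a thread-bound (`!Send`) backend on a dedicated thread and exposes a 
`Send + Clone` handle in front of it. Everything in it is hypothetical 
(`LocalBackend`, `BackendHandle`, `get_range`), it uses only the standard 
library, and it is blocking rather than async; a real bridge to ObjectStore 
would implement that trait on the handle instead.

```rust
use std::rc::Rc;
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a thread-bound backend (e.g. an io_uring ring).
/// The `Rc` field makes this type `!Send`.
struct LocalBackend {
    data: Rc<Vec<u8>>,
}

impl LocalBackend {
    /// Returns the bytes in `start..end`, clamped to the data length.
    fn read_range(&self, start: usize, end: usize) -> Vec<u8> {
        let end = end.min(self.data.len());
        let start = start.min(end);
        self.data[start..end].to_vec()
    }
}

/// A request carries its own reply channel.
struct ReadRequest {
    start: usize,
    end: usize,
    reply: mpsc::Sender<Vec<u8>>,
}

/// `Send + Clone` handle that forwards requests to the backend's thread.
#[derive(Clone)]
struct BackendHandle {
    requests: mpsc::Sender<ReadRequest>,
}

impl BackendHandle {
    /// Spawns the worker thread that owns the `!Send` backend.
    fn spawn(contents: Vec<u8>) -> Self {
        let (tx, rx) = mpsc::channel::<ReadRequest>();
        thread::spawn(move || {
            // The backend is built on this thread and never leaves it.
            let backend = LocalBackend { data: Rc::new(contents) };
            // The loop ends once every handle (sender) has been dropped.
            for req in rx {
                let _ = req.reply.send(backend.read_range(req.start, req.end));
            }
        });
        Self { requests: tx }
    }

    /// Blocking range read, callable from any thread.
    fn get_range(&self, start: usize, end: usize) -> Vec<u8> {
        let (reply, response) = mpsc::channel();
        self.requests
            .send(ReadRequest { start, end, reply })
            .expect("backend thread stopped");
        response.recv().expect("backend dropped the request")
    }
}
```

   A caller would create one handle with `BackendHandle::spawn(bytes)`, clone 
it freely across threads, and call `get_range(start, end)` on any clone.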
   
   > This would allow them to maintain useful features such as context 
management and add additional requirements to the trait while letting 
datafusion-storage-object-store and datafusion-storage-opendal handle the extra 
work.
   
   Is there perhaps some functionality that OpenDAL provides that can't be 
exposed via 
[`object_store_opendal`](https://opendal.apache.org/docs/object-store-opendal/object_store_opendal/), 
and that a new datafusion-storage abstraction would allow exposing? Could we 
add this to the ObjectStore interface?
   
   > With the growth of DF, we have to continuously add more features to 
object_store, making it increasingly difficult to compose, as described in 
https://github.com/apache/arrow-rs/issues/7171.
   
   The challenge here, and what I was getting at in the ticket, is that much of 
the functionality was being implemented at the ObjectStore boundary, where it 
might be better served being implemented either higher or lower in the stack. 
We've addressed the latter by introducing HttpClient in object_store. I see 
datafusion_storage as an opportunity to address the former. 
   
   To put this concretely, say I want to implement caching of Parquet files: 
I don't want to be caching raw bytes and byte ranges. Instead I want to be able 
to cache the metadata separately, and then perhaps have some internal data 
structure for quickly identifying row groups, etc.
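
   To make the contrast with byte-range caching concrete, here is a minimal 
sketch of a cache keyed by file path that stores decoded metadata, plus a 
lookup that uses per-row-group min/max statistics to pick candidate row groups. 
All types here (`RowGroupStats`, `FileMetadata`, `MetadataCache`) are 
hypothetical simplifications rather than the parquet crate's or DataFusion's 
actual types, and the single `i64` column is an assumption to keep it short.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical per-row-group statistics for a single i64 column.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupStats {
    num_rows: u64,
    min: i64,
    max: i64,
}

/// Hypothetical decoded file metadata (not raw footer bytes).
#[derive(Debug)]
struct FileMetadata {
    row_groups: Vec<RowGroupStats>,
}

/// Cache of decoded metadata, keyed by file path.
#[derive(Default)]
struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
}

impl MetadataCache {
    /// Returns the cached metadata for `path`, calling `load` only on a miss.
    /// The lock is held while loading, which keeps the sketch simple but
    /// would serialise concurrent loads in practice.
    fn get_or_load<F>(&self, path: &str, load: F) -> Arc<FileMetadata>
    where
        F: FnOnce() -> FileMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(meta) = entries.get(path) {
            return Arc::clone(meta);
        }
        let meta = Arc::new(load());
        entries.insert(path.to_string(), Arc::clone(&meta));
        meta
    }
}

/// Indices of row groups whose [min, max] range may contain `value`.
fn matching_row_groups(meta: &FileMetadata, value: i64) -> Vec<usize> {
    meta.row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| rg.min <= value && value <= rg.max)
        .map(|(index, _)| index)
        .collect()
}
```

   The point of the sketch is that the cached value is a structure the query 
layer can reason about directly, which a cache sitting at the byte level has no 
way to offer.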
   
   Similarly for CSV files, I might want the ability to cache file schemas, or 
metadata such as the number of rows, etc.
   
   This is the level that historically hasn't really existed in a coherent form 
in DF. There have been things like AsyncFileReaderFactory, etc., but they're 
kind of ad hoc, relatively hard to use, and not part of a coherent global 
design. It is possible/probable that the recent DataSource work is this 
abstraction already, but I haven't followed it closely enough to weigh in on 
this with any authority.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


