Re: [PR] [POC] feat: Add datafusion-storage [datafusion]

via GitHub Sun, 09 Mar 2025 05:03:08 -0700


tustvold commented on PR #15018:
URL: https://github.com/apache/datafusion/pull/15018#issuecomment-2708822164

I've not had time to look in huge detail, but I would echo Andrew's
concerns around AsyncWrite and friends; ObjectStore is intentionally not
formulated in terms of them.

However, taking a step back, the proposed abstraction doesn't appear to
operate at a higher level than ObjectStore or OpenDAL; it is still focused on
shuffling lists of objects/files and passing opaque bytes around. It is unclear
to me that introducing a third abstraction at this same level really unlocks
many new use-cases or optimisation opportunities. That seems a tough sell to me,
given the potential disruption.

The original ticket stated

> The benefit is that users can implement innovative features like
datafusion-storage-cudf or datafusion-storage-io_uring without being
constrained by the current I/O abstraction of object-store or OpenDAL.

However, it is unclear to me why either of these would not fit the
ObjectStore interface, and therefore how datafusion-storage would need to
differ in order to accommodate them? Ultimately either of these is going to
need some bridge through to DF, as I doubt a `!Send` DataFusion is on the cards
anytime soon, and this could easily be mapped to ObjectStore.
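To illustrate the kind of bridge meant here, a minimal sketch using only the standard library follows. Everything in it is hypothetical: `LocalBackend` stands in for a thread-bound (`!Send`) runtime such as an io_uring reactor, and `BridgeHandle` stands in for the `Send` type that would implement the real (async) ObjectStore trait. It is not the object_store API, only the shape of the mapping.

```rust
use std::collections::HashMap;
use std::rc::Rc;
use std::sync::mpsc;
use std::thread;

/// Hypothetical thread-bound backend. The `Rc` makes it `!Send`, standing in
/// for a runtime that must stay on the thread that created it.
struct LocalBackend {
    files: Rc<HashMap<String, Vec<u8>>>,
}

impl LocalBackend {
    /// Returns the bytes in `start..end`, or `None` if the path or range is invalid.
    fn read_range(&self, path: &str, start: usize, end: usize) -> Option<Vec<u8>> {
        self.files.get(path)?.get(start..end).map(|b| b.to_vec())
    }
}

/// One request crossing the thread boundary, with a channel for the reply.
struct ReadRequest {
    path: String,
    start: usize,
    end: usize,
    reply: mpsc::Sender<Option<Vec<u8>>>,
}

/// `Send` handle that forwards requests to the thread owning the backend.
/// A real bridge would implement the ObjectStore trait on a type like this.
#[derive(Clone)]
struct BridgeHandle {
    tx: mpsc::Sender<ReadRequest>,
}

impl BridgeHandle {
    fn spawn(files: HashMap<String, Vec<u8>>) -> Self {
        let (tx, rx) = mpsc::channel::<ReadRequest>();
        thread::spawn(move || {
            // The backend is built on this thread and never leaves it.
            let backend = LocalBackend { files: Rc::new(files) };
            // The loop ends once every handle (sender) has been dropped.
            for req in rx {
                let result = backend.read_range(&req.path, req.start, req.end);
                let _ = req.reply.send(result);
            }
        });
        BridgeHandle { tx }
    }

    /// Blocking for simplicity; an async bridge would await a oneshot channel instead.
    fn get_range(&self, path: &str, start: usize, end: usize) -> Option<Vec<u8>> {
        let (reply, response) = mpsc::channel();
        let req = ReadRequest { path: path.to_string(), start, end, reply };
        self.tx.send(req).ok()?;
        response.recv().ok()?
    }
}

/// Compile-time check that a type can cross threads.
fn assert_send<T: Send>() {}
```

Calling `assert_send::<BridgeHandle>()` compiles, whereas `assert_send::<LocalBackend>()` would not, which is the property the rest of DataFusion would rely on.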

> This would allow them to maintain useful features such as context
management and add additional requirements to the trait while letting
datafusion-storage-object-store and datafusion-storage-opendal handle the extra
work.

Is there perhaps some functionality that OpenDAL provides that can't be
exposed through
[`object_store_opendal`](https://opendal.apache.org/docs/object-store-opendal/object_store_opendal/),
but that a new datafusion-storage abstraction would allow exposing? Could we add
this to the ObjectStore interface instead?

> With the growth of DF, we have to continuously add more features to
object_store, making it increasingly difficult to compose, as described in
https://github.com/apache/arrow-rs/issues/7171.

The challenge here, and what I was getting at in the ticket, is that much of
the functionality was being implemented at the ObjectStore boundary, where it
might be better served being implemented either higher or lower in the stack.
We've addressed the latter by introducing HttpClient in object_store. I see
datafusion_storage as an opportunity to address the former.

To put this concretely, say I want to implement caching of Parquet
files. I don't want to be caching raw bytes and byte ranges. Instead I want to
be able to cache the metadata separately, and then perhaps have some internal data
structure for quickly identifying row groups, etc...

Similarly for CSV files, I might want the ability to cache file schemas, or
meta information about number of rows, etc...
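As a rough sketch of what format-aware caching could look like, assuming nothing about DataFusion's actual types: `FileMeta`, `FileKey`, and `MetaCache` below are all hypothetical names, and `FileMeta` is a stand-in for a decoded Parquet footer or an inferred CSV schema rather than any real crate's type. The point is only that the cache stores decoded structures keyed by file identity, not byte ranges.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical decoded metadata for one file (not the parquet crate's type).
#[derive(Debug, PartialEq)]
struct FileMeta {
    num_rows: u64,
    /// Byte offset of each row group, for locating them without re-reading the footer.
    row_group_offsets: Vec<u64>,
}

/// Cache key: path plus a version tag (for example an ETag), so that a
/// rewritten object does not serve stale metadata.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct FileKey {
    path: String,
    version: String,
}

/// Caches decoded metadata rather than raw bytes.
#[derive(Default)]
struct MetaCache {
    entries: Mutex<HashMap<FileKey, Arc<FileMeta>>>,
}

impl MetaCache {
    /// Returns the cached metadata, calling `load` only on a miss.
    /// The lock is held while loading, which keeps the sketch short but
    /// would serialise concurrent misses in a real implementation.
    fn get_or_load<F>(&self, key: &FileKey, load: F) -> Arc<FileMeta>
    where
        F: FnOnce() -> FileMeta,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(meta) = entries.get(key) {
            return Arc::clone(meta);
        }
        let meta = Arc::new(load());
        entries.insert(key.clone(), Arc::clone(&meta));
        meta
    }
}
```

A byte-range cache sitting at the ObjectStore boundary cannot offer this, because it has no notion of where a footer or row group starts and ends.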

This is the level that historically hasn't really existed in a coherent form
in DF. There have been things like AsyncFileReaderFactory, etc... but
they're kind of ad-hoc, relatively hard to use, and not part of a coherent
global design. It is possible/probable that the recent DataSource work is this
abstraction already, but I haven't followed it closely enough to weigh in on this
with any authority.
