alamb opened a new issue, #14: URL: https://github.com/apache/arrow-rs-object-store/issues/14
# Please describe what you are trying to do.

TLDR: let's combine forces rather than all reimplementing caching / chunking / etc in `object_store`!

The [`ObjectStore`](https://docs.rs/object_store/latest/object_store/trait.ObjectStore.html) trait is flexible, and it is common to compose a stack of `ObjectStore`s, with one wrapping an underlying store.

For example, the [`ThrottledStore`](https://docs.rs/object_store/latest/object_store/throttle/struct.ThrottledStore.html) and [`LimitStore`](https://docs.rs/object_store/latest/object_store/limit/struct.LimitStore.html) provided with the object store crate do exactly this:

```
┌──────────────────────────────┐
│        ThrottledStore        │
│(adds user configured delays) │
└──────────────────────────────┘
                ▲
                │
                │
┌──────────────────────────────┐
│      Inner ObjectStore       │
│   (for example, AmazonS3)    │
└──────────────────────────────┘
```

## Many Different Behaviors

There are many types of behaviors that can be implemented this way. Some examples I am aware of:

1. The [`ThrottledStore`](https://docs.rs/object_store/latest/object_store/throttle/struct.ThrottledStore.html) and [`LimitStore`](https://docs.rs/object_store/latest/object_store/limit/struct.LimitStore.html) provided with the object store crate
2. Run on a different tokio runtime (such as the [`DeltaIOStorageBackend`](https://github.com/delta-io/delta-rs/blob/e30ab7e366eb209718c87acb6974a815503181bc/crates/core/src/storage/mod.rs#L116-L120) in delta-rs from @ion-elgreco)
3. Limit the total size of any individual request (e.g. the `LimitedRequestSizeObjectStore` from https://github.com/apache/datafusion/issues/15067)
4. Break single large requests into multiple concurrent small requests ("chunking"). I think @crepererum is working on this at Influx
5. Cache the results of requests locally using memory / disk (see [ObjectStoreMemCache](https://github.com/influxdata/influxdb3_core/tree/main/object_store_mem_cache) in influxdb3_core, and [this one](https://github.com/slatedb/slatedb/blob/main/src%2Fcached_object_store%2Fobject_store.rs) in slatedb from @criccomini; thanks @ion-elgreco for the pointer)
6. Collect statistics / traces and report metrics (see [ObjectStoreMetrics](https://github.com/influxdata/influxdb3_core/tree/main/object_store_metrics) in influxdb3_core)
7. Visualize object store requests over time

## Desired behavior is varied and application specific

Also, depending on the needs of the particular app, the ideal behavior / policy is likely different.

For example,

1. In the case of https://github.com/apache/datafusion/issues/15067, splitting one large request into several small requests made in series is likely the desired approach (to maximize the chance they succeed)
2. If you are trying to maximize read bandwidth in a cloud server setting, splitting up ("chunking") large requests into several parallel ones may be desired
3. If you are trying to minimize costs (for example, doing bulk reorganizations / compactions on historical data that are not latency sensitive), using a single request for large objects (what is done today) might be desired
4. Maybe you want to adapt more dynamically to network and object store conditions, as described in [Exploiting Cloud Object Storage for High-Performance Analytics](https://vldb.org/pvldb/vol16/p2769-durner.pdf)

So the point is that I don't think any one individual policy will work for all use cases (though we can certainly discuss changing the default policy).

Since `ObjectStore` is already composable, I already see projects implementing these types of things independently (for example, delta-rs and influxdb_iox both have cross-runtime object stores, and @mildbyte from splitgraph implemented some sort of visualization of object store requests over time).

I believe this is similar to the OpenDAL [concept of `layers`](https://docs.rs/opendal/latest/opendal/#compose-layers), but @Xuanwo please correct me if I am wrong.

# Desired Solution

I would like it to be easier for users of object_store to access such features without having to implement custom wrappers in parallel, independently.

# Alternatives

## New `object_store_util` crate

One alternative is to make a new crate, named `object_store_util` or similar, mirroring [`futures-util`](https://crates.io/crates/futures-util) and [`tokio-util`](https://crates.io/crates/tokio-util), that has a bunch of these ObjectStore combinators.

This could be housed outside of the apache organization, but I think it would be most valuable for the community if it was inside.

## Add additional policies to provided implementations

An alternative is to implement more sophisticated default implementations (for example, add more options to the [`AmazonS3`](https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3.html) implementation).

One upside of this approach is that it could take advantage of implementation specific features.

One downside is additional code and configuration complexity, especially as the different strategies are all applicable to multiple stores (e.g. GCP, S3 and Azure).
Another downside is that specifying the policy might be complex (for example, specifying concurrency along with chunking, and under what circumstances each should be used).

**Additional context**

- https://github.com/apache/datafusion/issues/15067
- https://github.com/apache/datafusion/pull/14286
- https://github.com/delta-io/delta-rs/issues/2595
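
To make the wrapper composition discussed in this issue a bit more concrete, here is a minimal sketch of how such combinators stack. It does not use the real `object_store` API: `RangeStore` is a hypothetical, synchronous stand-in for the async `ObjectStore` trait, and `MemStore`, `CountingStore` and `ChunkedStore` are names made up for illustration. The chunking shown is the serial variant (several small requests made one after another), not the concurrent one.

```rust
use std::ops::Range;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical, simplified stand-in for the real (async) `ObjectStore` trait.
trait RangeStore {
    fn get_range(&self, path: &str, range: Range<usize>) -> Result<Vec<u8>, String>;
}

/// Toy "inner" store holding a single in-memory object (assumption for the demo).
struct MemStore {
    data: Vec<u8>,
}

impl RangeStore for MemStore {
    fn get_range(&self, _path: &str, range: Range<usize>) -> Result<Vec<u8>, String> {
        self.data
            .get(range.clone())
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| format!("range {range:?} out of bounds"))
    }
}

/// Wrapper that counts the requests reaching the wrapped store (a tiny "metrics" layer).
struct CountingStore<S> {
    inner: S,
    requests: AtomicUsize,
}

impl<S: RangeStore> RangeStore for CountingStore<S> {
    fn get_range(&self, path: &str, range: Range<usize>) -> Result<Vec<u8>, String> {
        self.requests.fetch_add(1, Ordering::Relaxed);
        self.inner.get_range(path, range)
    }
}

/// Wrapper that splits one large request into serial requests of at most
/// `max_request_size` bytes each.
struct ChunkedStore<S> {
    inner: S,
    max_request_size: usize,
}

impl<S: RangeStore> RangeStore for ChunkedStore<S> {
    fn get_range(&self, path: &str, range: Range<usize>) -> Result<Vec<u8>, String> {
        // Guard against a zero chunk size, which would otherwise never make progress.
        let step = self.max_request_size.max(1);
        let mut out = Vec::with_capacity(range.len());
        let mut start = range.start;
        while start < range.end {
            let end = range.end.min(start + step);
            out.extend(self.inner.get_range(path, start..end)?);
            start = end;
        }
        Ok(out)
    }
}

fn main() {
    // Stack: ChunkedStore -> CountingStore -> MemStore
    let inner = MemStore { data: (0..100u8).collect() };
    let counting = CountingStore { inner, requests: AtomicUsize::new(0) };
    let store = ChunkedStore { inner: counting, max_request_size: 32 };

    let bytes = store.get_range("some/object", 10..90).unwrap();
    // prints "read 80 bytes using 3 inner requests"
    println!(
        "read {} bytes using {} inner requests",
        bytes.len(),
        store.inner.requests.load(Ordering::Relaxed)
    );
}
```

Because each wrapper only depends on the trait, the policy (chunk size, serial vs. concurrent, caching, metrics) can be chosen per application by stacking different layers, which is the kind of reuse a shared crate of combinators would make easier.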