alamb opened a new issue, #14:
URL: https://github.com/apache/arrow-rs-object-store/issues/14

   # Please describe what you are trying to do.
   TLDR: let's combine forces rather than all reimplementing caching / chunking / etc. in `object_store`!
   
   The [`ObjectStore`](https://docs.rs/object_store/latest/object_store/trait.ObjectStore.html) trait is flexible, and it is common to compose a stack of `ObjectStore` instances, with one wrapping an underlying store.
   
   For example, the [`ThrottledStore`](https://docs.rs/object_store/latest/object_store/throttle/struct.ThrottledStore.html) and [`LimitStore`](https://docs.rs/object_store/latest/object_store/limit/struct.LimitStore.html) provided with the object_store crate do exactly this:
   
   ```
   ┌──────────────────────────────┐
   │        ThrottledStore        │
   │(adds user configured delays) │
   └──────────────────────────────┘
                   ▲               
                   │               
                   │               
   ┌──────────────────────────────┐
   │      Inner ObjectStore       │
   │   (for example, AmazonS3)    │
   └──────────────────────────────┘
   ```
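   The wrapper pattern is plain delegation: the outer store implements the same trait and forwards each call to the inner store, adding behavior around the delegated call. Here is a minimal sketch against a simplified, synchronous stand-in trait (the real `ObjectStore` trait is async and richer; `Store`, `MemStore`, and `CountingStore` below are illustrative names, not the crate's API):

   ```rust
   use std::collections::HashMap;
   use std::sync::atomic::{AtomicUsize, Ordering};

   // Simplified, synchronous stand-in for the async `ObjectStore` trait.
   trait Store {
       fn get(&self, path: &str) -> Option<Vec<u8>>;
   }

   // A plain in-memory store, standing in for e.g. AmazonS3.
   struct MemStore(HashMap<String, Vec<u8>>);

   impl Store for MemStore {
       fn get(&self, path: &str) -> Option<Vec<u8>> {
           self.0.get(path).cloned()
       }
   }

   // A wrapper that counts requests before delegating to the inner store.
   // ThrottledStore / LimitStore follow the same shape, adding delays or
   // concurrency limits around the delegated call instead of a counter.
   struct CountingStore<S: Store> {
       inner: S,
       requests: AtomicUsize,
   }

   impl<S: Store> Store for CountingStore<S> {
       fn get(&self, path: &str) -> Option<Vec<u8>> {
           self.requests.fetch_add(1, Ordering::Relaxed);
           self.inner.get(path) // delegate to the wrapped store
       }
   }

   fn main() {
       let mut data = HashMap::new();
       data.insert("a.parquet".to_string(), vec![1, 2, 3]);
       let store = CountingStore {
           inner: MemStore(data),
           requests: AtomicUsize::new(0),
       };

       assert_eq!(store.get("a.parquet"), Some(vec![1, 2, 3]));
       assert_eq!(store.get("missing"), None);
       assert_eq!(store.requests.load(Ordering::Relaxed), 2);
   }
   ```

   Because the wrapper implements the same trait, stacks compose freely: a caching wrapper can wrap a metrics wrapper that wraps S3, with no changes to any layer.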
   
   ## Many Different Behaviors
   There are many types of behavior that can be implemented this way. Some examples I am aware of:
   1. The [`ThrottledStore`](https://docs.rs/object_store/latest/object_store/throttle/struct.ThrottledStore.html) and [`LimitStore`](https://docs.rs/object_store/latest/object_store/limit/struct.LimitStore.html) provided with the object_store crate
   2. Run requests on a different tokio runtime (such as the [`DeltaIOStorageBackend`](https://github.com/delta-io/delta-rs/blob/e30ab7e366eb209718c87acb6974a815503181bc/crates/core/src/storage/mod.rs#L116-L120) in delta-rs from @ion-elgreco)
   3. Limit the total size of any individual request (e.g. the `LimitedRequestSizeObjectStore` from https://github.com/apache/datafusion/issues/15067)
   4. Break single large requests into multiple concurrent small requests ("chunking") - @crepererum is working on this I think in influx
   5. Cache results of requests locally using memory / disk (see [ObjectStoreMemCache](https://github.com/influxdata/influxdb3_core/tree/main/object_store_mem_cache) in influxdb3_core, and [this one](https://github.com/slatedb/slatedb/blob/main/src%2Fcached_object_store%2Fobject_store.rs) in slatedb @criccomini; thanks @ion-elgreco for the pointer)
   6. Collect statistics / traces and report metrics (see [ObjectStoreMetrics](https://github.com/influxdata/influxdb3_core/tree/main/object_store_metrics) in influxdb3_core)
   7. Visualize object store requests over time
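   Several of these behaviors boil down to range arithmetic. For example, the "chunking" behavior above amounts to splitting one requested byte range into bounded sub-ranges that can then be fetched concurrently (or in series) and reassembled. A hedged sketch; the function name and chunk size are made up for illustration:

   ```rust
   use std::ops::Range;

   /// Split one large byte range into sub-ranges of at most `chunk_size` bytes.
   /// A chunking wrapper could issue these as separate GET requests and
   /// reassemble the results in order.
   fn split_range(range: Range<u64>, chunk_size: u64) -> Vec<Range<u64>> {
       assert!(chunk_size > 0);
       let mut chunks = Vec::new();
       let mut start = range.start;
       while start < range.end {
           let end = (start + chunk_size).min(range.end);
           chunks.push(start..end);
           start = end;
       }
       chunks
   }

   fn main() {
       // A 10 MiB read split into 4 MiB chunks: 4 MiB + 4 MiB + 2 MiB.
       let chunks = split_range(0..10 * 1024 * 1024, 4 * 1024 * 1024);
       assert_eq!(chunks.len(), 3);
       assert_eq!(chunks[2], 8 * 1024 * 1024..10 * 1024 * 1024);
   }
   ```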
   
   ## Desired behavior is varied and application specific
   
   Also, depending on the needs of the particular app, the ideal behavior / 
policy is likely different. 
   
   For example:
   1. In the case of https://github.com/apache/datafusion/issues/15067, splitting one large request into several small requests made in series is likely the desired approach (maximize the chance they succeed)
   2. If you are trying to maximize read bandwidth in a cloud server setting, splitting up ("chunking") large requests into several parallel ones may be desired
   3. If you are trying to minimize costs (for example, doing bulk reorganizations / compactions on historical data that are not latency sensitive), using a single request for large objects (what is done today) might be desired
   4. Maybe you want to adapt more dynamically to network and object store conditions, as described in [Exploiting Cloud Object Storage for High-Performance Analytics](https://vldb.org/pvldb/vol16/p2769-durner.pdf)
   
   So the point is that I don't think any one policy will work for all use cases (though we can certainly discuss changing the default policy).
   
   Since `ObjectStore` is already composable, I already see projects implementing these types of things independently (for example, delta-rs and influxdb_iox both have cross-runtime object stores, and @mildbyte from splitgraph implemented some sort of visualization of object store requests over time).
   
   I believe this is similar to the OpenDAL [concept of `layers`](https://docs.rs/opendal/latest/opendal/#compose-layers), but @Xuanwo please correct me if I am wrong.
   
   # Desired Solution
   
   I would like it to be easier for users of object_store to access such features without each having to implement custom wrappers independently.
   
   
   # Alternatives
   
   ## New `object_store_util` crate
   One alternative is to make a new crate, named `object_store_util` or similar, mirroring [`futures-util`](https://crates.io/crates/futures-util) and [`tokio-util`](https://crates.io/crates/tokio-util), that contains a collection of these ObjectStore combinators.
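   One way such a crate could expose combinators is through an extension trait with a blanket implementation, the way `futures-util`'s `StreamExt` adds methods to every `Stream`. Everything below is hypothetical API sketching (invented names, and again a simplified synchronous stand-in rather than the real async `ObjectStore` trait):

   ```rust
   use std::cell::Cell;

   // Simplified stand-in for the real async `ObjectStore` trait.
   trait Store {
       fn get(&self, path: &str) -> Option<Vec<u8>>;
   }

   // Hypothetical combinator: caps how many requests a store will serve.
   struct MaxRequests<S> {
       inner: S,
       remaining: Cell<usize>,
   }

   impl<S: Store> Store for MaxRequests<S> {
       fn get(&self, path: &str) -> Option<Vec<u8>> {
           if self.remaining.get() == 0 {
               return None; // budget exhausted; a real impl would return an error
           }
           self.remaining.set(self.remaining.get() - 1);
           self.inner.get(path)
       }
   }

   // Hypothetical `StoreExt`, mirroring futures-util's `StreamExt`: a blanket
   // impl so every store picks up the combinator methods for free.
   trait StoreExt: Store + Sized {
       fn max_requests(self, n: usize) -> MaxRequests<Self> {
           MaxRequests { inner: self, remaining: Cell::new(n) }
       }
   }
   impl<S: Store> StoreExt for S {}

   // A trivial store that always returns the same bytes.
   struct Fixed(Vec<u8>);
   impl Store for Fixed {
       fn get(&self, _: &str) -> Option<Vec<u8>> {
           Some(self.0.clone())
       }
   }

   fn main() {
       // Combinators would chain like `stream.map(..).buffered(..)` in futures-util.
       let store = Fixed(vec![42]).max_requests(1);
       assert_eq!(store.get("a"), Some(vec![42]));
       assert_eq!(store.get("a"), None); // budget used up
   }
   ```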
   
   This could be housed outside of the Apache organization, but I think it would be most valuable for the community if it lived inside it.
   
   ## Add additional policies to the provided implementations
   An alternative is to implement more sophisticated behavior in the default implementations (for example, add more options to the [`AmazonS3`](https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3.html) implementation).
   
   One upside of this approach is that it could take advantage of implementation-specific features.
   
   One downside is additional code and configuration complexity, especially as the different strategies are applicable to multiple stores (e.g. GCP, S3 and Azure). Another downside is that specifying the policy might be complex (for example, specifying concurrency along with chunking, and under what circumstances each should be used).
   
   
   **Additional context**
   - https://github.com/apache/datafusion/issues/15067
   - https://github.com/apache/datafusion/pull/14286
   - https://github.com/delta-io/delta-rs/issues/2595

