tustvold commented on issue #7251: URL: https://github.com/apache/arrow-rs/issues/7251#issuecomment-2708211530
Thank you for starting this discussion, I think we should definitely provide more utilities/primitives in this space.

> The [ThrottledStore](https://docs.rs/object_store/latest/object_store/throttle/struct.ThrottledStore.html) and [LimitStore](https://docs.rs/object_store/latest/object_store/limit/struct.LimitStore.html) provided with the object store crate

FWIW these should probably be deprecated and re-implemented at the HttpClient level.

> Collect statistics / traces and report metrics (see [ObjectStoreMetrics](https://github.com/influxdata/influxdb3_core/tree/main/object_store_metrics) in influxdb3_core)
> Runs on a different tokio runtime (such as the [DeltaIOStorageBackend](https://github.com/delta-io/delta-rs/blob/e30ab7e366eb209718c87acb6974a815503181bc/crates/core/src/storage/mod.rs#L116-L120) in delta rs from @ion-elgreco)
> Visualization of object store requests over time

Now that we have the HttpClient abstraction, I think that is the level at which I would encourage implementing most of these.

> Limit the total size of any individual request (e.g. the LimitedRequestSizeObjectStore from https://github.com/apache/datafusion/issues/15067)
> Break single large requests into multiple concurrent small requests ("chunking") - @crepererum is working on this I think in influx

This feels like something better built into some sort of TransferManager that sits on top of the ObjectStore API, as opposed to baking it in at the ObjectStore level. Perhaps in a similar vein to [BufWriter](https://docs.rs/object_store/latest/object_store/buffered/struct.BufWriter.html).
This would, for example, allow registering a single ObjectStore, but then having different IO configurations for different areas of the stack. It would also potentially allow for greater concurrency, as the ObjectStore API has no mechanism by which chunks fetched in parallel could be returned out of order. This would be especially useful when downloading files to disk, as it avoids needing to hold chunks in memory unnecessarily. See #5277 for some prior discussion.

> Add additional policies to provided implementations

FWIW all the first-party implementations share a lot of the same underlying logic, e.g. with things like GetClient, and so it may actually not be all that bad.
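
To make the TransferManager idea above a little more concrete, here is a rough, std-only sketch of chunking a ranged download and consuming the chunks in whatever order they complete. Nothing here is part of the object_store API: `split_range`, `fetch_range`, and `download_chunked` are hypothetical names, `fetch_range` is a stand-in for a ranged GET, and plain threads are used instead of async purely to keep the example self-contained.

```rust
use std::ops::Range;
use std::sync::mpsc;
use std::thread;

/// Split `range` into consecutive sub-ranges of at most `chunk_size` bytes.
fn split_range(range: Range<u64>, chunk_size: u64) -> Vec<Range<u64>> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut chunks = Vec::new();
    let mut start = range.start;
    while start < range.end {
        let end = start.saturating_add(chunk_size).min(range.end);
        chunks.push(start..end);
        start = end;
    }
    chunks
}

/// Hypothetical stand-in for a ranged GET: copies bytes out of an in-memory object.
fn fetch_range(object: &[u8], range: Range<u64>) -> Vec<u8> {
    object[range.start as usize..range.end as usize].to_vec()
}

/// Fetch `range` of `object` as concurrent chunks and write each chunk at its
/// offset as soon as it arrives, regardless of completion order.
/// Assumption: one thread per chunk, with no concurrency limit, for brevity.
fn download_chunked(object: &[u8], range: Range<u64>, chunk_size: u64) -> Vec<u8> {
    let base = range.start;
    let mut out = vec![0u8; (range.end - range.start) as usize];
    let (tx, rx) = mpsc::channel::<(u64, Vec<u8>)>();

    thread::scope(|s| {
        for chunk in split_range(range, chunk_size) {
            let tx = tx.clone();
            s.spawn(move || {
                let data = fetch_range(object, chunk.clone());
                tx.send((chunk.start, data)).expect("receiver dropped");
            });
        }
        // Drop the original sender so the receive loop ends once all workers finish.
        drop(tx);

        // Chunks arrive in completion order; the offset tells us where each belongs.
        for (offset, data) in rx {
            let at = (offset - base) as usize;
            out[at..at + data.len()].copy_from_slice(&data);
        }
    });
    out
}
```

In a real implementation the destination would more likely be a file written at the chunk's offset than an in-memory buffer, which is where consuming chunks out of order avoids holding them in memory.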
