clbarnes commented on issue #4611: URL: https://github.com/apache/arrow-rs/issues/4611#issuecomment-1842891314
> Perhaps you could expand upon why you do not know the sizes of the files

I mentioned our use case in another issue; copied below

> As part of the [zarr](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html) project, we plan to store large tensors on a variety of backends (local/HTTP/object store), which are chunked into many separate files/objects. As part of the [sharding](https://zarr-specs.readthedocs.io/en/latest/v3/codecs/sharding-indexed/v1.0.html) specification, each chunk (=shard) could contain many sub-chunks which are independently encoded and then concatenated. We'd want to read a footer to find the byte addresses of sub-chunks (see https://github.com/apache/arrow-rs/issues/4611), and then read (possibly multiple) byte ranges from the shard.
>
> We don't want to list all existing chunks ahead of time as there could easily be many millions, and this could even change under our feet if we're writing the tensor as we go. As chunks may be compressed with arbitrary codecs, we can't predict how many bytes they'll be even if we know how large the chunks are; we just need to read the footer (which indexes sub-chunks) so that we then know which bits of the object to read.

I suppose in this use case we never need to read the suffix at the same time as the rest of the chunk, so we could have a separate method for suffix-getting with a default implementation of using a HEAD then GET which is documented as possibly being slow.

> I dunno, generally the approach of this crate is to encourage people towards patterns that behave equally well across all backends, as opposed to ones that will have store-specific performance pitfalls.

Patterns, yes, but I hope we've demonstrated that sometimes people actually need a suffix. All stores can do it (with 2 requests), some stores can just do it better (with 1) - should we refuse to use optimisations which are only available to certain stores?
If people already know the length (from listing or whatever), then they don't need to use the method documented as being possibly slow.

> GetOptionsExt is crate private

Ah, yes, that's unfortunate. So 3rd party stores currently just wrangle their own options?

I think the minimal-impact course is to keep everything as it is and just add something like

```rust
pub trait ObjectStore: std::fmt::Display + Send + Sync + Debug + 'static {
    ...

    /// Get the last `nbytes` of an object.
    ///
    /// If the object size is not known, the default implementation first
    /// finds out with a HEAD request.
    /// Stores which support suffix requests directly should override this behaviour.
    async fn get_suffix(
        &self,
        location: &Path,
        nbytes: usize,
        object_size: Option<usize>,
    ) -> Result<GetResult> {
        // if size is None, find out with a head request
        // then do self.get_range
    }
}
```

Instantly works for everyone, the performance concerns are well-documented, there's an ergonomic path for people who need a suffix and already know the size, and an easy optimisation path for stores which do support it.
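
To make the proposed default concrete, here is a minimal, synchronous sketch of the HEAD-then-GET fallback and a store overriding it. Everything in it is a hypothetical stand-in rather than the `object_store` API: the `SuffixStore` trait, the in-memory `MemoryStore`, the `NativeSuffixStore` wrapper, and the request counter are invented for illustration, and the real trait is async and uses `Path` and `GetResult` instead of `&str` and `Vec<u8>`.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical, synchronous stand-in for the async `ObjectStore` trait.
trait SuffixStore {
    /// Object size in bytes (stands in for a HEAD request).
    fn head(&self, location: &str) -> Result<usize, String>;

    /// Bytes in `range` (stands in for a ranged GET).
    fn get_range(&self, location: &str, range: Range<usize>) -> Result<Vec<u8>, String>;

    /// Last `nbytes` of the object. If `object_size` is `None`, the default
    /// first asks for the size, so it costs two requests instead of one.
    fn get_suffix(
        &self,
        location: &str,
        nbytes: usize,
        object_size: Option<usize>,
    ) -> Result<Vec<u8>, String> {
        let size = match object_size {
            Some(size) => size,
            None => self.head(location)?,
        };
        // Clamp so that asking for more than the object holds returns all of it.
        let start = size.saturating_sub(nbytes);
        self.get_range(location, start..size)
    }
}

/// In-memory store that counts how many "requests" it has served.
struct MemoryStore {
    objects: HashMap<String, Vec<u8>>,
    requests: Cell<usize>,
}

impl MemoryStore {
    fn new() -> Self {
        Self { objects: HashMap::new(), requests: Cell::new(0) }
    }

    fn put(&mut self, location: &str, data: &[u8]) {
        self.objects.insert(location.to_string(), data.to_vec());
    }

    /// Every lookup counts as one round trip to the backend.
    fn lookup(&self, location: &str) -> Result<&Vec<u8>, String> {
        self.requests.set(self.requests.get() + 1);
        self.objects.get(location).ok_or_else(|| format!("not found: {location}"))
    }
}

impl SuffixStore for MemoryStore {
    fn head(&self, location: &str) -> Result<usize, String> {
        Ok(self.lookup(location)?.len())
    }

    fn get_range(&self, location: &str, range: Range<usize>) -> Result<Vec<u8>, String> {
        let data = self.lookup(location)?;
        data.get(range.clone())
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| format!("range {range:?} out of bounds"))
    }
}

/// Stand-in for a backend with native suffix requests: one round trip.
struct NativeSuffixStore(MemoryStore);

impl SuffixStore for NativeSuffixStore {
    fn head(&self, location: &str) -> Result<usize, String> {
        self.0.head(location)
    }

    fn get_range(&self, location: &str, range: Range<usize>) -> Result<Vec<u8>, String> {
        self.0.get_range(location, range)
    }

    fn get_suffix(&self, location: &str, nbytes: usize, _size: Option<usize>) -> Result<Vec<u8>, String> {
        let data = self.0.lookup(location)?;
        Ok(data[data.len().saturating_sub(nbytes)..].to_vec())
    }
}

fn main() {
    let mut store = MemoryStore::new();
    store.put("shard/0.0", b"sub-chunk-bytes|INDEX");

    // Size unknown: the default pays for a HEAD and then a GET.
    let footer = store.get_suffix("shard/0.0", 5, None).unwrap();
    assert_eq!(footer, b"INDEX".to_vec());
    assert_eq!(store.requests.get(), 2);

    // A store that overrides the default needs a single request.
    let native = NativeSuffixStore(store);
    assert_eq!(native.get_suffix("shard/0.0", 5, None).unwrap(), b"INDEX".to_vec());
    assert_eq!(native.0.requests.get(), 3);
}
```

Callers that already know the size pass `Some(size)` and skip the extra request even on stores using the default, which is the ergonomic path described above.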
