clbarnes commented on issue #4611:
URL: https://github.com/apache/arrow-rs/issues/4611#issuecomment-1842891314

   > Perhaps you could expand upon why you do not know the sizes of the files
   
   I mentioned our use case in another issue; copied below 
   
   > As part of the 
[zarr](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html) project, 
we plan to store large tensors on a variety of backends (local/HTTP/object 
store), which are chunked into many separate files/objects. As part of the 
[sharding](https://zarr-specs.readthedocs.io/en/latest/v3/codecs/sharding-indexed/v1.0.html)
 specification, each chunk (=shard) could contain many sub-chunks which are 
independently encoded and then concatenated. We'd want to read a footer to find 
the byte addresses of sub-chunks (see 
https://github.com/apache/arrow-rs/issues/4611 ), and then read (possibly 
multiple) byte ranges from the shard.
   
   We don't want to list all existing chunks ahead of time as there could 
easily be many millions, and this could even change under our feet if we're 
writing the tensor as we go. As chunks may be compressed with arbitrary codecs, 
we can't predict how many bytes they'll be even if we know how large the chunks 
are; we just need to read the footer (which indexes sub-chunks) so that we then 
know which bits of the object to read.
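   
   To make the footer step concrete, here is a minimal sketch of decoding such an index from the tail of a shard. It assumes each entry is a little-endian `(offset, nbytes)` pair of `u64`s, that `u64::MAX` in both fields marks a missing sub-chunk, and that any checksum has already been stripped; `SubChunk` and `parse_shard_index` are hypothetical names, not part of zarr or arrow-rs.
   
   ```rust
   /// Byte location of one sub-chunk inside a shard.
   #[derive(Debug, PartialEq)]
   struct SubChunk {
       offset: u64,
       nbytes: u64,
   }
   
   /// Read a little-endian u64 from the first 8 bytes of `bytes`.
   fn read_u64_le(bytes: &[u8]) -> u64 {
       let mut buf = [0u8; 8];
       buf.copy_from_slice(&bytes[..8]);
       u64::from_le_bytes(buf)
   }
   
   /// Parse the index from the last `16 * n_subchunks` bytes of `suffix`.
   /// ASSUMPTION: little-endian (offset, nbytes) pairs, no trailing checksum.
   /// Returns `None` if `suffix` is too short to hold the whole index.
   fn parse_shard_index(suffix: &[u8], n_subchunks: usize) -> Option<Vec<Option<SubChunk>>> {
       let index_len = n_subchunks.checked_mul(16)?;
       let start = suffix.len().checked_sub(index_len)?;
       let entries = suffix[start..]
           .chunks_exact(16)
           .map(|entry| {
               let offset = read_u64_le(&entry[..8]);
               let nbytes = read_u64_le(&entry[8..]);
               if offset == u64::MAX && nbytes == u64::MAX {
                   None // empty slot: this sub-chunk was never written
               } else {
                   Some(SubChunk { offset, nbytes })
               }
           })
           .collect();
       Some(entries)
   }
   ```
   
   Under these assumptions the index length is known from the shard's configuration while the object length is not, which is why a suffix read is the natural request here.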
   
   I suppose in this use case we never need to read the suffix at the same time 
as the rest of the chunk, so we could have a separate method for suffix-getting 
whose default implementation issues a HEAD then a GET, and which is documented 
as possibly being slow.
   
   > I dunno, generally the approach of this crate is to encourage people 
towards patterns that behave equally well across all backends, as opposed to 
ones that will have store-specific performance pitfalls.
   
   Patterns, yes, but I hope we've demonstrated that sometimes people actually 
need a suffix. All stores can do it (with 2 requests), some stores can just do 
it better (with 1) - should we refuse to use optimisations which are only 
available to certain stores? If people already know the length (from listing or 
whatever), then they don't need to use the method documented as being possibly 
slow.
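   
   For an HTTP-backed store, the difference between the two paths can be sketched in terms of the `Range` header each one sends. This assumes the server honours range requests; the function names are hypothetical and not part of `object_store`.
   
   ```rust
   /// One request: ask the server for the last `nbytes` bytes directly,
   /// using the HTTP suffix-range form.
   fn suffix_range_header(nbytes: u64) -> String {
       format!("bytes=-{}", nbytes)
   }
   
   /// Second of two requests: once a HEAD has reported `object_size`,
   /// translate the suffix into an explicit, inclusive byte range.
   /// Returns `None` when there is nothing to fetch.
   fn explicit_range_header(object_size: u64, nbytes: u64) -> Option<String> {
       if object_size == 0 || nbytes == 0 {
           return None;
       }
       // Clamp so that asking for more than exists yields the whole object.
       let start = object_size.saturating_sub(nbytes);
       Some(format!("bytes={}-{}", start, object_size - 1))
   }
   ```
   
   The first form needs no prior knowledge of the object; the second only becomes available after the size is known, whether from a HEAD or from an earlier listing.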
   
   > GetOptionsExt is crate private
   
   Ah, yes, that's unfortunate. So 3rd party stores currently just wrangle 
their own options?
   
   I think the minimal-impact course is to keep everything as it is and just 
add something like
   
   ```rust
   pub trait ObjectStore: std::fmt::Display + Send + Sync + Debug + 'static {
       ...
   
       /// Get the last `nbytes` of an object.
       ///
       /// If the object size is not known, the default implementation first
       /// finds out with a HEAD request.
       /// Stores which support suffix requests directly should override this
       /// behaviour.
       async fn get_suffix(
           &self,
           location: &Path,
           nbytes: usize,
           object_size: Option<usize>,
       ) -> Result<GetResult> {
           // if size is None, find out with a head request
           // then do self.get_range
       }
   }
   ```
   
   Instantly works for everyone, the performance concerns are well-documented, 
there's an ergonomic path for people who need a suffix and already know the 
size, and an easy optimisation path for stores which do support it.

