H-Plus-Time commented on PR #6157:
URL: https://github.com/apache/arrow-rs/pull/6157#issuecomment-2309140943
> I'm pretty apprehensive about changing the default behaviour to use suffix
requests given there are stores that are known to not support them.
>
> I wonder what you think of instead performing the initial suffix read of
the prefetch size manually, and then constructing an AsyncFileReader with these
bytes cached? This initial fetch request would tell you the total size of the
object and so no changes would be necessary to traits, allowing us to keep the
current simple interface and avoid potential compatibility footguns, but also
allow you to avoid making additional HEAD requests for your use case?
I agree with defaulting to swapping back the default non-suffix requests
(consensus seems to be reached on that).
The manual approach isn't an option in my main use-case (browser side fetch
as the underlying IO method) - usually resources are cross-origin, and
Content-Range is not CORS-safelisted (the target origin must include it in
Access-Control-Expose-Headers), so you can't determine the object's size from a
ranged GET request alone without considerable intervention.
Shifting the suffix request path to a secondary `load_without_size`, and
probably re-emphasising the file_size parameter (that is, make it an explicit
choice on the users' part to _not_ give the reader that information), I think
that would be the least disruptive change.
Perhaps the following would be sufficient:
```rust
/// store.rs
impl ParquetObjectReader {
pub fn new(store: Arc<dyn ObjectStore>, meta: ObjectMeta) -> Self
pub fn new_without_size(store: Arc<dyn ObjectStore>, location: Path) ->
Self
}
/// metadata.rs
impl<F: MetadataFetch> MetadataLoader<F> {
pub async fn load(mut fetch: F, file_size: usize, prefetch: Option<usize>)
-> Result<Self>
pub async fn load_without_size(mut fetch: F, prefetch: Option<usize>) ->
Result<Self>
}
// fetch_parquet_metadata elided for brevity - file_size: usize ->
Option<usize>
```
With the above, the remaining footguns (calling fetch_parquet_metadata with
file_size=None, calling load_without_size, or supplying a suffix range directly
to get_bytes) would require quite a lot of intent.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]