kylebarron commented on issue #340: URL: https://github.com/apache/arrow-rs-object-store/issues/340#issuecomment-3879577208
At the Python level we're wrestling with this in https://github.com/developmentseed/obspec-utils/pull/65: what abstraction lets us handle both object stores and HTTP sources that may or may not have a content length defined?

For Parquet specifically, the Parquet crate's [`AsyncFileReader`](https://docs.rs/parquet/latest/parquet/arrow/async_reader/trait.AsyncFileReader.html) trait is very general. **But**, since Parquet's metadata is at the end of the file, you need to _either_ be able to perform suffix range requests _or_ be able to know the content length. For HTTP sources that may or may not guarantee a content length, it's hard to generically load Parquet sources.

In the context of DataFusion specifically, I think DataFusion is pretty tied to the `ObjectStore` abstraction. I'm not sure how to swap out data fetching in DataFusion for something more general, even if you only need data fetching, not other `ObjectStore` methods. I guess that's where you define your custom `TableProvider` then? Not sure if you can override the `AsyncFileReader` that the DataFusion Parquet reader uses?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
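
To make the "suffix range _or_ content length" constraint from the comment concrete, here is a minimal Python sketch of reading the Parquet footer tail. It relies only on the Parquet file layout (the file ends with a 4-byte little-endian metadata length followed by the `PAR1` magic) and ignores encrypted footers. `InMemorySource`, its `content_length` and `supports_suffix` attributes, `fetch_tail`, and `read_metadata_length` are hypothetical names for illustration; they are not the API of obspec-utils, `object_store`, or the Parquet crate.

```python
import struct
from typing import Optional

PARQUET_MAGIC = b"PAR1"
FOOTER_TAIL_LEN = 8  # 4-byte little-endian metadata length + 4-byte magic


class InMemorySource:
    """Hypothetical stand-in for an object store or HTTP source."""

    def __init__(self, data: bytes, *, knows_length: bool, supports_suffix: bool):
        self._data = data
        # None models an HTTP source that does not report a content length.
        self.content_length: Optional[int] = len(data) if knows_length else None
        self.supports_suffix = supports_suffix

    def get_range(self, start: int, end: int) -> bytes:
        # Bounded request, comparable to `Range: bytes=start-(end - 1)`.
        return self._data[start:end]

    def get_suffix(self, nbytes: int) -> bytes:
        # Suffix request, comparable to `Range: bytes=-nbytes`.
        if not self.supports_suffix:
            raise NotImplementedError("suffix range requests not supported")
        return self._data[-nbytes:]


def fetch_tail(source, nbytes: int) -> bytes:
    """Fetch the last `nbytes` using whichever capability the source has."""
    if source.supports_suffix:
        return source.get_suffix(nbytes)
    if source.content_length is not None:
        length = source.content_length
        return source.get_range(max(0, length - nbytes), length)
    raise ValueError("need either suffix range support or a known content length")


def read_metadata_length(source) -> int:
    """Return the size in bytes of the serialized Parquet file metadata."""
    tail = fetch_tail(source, FOOTER_TAIL_LEN)
    if len(tail) != FOOTER_TAIL_LEN or tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file (missing trailing magic)")
    (metadata_len,) = struct.unpack("<I", tail[:4])
    return metadata_len
```

A source that offers neither capability hits the `ValueError` branch in `fetch_tail`, which is the case that makes generic Parquet loading over HTTP hard.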
