kylebarron commented on issue #340:
URL: https://github.com/apache/arrow-rs-object-store/issues/340#issuecomment-3879577208

   At the Python level we're wrestling with this in 
https://github.com/developmentseed/obspec-utils/pull/65: what abstraction lets 
us handle both object stores and http sources that may or may not have a 
content length defined?
   
   For Parquet specifically, the Parquet crate's 
[`AsyncFileReader`](https://docs.rs/parquet/latest/parquet/arrow/async_reader/trait.AsyncFileReader.html)
 trait is very general. **But**, since Parquet's metadata is at the end of the 
file, you need to _either_ be able to perform suffix range requests _or_ be 
able to know the content length. For HTTP sources that may or may not guarantee 
a content length, it's hard to generically load Parquet sources.
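
   To make the constraint concrete, here is a minimal Python sketch of the two ways to reach the footer. The function names are hypothetical (they are not from obspec-utils or the Parquet crate); the only format fact assumed is that a Parquet file ends with a 4-byte little-endian metadata length followed by the `PAR1` magic.

```python
import struct
from typing import Optional

# Parquet trailer: 4-byte little-endian metadata length + b"PAR1" magic.
FOOTER_SIZE = 8
MAGIC = b"PAR1"


def footer_range_header(content_length: Optional[int]) -> str:
    """Return an HTTP Range header value covering the last 8 bytes.

    Hypothetical helper: if the content length is unknown, fall back to a
    suffix range, which only works when the server supports `bytes=-N`.
    """
    if content_length is None:
        return f"bytes=-{FOOTER_SIZE}"
    if content_length < FOOTER_SIZE:
        raise ValueError("file too small to contain a Parquet footer")
    start = content_length - FOOTER_SIZE
    return f"bytes={start}-{content_length - 1}"


def parse_footer(tail: bytes) -> int:
    """Return the metadata length encoded in the last 8 bytes of a file."""
    if len(tail) < FOOTER_SIZE or tail[-4:] != MAGIC:
        raise ValueError("not a Parquet footer")
    (metadata_len,) = struct.unpack("<I", tail[-8:-4])
    return metadata_len
```

   A source that offers neither a content length nor suffix ranges leaves no way to build that first request without streaming the whole body.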
   
   In the context of DataFusion specifically, I think DataFusion is pretty tied 
to the `ObjectStore` abstraction. I'm not sure how to swap out data fetching in 
DataFusion for something more general, even if you only need data fetching, not 
other `ObjectStore` methods. I guess that's where you define your custom 
`TableProvider` then? Not sure if you can override the `AsyncFileReader` that 
the DataFusion Parquet reader uses?


