tustvold opened a new issue, #2992: URL: https://github.com/apache/arrow-datafusion/issues/2992
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

We are seeing a number of projects with differing requirements for how the interaction with object store and parquet should proceed:

* Fetching multiple byte ranges in parallel - https://github.com/apache/arrow-datafusion/issues/2949
* Fetching data from sources that aren't typical object stores - https://github.com/apache/arrow-rs/issues/2230#issuecomment-1200144042

Something clearly isn't right here, and it's creating friction that prevents users from getting things working.

**Describe the solution you'd like**

The general philosophy of DataFusion is to be pluggable, and to allow for easy extension where the defaults are not applicable to the use-case. This is particularly important for the interfaces to data storage, where a lot of application-specific trade-offs will occur.

I would therefore like to propose adding an option to `ParquetExec` to specify `ParquetOpenFn` (name to be discussed).

```
type ParquetOpenFn = Box<dyn Fn(ObjectMeta) -> Result<Box<dyn AsyncFileReader>>>;
```

This will be called by `ParquetOpener` to construct the `AsyncFileReader` passed to `ParquetRecordBatchStream`.

By default this would simply construct a `ParquetFileReader`, as it does currently, but the user would be able to override this with a custom implementation as desired.

This would allow:

* Interacting with ObjectStore differently - #2949
* Calling out to something that isn't even an ObjectStore, such as a custom tiered storage engine - https://github.com/apache/arrow-rs/issues/2230#issuecomment-1200144042
* Almost certainly something else

Thoughts @thinkharderdev @Cheappie @alamb @crepererum ?

**Describe alternatives you've considered**

We could not do this.

**Additional context**
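
To make the proposed hook a bit more concrete, here is a minimal, std-only sketch of how an opener might call a user-supplied `ParquetOpenFn`. Everything except the shape of the type alias is an assumption for illustration: `ObjectMeta` and `FileReader` are simplified stand-ins (the real `AsyncFileReader` is asynchronous, whereas this trait is synchronous), and `InMemoryReader`, `in_memory_open_fn` and `read_trailing_magic` are hypothetical names that do not exist in DataFusion or arrow-rs.

```rust
use std::collections::HashMap;
use std::io;
use std::ops::Range;
use std::sync::Arc;

/// Stand-in for the object store's `ObjectMeta` (assumption: only these fields).
#[derive(Debug, Clone)]
struct ObjectMeta {
    location: String,
    size: usize,
}

/// Simplified, synchronous stand-in for parquet's `AsyncFileReader`.
trait FileReader {
    fn get_bytes(&mut self, range: Range<usize>) -> io::Result<Vec<u8>>;
}

/// The proposed hook: given a file's metadata, build the reader for it.
type ParquetOpenFn = Box<dyn Fn(ObjectMeta) -> io::Result<Box<dyn FileReader>>>;

/// Hypothetical custom backend that is not an object store: bytes held in memory.
struct InMemoryReader {
    data: Arc<Vec<u8>>,
}

impl FileReader for InMemoryReader {
    fn get_bytes(&mut self, range: Range<usize>) -> io::Result<Vec<u8>> {
        // `get` returns None for out-of-bounds ranges instead of panicking.
        self.data
            .get(range)
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds"))
    }
}

/// Builds a `ParquetOpenFn` that serves files from an in-memory map.
fn in_memory_open_fn(files: HashMap<String, Arc<Vec<u8>>>) -> ParquetOpenFn {
    Box::new(move |meta: ObjectMeta| {
        let data = files
            .get(&meta.location)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, meta.location.clone()))?;
        Ok(Box::new(InMemoryReader { data }) as Box<dyn FileReader>)
    })
}

/// Roughly what an opener would do: call the hook, then read through the
/// returned reader. Here it fetches the last 4 bytes of the file.
fn read_trailing_magic(open: &ParquetOpenFn, meta: ObjectMeta) -> io::Result<Vec<u8>> {
    let size = meta.size;
    let mut reader = open(meta)?;
    reader.get_bytes(size.saturating_sub(4)..size)
}

fn main() -> io::Result<()> {
    let mut files = HashMap::new();
    // Fake file contents; a real Parquet file ends with the magic bytes "PAR1".
    files.insert("data/a.parquet".to_string(), Arc::new(b"PAR1....PAR1".to_vec()));

    let open = in_memory_open_fn(files);
    let meta = ObjectMeta { location: "data/a.parquet".to_string(), size: 12 };

    let magic = read_trailing_magic(&open, meta)?;
    assert_eq!(magic, b"PAR1".to_vec());
    Ok(())
}
```

The point of the sketch is that the caller never learns where the bytes come from: swapping `in_memory_open_fn` for a closure that talks to an object store, or to a tiered storage engine, would leave `read_trailing_magic` unchanged.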
