sergiimk opened a new issue #1384: URL: https://github.com/apache/arrow-datafusion/issues/1384
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

The current API (`register_parquet`) only allows specifying either a single parquet file or an entire directory (prefix). There are cases where the application has **extra knowledge** that can narrow the list of files needed to execute a query down to a much smaller subset, e.g. reducing the number of files scanned from thousands to just a few.

**Describe the solution you'd like**

It would be very useful to have an API that creates a table from a **list of specific parquet files**.

**Describe alternatives you've considered**

In datafusion `v5` I had a custom implementation of `ParquetTable` (`TableProvider`) that accepted a `Vec<Path>`. In `v6` the logic that decides which files are read is spread across `ListingOptions`, `ObjectStore`, and `ListingTable`, and I don't see an easy way to customize it. I tried to implement a wrapper on top of `LocalFileSystem` (`ObjectStore`)... but, oh my, working with async streams is a pain.

**Additional context**

I'm working on a stream processing tool that persists stream data as a series of parquet files, like so:

```
/my-data-stream/
|-- 2c4e3cee4
|-- 37bf92a55
|-- 3c31ccace
| ...
```

A stream can contain many thousands of files, and the tool often queries just the few most recent event batches (e.g. `tail -n 100`: show the last 100 events in the stream). I'd like to avoid the overhead of scanning the entire directory and ordering records by time.
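To illustrate the "extra knowledge" idea: the application can compute the short list of files itself (for example, the most recent N batches) and would then hand only that list to the requested file-list API. A minimal std-only Rust sketch of that selection step, where the `most_recent` helper and the `(name, timestamp)` pairs are hypothetical stand-ins for real file metadata, not any DataFusion API:

```rust
/// Return the names of the `n` files with the largest timestamps.
/// `(name, modified-time)` pairs stand in for real directory metadata
/// that the application already tracks out of band.
fn most_recent(mut files: Vec<(String, u64)>, n: usize) -> Vec<String> {
    // Sort newest first, then keep the first `n` names.
    files.sort_by(|a, b| b.1.cmp(&a.1));
    files.into_iter().take(n).map(|(name, _)| name).collect()
}

fn main() {
    let files = vec![
        ("2c4e3cee4".to_string(), 100),
        ("37bf92a55".to_string(), 300),
        ("3c31ccace".to_string(), 200),
    ];
    // The short list produced here is what a file-list registration
    // API would receive instead of a whole directory prefix.
    let recent = most_recent(files, 2);
    assert_eq!(recent, vec!["37bf92a55", "3c31ccace"]);
    println!("{:?}", recent);
}
```

The key point is that this selection happens in application code, using knowledge DataFusion does not have, so the table provider never needs to list or stat the thousands of other files in the directory.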
