sergiimk opened a new issue #1384:
URL: https://github.com/apache/arrow-datafusion/issues/1384


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   The current API (`register_parquet`) only allows specifying a single parquet
file or an entire directory (prefix).
   
   There are cases where the application has **extra knowledge** that can narrow
the list of files needed to execute a query down to a much smaller subset, e.g.
reducing the number of files scanned from thousands to just a few.
   
   **Describe the solution you'd like**
   It would be very useful to have an API that creates a table from a **list of 
specific parquet files**.
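   For illustration only, such a call might look like the sketch below. The `Ctx`
struct and the `register_parquet_files` name are hypothetical stand-ins, not part
of the actual DataFusion API:

   ```rust
   use std::path::PathBuf;

   // Hypothetical stand-in for an execution context; the real
   // DataFusion `ExecutionContext` is not modeled here.
   struct Ctx {
       tables: Vec<(String, Vec<PathBuf>)>,
   }

   impl Ctx {
       // Hypothetical counterpart to `register_parquet` that takes an
       // explicit list of files instead of a single path or prefix.
       fn register_parquet_files(&mut self, name: &str, files: Vec<PathBuf>) {
           self.tables.push((name.to_string(), files));
       }
   }

   fn main() {
       let mut ctx = Ctx { tables: Vec::new() };
       // Register only the files the application already knows it needs.
       ctx.register_parquet_files(
           "my_stream",
           vec![
               PathBuf::from("/my-data-stream/2c4e3cee4"),
               PathBuf::from("/my-data-stream/37bf92a55"),
           ],
       );
       assert_eq!(ctx.tables[0].1.len(), 2);
   }
   ```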
   
   **Describe alternatives you've considered**
   In datafusion `v5` I had a custom implementation of `ParquetTable` 
(`TableProvider`) that accepted a `Vec<Path>`.
   
   In `v6` the logic that determines which files are read is spread across
`ListingOptions`, `ObjectStore`, and `ListingTable`, and I don't see an easy way
to customize it.
   
   I tried to implement a wrapper on top of `LocalFileSystem` (`ObjectStore`) 
... but, oh my, working with async streams is a pain.
   
   **Additional context**
   I'm working on a stream processing tool that persists stream data in a 
series of parquet files like so:
   
   ```
   /my-data-stream/
   |-- 2c4e3cee4
   |-- 37bf92a55
   |-- 3c31ccace
   | ...
   ```
   
   There can be many thousands of files in a stream, and the tool often queries
just the few most recent event batches (e.g. `tail -n 100`: show the last 100
events in the stream). I'd like to avoid the overhead of scanning the entire
directory and ordering records by time.
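   The selection step itself needs nothing from DataFusion. A plain-Rust sketch
of picking the most recent files from the stream directory, assuming file
modification time approximates batch recency, could feed a file-list API like
the one requested above:

   ```rust
   use std::fs;
   use std::io;
   use std::path::PathBuf;
   use std::time::SystemTime;

   // Return the `n` most recently modified files in `dir`, newest first.
   // This models the application-side "extra knowledge" step; the resulting
   // list is what a file-list registration API would consume.
   fn most_recent_files(dir: &str, n: usize) -> io::Result<Vec<PathBuf>> {
       let mut entries: Vec<(SystemTime, PathBuf)> = fs::read_dir(dir)?
           .filter_map(|entry| entry.ok())
           .filter(|entry| entry.path().is_file())
           .filter_map(|entry| {
               let modified = entry.metadata().ok()?.modified().ok()?;
               Some((modified, entry.path()))
           })
           .collect();
       entries.sort_by(|a, b| b.0.cmp(&a.0)); // newest first
       Ok(entries.into_iter().take(n).map(|(_, path)| path).collect())
   }

   fn main() {
       // `/my-data-stream` is the example directory from above.
       for file in most_recent_files("/my-data-stream", 100).unwrap_or_default() {
           println!("{}", file.display());
       }
   }
   ```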
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

