[I] No efficient way to load a subset of files from partitioned table [arrow-datafusion]

via GitHub Thu, 18 Jan 2024 07:23:57 -0800


rspears74 opened a new issue, #8906:
URL: https://github.com/apache/arrow-datafusion/issues/8906


   ### Is your feature request related to a problem or challenge?
   
   As far as I can tell, there is no good way to load a subset of files from a 
partitioned table. Using `ListingTable` or another `TableProvider` like 
`DeltaTableProvider` from `deltalake`, I'm able to `read_table`, but this loads 
the entire table. I can also load a list of parquet files with `read_parquet`, 
but this doesn't work with partitioned tables when the partition columns are 
not "materialized" in the raw parquet files. The only way I've found to load 
a subset of partitioned files is to iterate over a list of file paths, run the 
entire `TableProvider`/`read_table` process on each one individually, and 
`union` the results together.
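
To illustrate why non-"materialized" partition columns get lost when raw files are read directly, here is a minimal sketch using only the Rust standard library. It assumes a Hive-style layout where partition values appear as `key=value` directory names in the file path; the function name `partition_values` and the example paths are hypothetical, and this is not DataFusion's or `deltalake`'s actual implementation.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: collect `key=value` directory segments from a path.
/// Assumes Hive-style partitioning; the final segment (the file name) is ignored.
fn partition_values(path: &str) -> BTreeMap<String, String> {
    let mut values = BTreeMap::new();
    let mut segments: Vec<&str> = path.split('/').collect();
    // Drop the file name so a `=` inside it is never mistaken for a partition.
    segments.pop();
    for segment in segments {
        if let Some((key, value)) = segment.split_once('=') {
            if !key.is_empty() {
                values.insert(key.to_string(), value.to_string());
            }
        }
    }
    values
}
```

Under this assumption, a path such as `table/year=2024/month=01/part-0.parquet` carries `year` and `month` only in its directory names, so a reader that looks solely at the file contents has no way to produce those columns.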
   
   ### Describe the solution you'd like
   
   It seems like it would be nice to be able to create a `TableProvider` with a 
table path, then pass some sort of file "whitelist" in. Maybe a 
`read_table_files(TableProvider, impl IntoIterator<Item = String>)`.
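
The intended "whitelist" semantics could be sketched roughly as follows, using only the Rust standard library. The function `select_files` is hypothetical and stands in for whatever filtering step a real `read_table_files` would apply to the provider's file listing; it is not an existing DataFusion API.

```rust
use std::collections::HashSet;

/// Hypothetical helper: keep only the listed files that appear in the whitelist.
/// The result preserves the order of `listed`; whitelist entries that are not
/// part of the table listing are silently ignored.
fn select_files<I>(listed: &[String], whitelist: I) -> Vec<String>
where
    I: IntoIterator<Item = String>,
{
    let allowed: HashSet<String> = whitelist.into_iter().collect();
    listed
        .iter()
        .filter(|path| allowed.contains(*path))
        .cloned()
        .collect()
}
```

Filtering the listing once and scanning the surviving files in a single pass is what would avoid the per-file provider setup and the chain of `union`s described above.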
   
   ### Describe alternatives you've considered
   
   As stated above, I've tried reading the files one-by-one and `union`ing 
results, but it's shockingly inefficient compared to reading all files at once.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

