[GitHub] [arrow-datafusion] alamb opened a new issue #924: Add a separate configuration setting for parallelism of scanning parquet files

GitBox Sun, 22 Aug 2021 04:06:27 -0700


alamb opened a new issue #924:
URL: https://github.com/apache/arrow-datafusion/issues/924



   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   When reading multiple parquet files, DataFusion will sometimes request many 
file handles from the OS concurrently. This is both inefficient (each file 
handles takes up memory, requires system calls, etc) as well as leads to "too 
many open files" types errors. 
   
   Depending on how fast IO comes in and the details of the Tokio scheduler, 
sometimes it will have far too many open files at once (it might end up opening 
100 input parquet files, for example, even if there are only 8 cores available 
for processing) 
   
   
   
   
   **Describe the solution you'd like**
   As described by @Dandandan  in 
https://github.com/apache/arrow-datafusion/pull/706/files#r667508175 it would 
be nice to decouple the setting for number of concurrent parquet files scanned 
with the number of target partitions for other operators.
   
   So the idea would be to add a new config setting `parquet_partitions` or  
perhaps`filesource_partitions`  that would control the number of parquet 
"partitions" created and thus the number of file handles to run datafusion plans
   
   
   **Describe alternatives you've considered**
   @andygrove has mentioned the Ballista scheduler is more sophisticated in 
this area and hopefully we can move some of those improvements down into the 
core DataFusion engine
   
   
   
   **Additional context**
   There are reports in arrow-rs of "too many open files" 
https://github.com/apache/arrow-rs/issues/47#issuecomment-903221764 which may 
also be helped by this feature, though there is probably more work as well


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb opened a new issue #924: Add a separate configuration setting for parallelism of scanning parquet files

Reply via email to