[I] Improve performance of `list_files_for_scan` when not collecting statistics [arrow-datafusion]

via GitHub Tue, 13 Feb 2024 08:59:08 -0800


matthewmturner opened a new issue, #9219:
URL: https://github.com/apache/arrow-datafusion/issues/9219


   ### Is your feature request related to a problem or challenge?
   
   Right now, DataFusion's planning performance is a bottleneck for our 
application.  We noticed that a non-negligible amount of work is done in 
`get_statistics_with_limit` even when statistics collection is disabled.  In 
particular, the work done within `multiunzip`, which constructs the 
`ColumnStatistics`, looks unnecessary to us in that case.  We would like to add 
some logic to improve performance when statistics collection is disabled.
   
   ### Describe the solution you'd like
   
   In the `list_files_for_scan` method of `ListingTable`, we would like to 
update the logic for getting the list of `PartitionedFile`s based on the value 
of `self.options.collect_stat`.
   
   So going from
   ```rust
   let (files, statistics) = get_statistics_with_limit(files, self.schema(), limit).await?;
   ```
   
   to something like
   ```rust
   let (files, statistics) = match self.options.collect_stat {
       true => get_statistics_with_limit(files, self.schema(), limit).await?,
       false => get_files_with_unknown_stats(files, self.schema(), limit).await?
   };
   ```
   
   Where `get_files_with_unknown_stats` avoids the call to `multiunzip`.
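
A rough, synchronous sketch of the idea, not the actual DataFusion code: the 
types below are simplified, hypothetical stand-ins (`Option` plays the role of 
"unknown"), the function takes a column count instead of a schema, and the 
`limit` argument is left out because without row counts there is nothing to 
apply it to. The point is only that the files pass through untouched and the 
statistics are filled with placeholders.

```rust
// Simplified, hypothetical stand-ins for DataFusion's types (not the real definitions).
#[derive(Debug, Clone, PartialEq)]
struct PartitionedFile {
    path: String,
}

#[derive(Debug, Clone, PartialEq)]
struct ColumnStatistics {
    null_count: Option<usize>, // None = unknown
}

#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    num_rows: Option<usize>, // None = unknown
    column_statistics: Vec<ColumnStatistics>,
}

impl Statistics {
    /// Statistics in which every value is unknown, one entry per column.
    fn new_unknown(num_columns: usize) -> Self {
        Statistics {
            num_rows: None,
            column_statistics: vec![ColumnStatistics { null_count: None }; num_columns],
        }
    }
}

/// Collects the files as-is and pairs them with all-unknown statistics.
/// No per-file statistics are read or merged; the only per-column work is
/// building one placeholder entry per column.
fn get_files_with_unknown_stats(
    files: impl IntoIterator<Item = PartitionedFile>,
    num_columns: usize,
) -> (Vec<PartitionedFile>, Statistics) {
    let files: Vec<PartitionedFile> = files.into_iter().collect();
    (files, Statistics::new_unknown(num_columns))
}
```

In the real method the input would be an async stream of files and the schema 
would determine the number of columns.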
   
   ### Describe alternatives you've considered
   
   An alternative approach could be adding a parameter to 
`get_statistics_with_limit` for `collect_stats` and calling `multiunzip` based 
on that.
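
As a toy illustration of this alternative (hypothetical names and types, and a 
plain loop standing in for the `multiunzip`-based merging), the flag would 
simply gate the per-column work inside a single function:

```rust
/// Hypothetical per-file statistics: one null count per column.
struct FileStats {
    null_counts: Vec<usize>,
}

/// Sums null counts per column across files, but only when `collect_stats` is set.
/// Otherwise returns `None` (unknown) without looking at the per-file data.
fn merge_null_counts(
    files: &[FileStats],
    num_columns: usize,
    collect_stats: bool,
) -> Option<Vec<usize>> {
    if !collect_stats {
        return None;
    }
    let mut totals = vec![0usize; num_columns];
    for file in files {
        for (total, count) in totals.iter_mut().zip(&file.null_counts) {
            *total += *count;
        }
    }
    Some(totals)
}
```

This keeps one code path, at the cost of an extra parameter on every call.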
   
   ### Additional context
   
   Our application has a low-latency requirement, and in our current setup 
DataFusion's planning performance is our bottleneck.  We will eventually turn 
on statistics collection, but right now we can't, so we are looking to improve 
planning performance where we can.
   
   Based on our internal benchmarks, physical planning performance improved by 
16-43% after making the change described above (we have a lot of files, so the 
impact can be large; with fewer files the impact will probably be smaller or 
even negligible).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
