matthewmturner opened a new issue, #9219:
URL: https://github.com/apache/arrow-datafusion/issues/9219
### Is your feature request related to a problem or challenge?
Right now DataFusion's planning performance is a bottleneck for our
application. We noticed that a non-negligible amount of work is done in
`get_statistics_with_limit` even when statistics collection is disabled. In
particular, the work done within `multiunzip`, which constructs the
`ColumnStatistics`, is unnecessary in that case. We would like to add some
logic to skip it and improve performance when `collect_stat` is disabled.
### Describe the solution you'd like
In the `list_files_for_scan` method of `ListingTable`, we would like to
update the logic for getting the list of `PartitionedFile`s based on the value
of `self.options.collect_stat`.
So going from
```rust
let (files, statistics) =
    get_statistics_with_limit(files, self.schema(), limit).await?;
```
to something like
```rust
let (files, statistics) = match self.options.collect_stat {
    true => get_statistics_with_limit(files, self.schema(), limit).await?,
    false => get_files_with_unknown_stats(files, self.schema(), limit).await?,
};
```
where `get_files_with_unknown_stats` avoids the call to `multiunzip`.
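To make the idea concrete, here is a minimal, synchronous sketch of what such a function could look like. It does not use DataFusion's real types: `Precision`, `ColumnStatistics`, `Statistics`, and `PartitionedFile` below are simplified stand-ins defined only for illustration, and the signature (a column count instead of a schema, no `async`, no `Result`) is an assumption. It also assumes that `limit` is a row limit, so that without row counts it cannot be used to stop listing files early.

```rust
// Simplified stand-ins for illustration only; these are NOT DataFusion's types.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Precision {
    Exact(usize),
    Absent,
}

#[derive(Debug, Clone, PartialEq)]
struct ColumnStatistics {
    null_count: Precision,
    distinct_count: Precision,
}

#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    num_rows: Precision,
    total_byte_size: Precision,
    column_statistics: Vec<ColumnStatistics>,
}

impl Statistics {
    /// Build statistics in which every value is unknown.
    fn new_unknown(num_columns: usize) -> Self {
        let unknown_column = ColumnStatistics {
            null_count: Precision::Absent,
            distinct_count: Precision::Absent,
        };
        Statistics {
            num_rows: Precision::Absent,
            total_byte_size: Precision::Absent,
            column_statistics: vec![unknown_column; num_columns],
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
struct PartitionedFile {
    path: String,
}

/// Hypothetical helper: collect the files without touching per-file
/// statistics, so no per-column accumulation (and no `multiunzip`) is needed.
/// `_limit` is kept only to mirror the call site in the issue; this sketch
/// assumes it is a row limit, which cannot prune files when row counts are
/// unknown, so it is ignored.
fn get_files_with_unknown_stats(
    files: impl IntoIterator<Item = PartitionedFile>,
    num_columns: usize,
    _limit: Option<usize>,
) -> (Vec<PartitionedFile>, Statistics) {
    let files: Vec<PartitionedFile> = files.into_iter().collect();
    (files, Statistics::new_unknown(num_columns))
}
```

The point of the sketch is that the unknown-statistics path is a single pass that collects files and builds one placeholder `Statistics` value, independent of how many files there are.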
### Describe alternatives you've considered
An alternative approach would be to add a `collect_stats` parameter to
`get_statistics_with_limit` and call `multiunzip` only when it is `true`.
### Additional context
Our application has low-latency requirements, and in our current setup
DataFusion's planning performance is our bottleneck. We will eventually turn
on statistics collection, but right now we can't, so we are looking to
improve planning performance where we can.
In our internal benchmarks, physical planning performance improved by 16-43%
after making the change described above. We have a lot of files, so the
impact can be large; with fewer files the impact will probably be smaller or
even negligible.