suremarc commented on issue #15191: URL: https://github.com/apache/datafusion/issues/15191#issuecomment-2758090094
> > The only reason it is not needed here is because there are fewer files than `target_partitions`, so this will not work if we increase the number of files or reduce `target_partitions`. If we set `target_partitions` to 1 then it requires a sort:
>
> I reread the codebase, and also think so.
>
> [FileScanConfig::split_groups_by_statistics](https://github.com/apache/datafusion/blob/main/datafusion/datasource/src/file_scan_config.rs#L569) definitely can solve the problem, then we can remove unnecessary `SortExec` which will be significant gains!
>
> One question: Is there something that makes it difficult to turn on `split_groups_by_statistics` by default?

I left my [full answer](https://github.com/apache/datafusion/issues/10336#issuecomment-2758082825) on that issue as I don't want to take over this issue too much, but TL;DR we need benchmarks for tables with large numbers of files.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org