leoyvens commented on PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2766568954

   I took some time to play with this, so I can provide an anecdotal report.
   
   **Conclusion**
   In my setup, this PR is a clear win for execution time.
   
   **Configurations**
   I compared three datafusion-cli configurations:
   - Main, default: with `split_file_groups_by_statistics = false`
   - Main, split by stat: with `split_file_groups_by_statistics = true`
   - Branch, split by stat: this branch with `split_file_groups_by_statistics = true`
   
   In all three, `collect_statistics` was set to `true`.
   
   **Data and Query**
   Dataset contains ~90 files, `n_cpus=60`. The table is created with:
   ```sql
   create external table ordered_table stored as parquet location '<my_data>' with order (block_num);
   ```
   There is zero overlap across files in the order column.
   
   The query is:
   ```sql
   select * from ordered_table order by block_num limit 5000000;
   ```
   
   **Plan analysis**
   - Main, default: `DataSourceExec` with `60` groups, then a `SortPreservingMergeExec`.
     - Groups: `[['000000000.parquet', '000439510.parquet'], ['000879020.parquet', '001318530.parquet'], ['001758040.parquet', '002197550.parquet'], ...]`
   - Main, split by stat: `DataSourceExec` with a single sorted group, no `SortPreservingMergeExec`.
   - Branch, split by stat: `DataSourceExec` with `60` groups, then a `SortPreservingMergeExec`.
     - Groups: `[['000000000.parquet', '021996315.parquet'], ['000439510.parquet', '021998105.parquet'], ['000879020.parquet', '021999879.parquet'], ...]`
   
   What is interesting here is that my filenames sort lexicographically, so the default file grouping puts adjacent files into the same group. That is pessimal: it forces `SortPreservingMergeExec` to read one group after the other. Meanwhile, the grouping in this branch spreads adjacent files across groups, which is optimal and allows the merge exec to read groups in parallel.
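
To make the difference between the two groupings concrete, here is a minimal sketch. It is not DataFusion's grouping code and not necessarily the algorithm this PR uses: the file names, the group count, and the helpers `chunk_groups`, `round_robin_groups` and `groups_touched` are all hypothetical. It only counts how many groups hold the files that an ordered `LIMIT` query needs first.

```python
def chunk_groups(files, n_groups):
    """Consecutive chunking: neighbouring files share a group."""
    size = -(-len(files) // n_groups)  # ceiling division
    return [files[i:i + size] for i in range(0, len(files), size)]


def round_robin_groups(files, n_groups):
    """Deal files out in turn: file i goes to group i % n_groups."""
    return [files[i::n_groups] for i in range(n_groups)]


def groups_touched(groups, first_files):
    """Count the groups that hold at least one of the given files."""
    wanted = set(first_files)
    return sum(1 for group in groups if wanted & set(group))


# Hypothetical file names, already sorted by block_num with no overlap.
files = [f"part-{k}.parquet" for k in range(9)]

chunked = chunk_groups(files, 3)
dealt = round_robin_groups(files, 3)

# Suppose the LIMIT is satisfied by the first three files of the global order.
print(groups_touched(chunked, files[:3]))  # 1: a single group feeds the merge
print(groups_touched(dealt, files[:3]))    # 3: every group feeds the merge
```

With consecutive chunking, all the rows the merge needs first sit in one group, so the other groups have nothing to contribute yet. With the interleaved layout, each group holds one of the first files, so the groups can be read concurrently.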
   
   **Timings**
   Not rigorously measured; object store noise is expected:
   - Main, default: 6 to 10 seconds.
   - Main, split by stat: 6 to 10 seconds.
   - Branch, split by stat: consistently ~4 seconds.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

