zhuqi-lucas commented on PR #21182:
URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4146722599

   @adriangb I've removed the `redistribute_files_across_groups_by_statistics` 
logic after deeper analysis. Here's the reasoning:
   
   **The problem with redistribution:**
   
   The planning-phase bin-packing produces interleaved groups like:
   ```
   Group 0: [f1(1-10), f3(21-30)]
   Group 1: [f2(11-20), f4(31-40)]
   ```
   
   If we redistribute consecutively:
   ```
   Group 0: [f1(1-10), f2(11-20)]    ← all values < Group 1
   Group 1: [f3(21-30), f4(31-40)]
   ```
   
   This looks cleaner, but SPM (`SortPreservingMergeExec`) would read **all** of 
Group 0 first, because every value in it is smaller than any value in Group 1, 
and only then read Group 1. Meanwhile the other partition sits idle: 
**effectively single-threaded I/O**.
   
   **Why interleaved is better:**
   
   With the original interleaved groups, SPM alternates pulling from both 
partitions:
   ```
   SPM: pull P0 [1-10] → pull P1 [11-20] → pull P0 [21-30] → pull P1 [31-40]
   ```
   Both partitions are actively scanning files simultaneously — **true parallel 
I/O**.
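
To make the two pull patterns concrete, here is a small, self-contained sketch. It is a toy model and not DataFusion code: `FileRange`, `pull_order`, `interleaved`, and `consecutive` are hypothetical names, and the model assumes the merge consumes whole non-overlapping files, always taking the partition whose next unread file has the smallest minimum.

```rust
/// A file summarized by the min/max of its sort key (hypothetical type).
#[allow(dead_code)]
#[derive(Clone, Copy, Debug)]
struct FileRange {
    min: i64,
    max: i64,
}

/// Returns the partition index chosen at each step of a simplified merge.
/// Assumes files never overlap, so comparing `min` values is enough.
fn pull_order(groups: &[Vec<FileRange>]) -> Vec<usize> {
    let mut cursors = vec![0usize; groups.len()];
    let mut order = Vec::new();
    loop {
        // Among partitions with unread files, pick the smallest next `min`.
        let next = groups
            .iter()
            .enumerate()
            .filter(|(p, g)| cursors[*p] < g.len())
            .min_by_key(|(p, g)| g[cursors[*p]].min)
            .map(|(p, _)| p);
        match next {
            Some(p) => {
                order.push(p);
                cursors[p] += 1;
            }
            None => return order,
        }
    }
}

/// Layout produced by planning-phase bin-packing: f1, f3 | f2, f4.
fn interleaved() -> Vec<Vec<FileRange>> {
    vec![
        vec![FileRange { min: 1, max: 10 }, FileRange { min: 21, max: 30 }],
        vec![FileRange { min: 11, max: 20 }, FileRange { min: 31, max: 40 }],
    ]
}

/// Layout after consecutive redistribution: f1, f2 | f3, f4.
fn consecutive() -> Vec<Vec<FileRange>> {
    vec![
        vec![FileRange { min: 1, max: 10 }, FileRange { min: 11, max: 20 }],
        vec![FileRange { min: 21, max: 30 }, FileRange { min: 31, max: 40 }],
    ]
}
```

In this model, `pull_order(&interleaved())` alternates between the partitions (`[0, 1, 0, 1]`), while `pull_order(&consecutive())` drains partition 0 before touching partition 1 (`[0, 0, 1, 1]`).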
   
   The core optimization (per-partition sort elimination via statistics-based 
non-overlapping detection) works the same either way. So I removed the 
redistribution to keep the code simpler and preserve parallel I/O.
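
For readers unfamiliar with the idea, the non-overlapping check can be sketched roughly as follows. This is an illustration under assumptions, not the code in the PR: `FileStats` and `is_non_overlapping` are hypothetical names, the files are assumed to be listed in their intended read order, and treating equal boundary values as acceptable is my assumption.

```rust
/// Min/max statistics for one file's sort key (hypothetical type).
#[derive(Clone, Copy, Debug)]
struct FileStats {
    min: i64,
    max: i64,
}

/// Returns true when each file ends before (or exactly where) the next one
/// starts, so reading the files back to back already yields sorted output.
/// Also checks that every file's own range is well formed.
fn is_non_overlapping(files: &[FileStats]) -> bool {
    files.iter().all(|f| f.min <= f.max)
        && files.windows(2).all(|pair| pair[0].max <= pair[1].min)
}
```

Applied to the example above, both interleaved groups (`[f1, f3]` and `[f2, f4]`) pass this check, as do both consecutive groups, which is why the per-partition sort can be dropped in either layout.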
   
   Latest commit: removed `redistribute_files_across_groups_by_statistics`, 
updated comments with the reasoning, and replaced the redistribution tests with 
a test that verifies groups are preserved as-is.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

