zhuqi-lucas commented on PR #21182: URL: https://github.com/apache/datafusion/pull/21182#issuecomment-4182101292
Update: found the root cause of Q1/Q3 regression and a fix.

**Root cause**: `SortPreservingMergeExec` uses `spawn_buffered(stream, 1)`: only 1 batch is prefetched per partition. With SortExec, all data is pre-buffered in memory so SPM reads are I/O-free. Without SortExec (our optimization), SPM pulls directly from DataSourceExec, hitting Parquet I/O on each poll. The merge loop stalls waiting for I/O.

**Fix**: increase SPM buffer from 1 to 32. This lets background tasks prefetch more batches, decoupling I/O from the merge computation.

Local results (release, 16 partitions):

| Query | Main (buf=1) | PR (buf=1) | PR (buf=32) |
|-------|-------------|------------|-------------|
| Q1 full scan | 110ms | 180ms | **80ms** |
| Q2 LIMIT 100 | 9ms | 3ms | **3ms** |
| Q3 SELECT * | 239ms | 305ms | **197ms** |
| Q4 LIMIT 100 | 35ms | 7ms | **7ms** |

All queries faster than main, zero regression. All tests pass. Pushing now.
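For readers unfamiliar with the prefetch pattern described above, here is a minimal sketch of the idea. It is not DataFusion's code: the real `spawn_buffered` works on async record batch streams, whereas this sketch swaps in OS threads and `std::sync::mpsc::sync_channel` so it runs with the standard library alone. The names `spawn_buffered_partition`, `Cursor`, `merge_sorted`, and `merge_partitions` are hypothetical, and plain `i32` values stand in for sorted record batches.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Hypothetical stand-in for a buffered partition stream: a background
/// thread produces sorted batches into a bounded channel of size `buffer`.
/// The producer blocks once the channel is full, so `buffer` caps how far
/// ahead of the merge it is allowed to run.
fn spawn_buffered_partition(batches: Vec<Vec<i32>>, buffer: usize) -> Receiver<Vec<i32>> {
    let (tx, rx) = sync_channel(buffer);
    thread::spawn(move || {
        for batch in batches {
            // In a real scan, the (slow) file read would happen here.
            if tx.send(batch).is_err() {
                break; // the consumer went away, stop producing
            }
        }
    });
    rx
}

/// Read position within one partition: the current batch plus its channel.
struct Cursor {
    rx: Receiver<Vec<i32>>,
    batch: std::vec::IntoIter<i32>,
}

impl Cursor {
    /// Next value from this partition; blocks only if no batch is buffered.
    fn next_value(&mut self) -> Option<i32> {
        loop {
            if let Some(v) = self.batch.next() {
                return Some(v);
            }
            match self.rx.recv() {
                Ok(b) => self.batch = b.into_iter(),
                Err(_) => return None, // producer finished
            }
        }
    }
}

/// K-way merge of already-sorted partitions using a min-heap.
fn merge_sorted(partitions: Vec<Receiver<Vec<i32>>>) -> Vec<i32> {
    let mut cursors: Vec<Cursor> = partitions
        .into_iter()
        .map(|rx| Cursor { rx, batch: Vec::new().into_iter() })
        .collect();
    let mut heap = BinaryHeap::new();
    for (i, c) in cursors.iter_mut().enumerate() {
        if let Some(v) = c.next_value() {
            heap.push(Reverse((v, i)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((v, i))) = heap.pop() {
        out.push(v);
        if let Some(next) = cursors[i].next_value() {
            heap.push(Reverse((next, i)));
        }
    }
    out
}

/// Convenience wrapper: spawn every partition with the same buffer size.
fn merge_partitions(parts: Vec<Vec<Vec<i32>>>, buffer: usize) -> Vec<i32> {
    let receivers = parts
        .into_iter()
        .map(|p| spawn_buffered_partition(p, buffer))
        .collect();
    merge_sorted(receivers)
}
```

The buffer size does not change the merged output, only how much work each producer can do before it has to wait for the merge to catch up. Calling `merge_partitions(parts, 1)` and `merge_partitions(parts, 32)` on the same input should therefore return identical results, with the larger buffer giving slow producers more room to hide their latency.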
