Dandandan opened a new pull request, #21640: URL: https://github.com/apache/datafusion/pull/21640
## Which issue does this PR close?

Follow-up to #21351 (dynamic work scheduling in FileStream).

## Rationale for this change

When a scan has few large files, idle sibling streams have nothing to steal even after #21351 enables work sharing. This PR splits large files by byte range so work can be distributed more evenly across partitions.

## What changes are included in this PR?

Adds morsel splitting to `SharedWorkSource::pop_front()`:

- **Splitting**: when queue depth < `2 * target_partitions` and a file's projected size >= 2 MiB, splits the file in half by byte range and pushes the second half back onto the shared queue
- **Projected size estimation**: uses per-column `byte_size` from `PartitionedFile.statistics` when available (e.g. from Parquet column stats), otherwise falls back to `raw_file_size * (projected_cols / total_cols)`
- **Minimum morsel size**: 1 MiB; files are never split below this threshold

## Are these changes tested?

Existing tests pass. Additional tests for the splitting logic to be added.

## Are there any user-facing changes?

Faster performance for queries scanning few large files with multiple partitions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
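For reviewers, the splitting rule described under "What changes are included in this PR?" can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Morsel` struct, its fields, and the free function `pop_front` are hypothetical stand-ins for `PartitionedFile` and `SharedWorkSource::pop_front()`. It also assumes that queue depth is measured after the pop, that the second half goes to the front of the queue, and that stats-based sizes are halved along with the byte range. Only the 2 MiB and 1 MiB thresholds and the column-ratio fallback come from the description.

```rust
use std::collections::VecDeque;

const MIB: u64 = 1024 * 1024;
/// Projected size at or above which a file may be split (from the PR description).
const SPLIT_THRESHOLD: u64 = 2 * MIB;
/// Neither half may have a projected size below this (from the PR description).
const MIN_MORSEL: u64 = MIB;

/// Hypothetical stand-in for a queued file or byte range of a file.
#[derive(Debug, Clone, PartialEq)]
pub struct Morsel {
    pub path: String,
    pub start: u64, // inclusive byte offset
    pub end: u64,   // exclusive byte offset
    /// Sum of per-column byte sizes for the projected columns, if statistics exist.
    pub stats_projected_bytes: Option<u64>,
    pub projected_cols: u64,
    pub total_cols: u64,
}

/// Estimate how many bytes the scan will actually read for this morsel.
pub fn projected_size(m: &Morsel) -> u64 {
    let raw = m.end - m.start;
    match m.stats_projected_bytes {
        // Prefer statistics when they are available.
        Some(bytes) => bytes,
        // No column information at all: fall back to the raw length.
        None if m.total_cols == 0 => raw,
        // Fallback: raw size scaled by the fraction of columns projected.
        None => (raw as u128 * m.projected_cols as u128 / m.total_cols as u128) as u64,
    }
}

/// Pop the next morsel, splitting it in half when the queue is running low.
pub fn pop_front(queue: &mut VecDeque<Morsel>, target_partitions: usize) -> Option<Morsel> {
    let mut first = queue.pop_front()?;
    let size = projected_size(&first);
    // Assumption: depth is checked after removing the popped item.
    let queue_is_shallow = queue.len() < 2 * target_partitions;

    if queue_is_shallow && size >= SPLIT_THRESHOLD && size / 2 >= MIN_MORSEL {
        let mid = first.start + (first.end - first.start) / 2;
        let mut second = first.clone();
        second.start = mid;
        first.end = mid;
        // Assumption: stats-based size is divided evenly between the halves.
        if let Some(bytes) = first.stats_projected_bytes {
            first.stats_projected_bytes = Some(bytes / 2);
            second.stats_projected_bytes = Some(bytes - bytes / 2);
        }
        // Assumption: the second half is made available to the next idle stream first.
        queue.push_front(second);
    }
    Some(first)
}
```

With a single 8 MiB file, 2 of 4 columns projected, and no statistics, repeated calls under this sketch would yield the byte ranges 0 to 4 MiB, 4 to 6 MiB, and 6 to 8 MiB: the last range has a projected size of 1 MiB, which is below the 2 MiB split threshold.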
