Dandandan opened a new pull request, #22924: URL: https://github.com/apache/datafusion/pull/22924
## Which issue does this PR close?

N/A.

## Rationale for this change

Shared file scans can lose parallelism at the tail of a Parquet scan when fewer unopened files remain than active output streams. This change lets Parquet split remaining large row groups into smaller, page-aligned morsels when sibling streams would otherwise go idle.

## What changes are included in this PR?

- Adds a `SplitHint` mechanism to shared file work sources and lets morselizers donate surplus ready morsels to sibling streams.
- Implements Parquet access-plan splitting by compressed-size target, including sub-row-group page-aligned row ranges when offset indexes are available.
- Adds the `datafusion.execution.parquet.morsel_split_size` read option and propagates it through docs, information schema output, and proto serialization.
- Adds unit coverage for shared work-source donation, access-plan splitting, and Parquet split-hint stream planning.

## Are these changes tested?

- `cargo fmt --all`
- `cargo clippy --all-targets --all-features -- -D warnings`
- `cargo test -p datafusion-datasource work_source --lib`
- `cargo test -p datafusion-datasource-parquet split --lib`
- `cargo test -p datafusion-proto-common parquet --lib`

## Are there any user-facing changes?

Yes. A new Parquet read configuration option, `datafusion.execution.parquet.morsel_split_size`, controls the target compressed byte size for tail-work morsel splitting. Setting it to `NULL` disables splitting.
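For readers unfamiliar with page-aligned splitting, the idea of cutting a row group into row ranges by a compressed-size target can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `PageInfo` and `split_page_aligned` are hypothetical names, and it assumes a single column whose page boundaries (as an offset index would provide them) are already known.

```rust
use std::ops::Range;

/// Hypothetical, simplified view of one data page taken from an offset index.
/// (Assumption for illustration; not a DataFusion or parquet-rs type.)
#[derive(Debug, Clone, Copy)]
struct PageInfo {
    /// Index of the first row in this page, relative to the row group.
    first_row: usize,
    /// Compressed size of the page in bytes.
    compressed_bytes: usize,
}

/// Greedily groups consecutive pages into row ranges. A range is closed once
/// its accumulated compressed size reaches `target_bytes`, or when the last
/// page is consumed. Every range starts and ends on a page boundary.
fn split_page_aligned(
    pages: &[PageInfo],
    total_rows: usize,
    target_bytes: usize,
) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start_row: Option<usize> = None;
    let mut acc_bytes = 0usize;

    for (i, page) in pages.iter().enumerate() {
        // Open a new range at this page if none is in progress.
        let start = *start_row.get_or_insert(page.first_row);
        acc_bytes += page.compressed_bytes;

        // The range ends where the next page begins, or at the end of the
        // row group for the final page.
        let end = pages.get(i + 1).map_or(total_rows, |next| next.first_row);
        let is_last = i + 1 == pages.len();

        if acc_bytes >= target_bytes || is_last {
            ranges.push(start..end);
            start_row = None;
            acc_bytes = 0;
        }
    }
    ranges
}
```

In this sketch, a smaller target yields more, smaller morsels, which gives idle sibling streams more units of work to pick up at the tail of a scan. A real row group has several columns with differing page boundaries, which this sketch deliberately ignores.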
