Dandandan opened a new pull request, #22924:
URL: https://github.com/apache/datafusion/pull/22924

   ## Which issue does this PR close?
   
   N/A.
   
   ## Rationale for this change
   
   Shared file scans can lose parallelism at the tail of a Parquet scan when 
fewer unopened files remain than active output streams. This change lets 
Parquet split remaining large row groups into smaller, page-aligned morsels 
when sibling streams would otherwise go idle.
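
As a rough illustration of the tail condition described above, the sketch below decides when a split would be worth requesting. The `SplitHint` enum shape and the `split_hint` function are hypothetical stand-ins for illustration, not the types added by this PR.

```rust
/// Hypothetical hint shape; the PR's real `SplitHint` may differ.
#[derive(Debug, PartialEq, Eq)]
enum SplitHint {
    /// Enough unopened files remain to keep every stream busy.
    KeepWhole,
    /// Some streams would go idle; splitting remaining work could feed them.
    Split { idle_streams: usize },
}

/// Assumed rule: request a split once fewer unopened files remain than
/// there are active output streams.
fn split_hint(unopened_files: usize, active_streams: usize) -> SplitHint {
    if active_streams > 1 && unopened_files < active_streams {
        SplitHint::Split {
            idle_streams: active_streams - unopened_files,
        }
    } else {
        SplitHint::KeepWhole
    }
}

fn main() {
    // Plenty of files left: no need to split.
    println!("{:?}", split_hint(16, 4));
    // One file left for four streams: three streams would otherwise idle.
    println!("{:?}", split_hint(1, 4));
}
```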
   
   ## What changes are included in this PR?
   
   - Adds a `SplitHint` mechanism to shared file work sources and lets 
morselizers donate surplus ready morsels to sibling streams.
   - Implements Parquet access-plan splitting by compressed-size target, 
including sub-row-group page-aligned row ranges when offset indexes are 
available.
   - Adds the `datafusion.execution.parquet.morsel_split_size` read option and 
propagates it through docs, information schema output, and proto serialization.
   - Adds unit coverage for shared work-source donation, access-plan splitting, 
and Parquet split-hint stream planning.
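
To make the page-aligned splitting idea concrete, here is a minimal sketch that greedily groups consecutive pages into row ranges of roughly a target compressed size. It assumes a single column's pages as they would be described by an offset index; `PageInfo` and `split_row_group` are illustrative names, and the PR's actual access-plan logic (which has to reconcile page boundaries across columns) is likely more involved.

```rust
use std::ops::Range;

/// One data page: the row it starts at and its compressed size.
/// Field names are illustrative, not taken from the Parquet crate.
#[derive(Debug, Clone, Copy)]
struct PageInfo {
    first_row: u64,
    compressed_bytes: u64,
}

/// Greedily close a row range whenever the accumulated compressed bytes
/// reach `target_bytes`. Ranges always end on a page boundary.
fn split_row_group(pages: &[PageInfo], total_rows: u64, target_bytes: u64) -> Vec<Range<u64>> {
    // Without page information (no offset index), keep the row group whole.
    if pages.is_empty() {
        return vec![0..total_rows];
    }
    let mut ranges = Vec::new();
    let mut start = 0u64;
    let mut acc = 0u64;
    for (i, page) in pages.iter().enumerate() {
        acc += page.compressed_bytes;
        let is_last = i + 1 == pages.len();
        if acc >= target_bytes || is_last {
            // End at the next page's first row, or at the end of the row group.
            let end = if is_last { total_rows } else { pages[i + 1].first_row };
            ranges.push(start..end);
            start = end;
            acc = 0;
        }
    }
    ranges
}

fn main() {
    let pages: Vec<PageInfo> = (0..4)
        .map(|i| PageInfo { first_row: i * 100, compressed_bytes: 400 })
        .collect();
    // Four 400-byte pages with an 800-byte target yield two row ranges.
    println!("{:?}", split_row_group(&pages, 400, 800));
}
```

A greedy pass keeps the sketch simple; it may leave a small final range, which a real implementation might prefer to merge into its predecessor.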
   
   ## Are these changes tested?
   
   Yes, with the following commands:
   
   - `cargo fmt --all`
   - `cargo clippy --all-targets --all-features -- -D warnings`
   - `cargo test -p datafusion-datasource work_source --lib`
   - `cargo test -p datafusion-datasource-parquet split --lib`
   - `cargo test -p datafusion-proto-common parquet --lib`
   
   ## Are there any user-facing changes?
   
   Yes. A new Parquet read configuration option, 
`datafusion.execution.parquet.morsel_split_size`, controls the target 
compressed byte size for tail-work morsel splitting. Setting it to `NULL` 
disables splitting.
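
The `NULL`-disables semantics map naturally onto an optional byte count. The sketch below shows one way such a value could be interpreted; `parse_morsel_split_size` is a hypothetical helper for illustration and is not DataFusion's actual configuration parser.

```rust
/// Interpret a raw option value: `NULL` (any case) disables splitting,
/// otherwise the value is a target compressed size in bytes.
/// Hypothetical helper, not part of DataFusion.
fn parse_morsel_split_size(raw: &str) -> Result<Option<u64>, String> {
    let trimmed = raw.trim();
    if trimmed.eq_ignore_ascii_case("null") {
        return Ok(None);
    }
    trimmed
        .parse::<u64>()
        .map(Some)
        .map_err(|e| format!("invalid morsel_split_size {:?}: {}", trimmed, e))
}

fn main() {
    // A byte target enables tail-work splitting.
    println!("{:?}", parse_morsel_split_size("1048576"));
    // NULL turns splitting off.
    println!("{:?}", parse_morsel_split_size("NULL"));
}
```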
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


