Re: [PR] Add declared file scan output partitioning [datafusion]

via GitHub Mon, 22 Jun 2026 07:59:05 -0700


gene-bordegaray commented on code in PR #22657:
URL: https://github.com/apache/datafusion/pull/22657#discussion_r3453233535



##########
datafusion/datasource/src/file_scan_config/mod.rs:
##########
@@ -198,14 +198,24 @@ pub struct FileScanConfig {
     /// would be incorrect if there are filters being applied, thus this should be accessed
     /// via [`FileScanConfig::statistics`].
     pub(crate) statistics: Statistics,
-    /// When true, file_groups are organized by partition column values
-    /// and output_partitioning will return Hash partitioning on partition columns.
-    /// This allows the optimizer to skip hash repartitioning for aggregates and joins
-    /// on partition columns.
+    /// When true, `file_groups` are organized by partition column values and
+    /// [`Self::output_partitioning`] derives hash partitioning on those columns.
+    /// This allows the optimizer to skip hash repartitioning for aggregates and
+    /// joins on partition columns.
+    ///
+    /// Because grouping is by whole file, this may reduce I/O parallelism when
+    /// partition sizes are uneven.

Review Comment:
   I thought this covers the docs I cleaned up here: 
https://github.com/apache/datafusion/pull/22657/changes#diff-a07222d670257887f5118197c485861c96635e2da6c2bf0007d2c21dda7df82aL709



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


