Re: [PR] feat: support file-level parquet row selections [datafusion]

via GitHub Wed, 17 Jun 2026 23:33:41 -0700


kosiew commented on code in PR #22940:
URL: https://github.com/apache/datafusion/pull/22940#discussion_r3433606677



##########
datafusion/datasource-parquet/src/access_plan.rs:
##########
@@ -169,6 +204,110 @@ impl ParquetAccessPlan {
         }
     }
 
+    /// Create a new `ParquetAccessPlan` from a file-level [`RowSelection`].
+    ///
+    /// The selection is interpreted across all rows in the file, in row group
+    /// order, and is split into row-group level access using 
`row_group_meta_data`.
+    /// Fully skipped row groups become [`RowGroupAccess::Skip`], fully 
selected
+    /// row groups become [`RowGroupAccess::Scan`], and partially selected row
+    /// groups become [`RowGroupAccess::Selection`].
+    ///
+    /// # Errors
+    ///
+    /// Returns an error if the selection does not specify exactly the same
+    /// number of rows as the file metadata.
+    pub fn try_new_from_overall_row_selection(
+        selection: RowSelection,
+        row_group_meta_data: &[RowGroupMetaData],
+    ) -> Result<Self> {
+        let selectors: Vec<RowSelector> = selection.into();

Review Comment:
   The performance concern around `RowSelection::split_off` makes sense. That 
said, this manual `current` / `leading` / `mixed` cursor logic is a bit harder 
to audit, especially since the full selector vector is still materialized up 
front.
   
   If we keep this version, it would be helpful to add a short comment or 
benchmark note explaining the measured reason for the manual approach. 
Otherwise, the earlier `split_off` shape feels easier to reason about because 
it directly captures the boundary-splitting invariant.



##########
datafusion/core/tests/parquet/external_access_plan.rs:
##########


Review Comment:
   Nice improvement wiring the `ParquetRowSelection` through `PartitionedFile`. 
One thing that would make this test stronger is checking the actual rows 
returned, not only the row count.
   
   Right now, a bug that selected the wrong row or rows with the same 
cardinality could still pass. Could we add expected pretty output, expected 
column values, or extend `TestFull` so these row-selection cases can assert the 
returned batches?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: support file-level parquet row selections [datafusion]

Reply via email to