haohuaijin commented on code in PR #22940:
URL: https://github.com/apache/datafusion/pull/22940#discussion_r3432780300
##########
datafusion/datasource-parquet/src/access_plan.rs:
##########
@@ -169,6 +204,110 @@ impl ParquetAccessPlan {
}
}
+ /// Create a new `ParquetAccessPlan` from a file-level [`RowSelection`].
+ ///
+ /// The selection is interpreted across all rows in the file, in row group
+ /// order, and is split into row-group level access using `row_group_meta_data`.
+ /// Fully skipped row groups become [`RowGroupAccess::Skip`], fully selected
+ /// row groups become [`RowGroupAccess::Scan`], and partially selected row
+ /// groups become [`RowGroupAccess::Selection`].
+ ///
+ /// # Errors
+ ///
+ /// Returns an error if the selection does not specify exactly the same
+ /// number of rows as the file metadata.
+ pub fn try_new_from_overall_row_selection(
+ selection: RowSelection,
+ row_group_meta_data: &[RowGroupMetaData],
+ ) -> Result<Self> {
+ let selectors: Vec<RowSelector> = selection.into();
Review Comment:
Thanks for your detailed suggestion. The main reason I wrote it this way is
that using `split_off` allocates new memory and also traverses the row
groups additional times (once each for `split_off`, `row_count`, and
`skipped_row_count`). I can write a benchmark later to measure how much
impact this has.
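
To illustrate the single-pass idea, here is a minimal sketch. It is not the code in this PR: `Selector`, `Access`, and `split_by_row_group` are hypothetical stand-ins for `RowSelector`, `RowGroupAccess`, and the new constructor, and row groups are reduced to plain row counts. It walks the selectors once and carries the unconsumed tail of a selector across a row-group boundary, so no intermediate selection is split off.

```rust
// Hypothetical stand-in for parquet's `RowSelector` (not the real type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Selector {
    row_count: usize,
    skip: bool,
}

// Hypothetical stand-in for `RowGroupAccess` (not the real type).
#[derive(Debug, PartialEq)]
enum Access {
    Skip,
    Scan,
    Selection(Vec<Selector>),
}

/// Split a file-level selection into per-row-group access in one pass.
/// `row_group_sizes` holds the number of rows in each row group.
fn split_by_row_group(
    selectors: &[Selector],
    row_group_sizes: &[usize],
) -> Result<Vec<Access>, String> {
    let mut out = Vec::with_capacity(row_group_sizes.len());
    let mut iter = selectors.iter().copied();
    // Unconsumed tail of a selector that crossed a row-group boundary.
    let mut current: Option<Selector> = None;

    for &group_rows in row_group_sizes {
        let mut remaining = group_rows;
        let mut selected = 0usize;
        let mut parts: Vec<Selector> = Vec::new();

        while remaining > 0 {
            let sel = match current.take().or_else(|| iter.next()) {
                Some(s) => s,
                None => {
                    return Err("selection covers fewer rows than the file".to_string())
                }
            };
            if sel.row_count == 0 {
                continue;
            }
            let take = sel.row_count.min(remaining);
            if !sel.skip {
                selected += take;
            }
            // Merge with the previous piece when it is of the same kind.
            match parts.last_mut() {
                Some(last) if last.skip == sel.skip => last.row_count += take,
                _ => parts.push(Selector { row_count: take, skip: sel.skip }),
            }
            remaining -= take;
            if sel.row_count > take {
                current = Some(Selector {
                    row_count: sel.row_count - take,
                    skip: sel.skip,
                });
            }
        }

        out.push(if selected == 0 {
            Access::Skip
        } else if selected == group_rows {
            Access::Scan
        } else {
            Access::Selection(parts)
        });
    }

    // Anything left over means the selection is longer than the file.
    let leftover = current.map_or(0, |s| s.row_count)
        + iter.map(|s| s.row_count).sum::<usize>();
    if leftover != 0 {
        return Err("selection covers more rows than the file".to_string());
    }
    Ok(out)
}
```

For example, with three row groups of 100 rows each and the selection "skip 100, select 150, skip 50", the first group is skipped, the second is scanned in full, and the third gets a partial selection (select 50, skip 50). Whether this is measurably faster than the `split_off` approach is exactly what the benchmark should show.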
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]