tustvold commented on a change in pull request #2048:
URL: https://github.com/apache/arrow-datafusion/pull/2048#discussion_r832334400
##########
File path: datafusion/src/physical_plan/file_format/parquet.rs
##########
@@ -418,31 +407,33 @@ impl<'a> PruningStatistics for
RowGroupPruningStatistics<'a> {
fn build_row_group_predicate(
pruning_predicate: &PruningPredicate,
metrics: ParquetFileMetrics,
- row_group_metadata: &[RowGroupMetaData],
-) -> Box<dyn Fn(&RowGroupMetaData, usize) -> bool> {
- let parquet_schema = pruning_predicate.schema().as_ref();
-
- let pruning_stats = RowGroupPruningStatistics {
- row_group_metadata,
- parquet_schema,
- };
- let predicate_values = pruning_predicate.prune(&pruning_stats);
-
- match predicate_values {
- Ok(values) => {
- // NB: false means don't scan row group
- let num_pruned = values.iter().filter(|&v| !*v).count();
- metrics.row_groups_pruned.add(num_pruned);
- Box::new(move |_, i| values[i])
- }
- // stats filter array could not be built
- // return a closure which will not filter out any row groups
- Err(e) => {
- debug!("Error evaluating row group predicate values {}", e);
- metrics.predicate_evaluation_errors.add(1);
- Box::new(|_r, _i| true)
- }
- }
+) -> Box<dyn FnMut(&RowGroupMetaData, usize) -> bool> {
+ let pruning_predicate = pruning_predicate.clone();
+ Box::new(
+ move |row_group_metadata: &RowGroupMetaData, _i: usize| -> bool {
+ let parquet_schema = pruning_predicate.schema().as_ref();
+ let pruning_stats = RowGroupPruningStatistics {
+ row_group_metadata,
+ parquet_schema,
+ };
+ let predicate_values = pruning_predicate.prune(&pruning_stats);
Review comment:
That would definitely be one option, but I'm not sure why it needs to be
lazy. `SerializedFileReader` already exposes the `ParquetMetadata` which in
turn exposes the `[RowGroupMetaData]`. Why wouldn't the caller just specify the
row groups to scan, much like it specifies the column indexes for a projection?
Would this not be both simpler and more flexible?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]