suremarc commented on code in PR #13296: URL: https://github.com/apache/datafusion/pull/13296#discussion_r1833111049
########## datafusion/core/src/datasource/physical_plan/file_scan_config.rs: ########## @@ -310,22 +310,12 @@ impl FileScanConfig { sort_order: &LexOrdering, ) -> Result<Vec<Vec<PartitionedFile>>> { let flattened_files = file_groups.iter().flatten().collect::<Vec<_>>(); - // First Fit: - // * Choose the first file group that a file can be placed into. - // * If it fits into no existing file groups, create a new one. - // - // By sorting files by min values and then applying first-fit bin packing, - // we can produce the smallest number of file groups such that - // files within a group are in order and non-overlapping. - // - // Source: Applied Combinatorics (Keller and Trotter), Chapter 6.8 - // https://www.appliedcombinatorics.org/book/s_posets_dilworth-intord.html Review Comment: I moved this and the relevant code into a new method, `MinMaxStatistics::first_fit` ########## datafusion/physical-plan/src/statistics.rs: ########## @@ -131,14 +95,35 @@ impl MinMaxStatistics { .collect::<Vec<_>>(), }; + let projected_statistics = statistics + .into_iter() + .cloned() + .map(|s| s.project(Some(&projection))) + .collect::<Vec<_>>(); + + // Helper function to get min/max statistics + let get_min_max = |i: usize| -> Result<(Vec<ScalarValue>, Vec<ScalarValue>)> { + Ok(projected_statistics + .iter() Review Comment: I moved this code later so it uses the _projected_ statistics, i.e. it only relies on stats for sorting columns. I was seeing the code error because some columns had unknown statistics. ########## datafusion/physical-plan/src/execution_plan.rs: ########## @@ -397,6 +397,13 @@ pub trait ExecutionPlan: Debug + DisplayAs + Send + Sync { Ok(Statistics::new_unknown(&self.schema())) } + fn statistics_by_partition(&self) -> Result<Vec<Statistics>> { + Ok(vec![ + self.statistics()?; + self.properties().partitioning.partition_count() + ]) + } + Review Comment: As stated in the PR description, this is what a proposed API would look like for statistics by partition, though it is certainly not final. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org