blaginin commented on code in PR #14685:
URL: https://github.com/apache/datafusion/pull/14685#discussion_r1974936267
##########
datafusion/core/src/datasource/physical_plan/file_scan_config.rs:
##########
@@ -345,6 +345,32 @@ impl FileScanConfig {
/// Set the projection of the files
pub fn with_projection(mut self, projection: Option<Vec<usize>>) -> Self {
self.projection = projection;
+ self.with_updated_statistics()
+ }
+
+ // Update source statistics with the current projection data
+ fn with_updated_statistics(mut self) -> Self {
+ let max_projection_column = *self
+ .projection
+ .as_ref()
+ .and_then(|proj| proj.iter().max())
+ .unwrap_or(&0);
+
+ if max_projection_column
+ >= self.file_schema.fields().len() + self.table_partition_cols.len()
+ {
+ // we don't yet have enough information (file schema info or partition column info) to perform projection
+ return self;
+ }
+
+ let (
+ _projected_schema,
+ _constraints,
+ projected_statistics,
+ _projected_output_ordering,
+ ) = self.project();
+
+ self.source = self.source.with_statistics(projected_statistics);
Review Comment:
That's a great idea! We can't use `self.source.statistics` directly, because
the statistics need to match the projection we're applying, so I had to apply
the projection first
(https://github.com/apache/datafusion/pull/14685/commits/89ed225dcbe97ce9e9d1245d12e05637e3637f35)
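
To illustrate the point about statistics having to line up with the projection, here is a minimal standalone sketch. It does not use DataFusion's real types: `ColumnStats` and `project_stats` are hypothetical names invented for this example, and the out-of-range check only loosely mirrors the early return in the diff above.

```rust
// Hypothetical, simplified stand-in for per-column statistics.
// This is NOT DataFusion's `Statistics` type.
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    name: &'static str,
    null_count: Option<usize>,
}

/// Pick the statistics of the projected columns, in projection order.
/// Returns `None` if any index is out of range, i.e. when there is not
/// yet enough column information to perform the projection.
fn project_stats(
    all: &[ColumnStats],
    projection: Option<&[usize]>,
) -> Option<Vec<ColumnStats>> {
    match projection {
        // No projection: every column is kept as-is.
        None => Some(all.to_vec()),
        // Collecting into `Option<Vec<_>>` yields `None` on the first
        // out-of-range index.
        Some(indices) => indices.iter().map(|&i| all.get(i).cloned()).collect(),
    }
}

fn main() {
    let stats = vec![
        ColumnStats { name: "a", null_count: Some(0) },
        ColumnStats { name: "b", null_count: None },
        ColumnStats { name: "c", null_count: Some(7) },
    ];

    // Projecting columns [2, 0] reorders and narrows the statistics.
    let projection: Vec<usize> = vec![2, 0];
    let projected = project_stats(&stats, Some(projection.as_slice()))
        .expect("indices are in range");
    let names: Vec<&str> = projected.iter().map(|c| c.name).collect();
    println!("{:?}", names);

    // An index past the known columns cannot be projected yet.
    let too_far: Vec<usize> = vec![5];
    assert!(project_stats(&stats, Some(too_far.as_slice())).is_none());
}
```

Under these assumptions, the unprojected vector describes three columns while the projected output has only two, in a different order, which is why the unprojected statistics cannot simply be reused.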