berkaysynnada commented on code in PR #15539: URL: https://github.com/apache/datafusion/pull/15539#discussion_r2039125932
########## datafusion/datasource/src/statistics.rs: ########## @@ -519,29 +526,30 @@ pub fn compute_all_files_statistics( file_schema: SchemaRef, collect_stats: bool, inexact_stats: bool, -) -> Result<(Vec<FileGroup>, Statistics)> { +) -> (Vec<FileGroup>, Statistics) { + if !collect_stats { Review Comment: This is the other strange part. compute_summary_statistics() is called from 2 places: one of them calls it as true, and the other place is here. But here we also know that it cannot be false ########## datafusion/datasource/src/statistics.rs: ########## @@ -410,23 +410,24 @@ pub async fn get_statistics_with_limit( } /// Generic function to compute statistics across multiple items that have statistics -fn compute_summary_statistics<T, I>( +/// If `items` is empty or all items don't have statistics, it returns `None`. +pub fn compute_summary_statistics<T, I>( items: I, - file_schema: &SchemaRef, Review Comment: The files in compute_summary_statistics() do all have the same schema? Why do you say that sometimes we don't have the schema? ########## datafusion/datasource/src/statistics.rs: ########## @@ -479,22 +484,24 @@ where /// * `collect_stats` - Whether to collect statistics (if false, returns original file group) /// /// # Returns -/// A new file group with summary statistics attached +/// A new file group with summary statistics attached if there is statistics pub fn compute_file_group_statistics( - file_group: FileGroup, - file_schema: SchemaRef, + mut file_group: FileGroup, collect_stats: bool, Review Comment: Having this collect_stats argument is strange, isn't it? We can check it from the caller side and not call the function if it is not intended at all? ########## datafusion/datasource/src/statistics.rs: ########## @@ -410,23 +410,24 @@ pub async fn get_statistics_with_limit( } /// Generic function to compute statistics across multiple items that have statistics -fn compute_summary_statistics<T, I>( +/// If `items` is empty or all items don't have statistics, it returns `None`. Review Comment: My intention was this: when we call repartition_file_groups(), the file groups have the exact statistics if the sources provide so. While repartitioning them by preserving order or according to the size, we can know which PartitionedFile goes which group at the end. For example, lets say FileA has 2 chunks(min1: 0 max1: 10, min2: 15 max2: 20)and FileB has one(min1: 7 max1: 17). During the repartitioning, we can observe how these files are repartitioned, and let's say the final state is Partition1: [FileA/chunk1 + FileB] Partition2: [FileA/chunk2]. Then, we can know that these partitions have these exact statistics: Partition1: [min:0 max:17] and Partition2: [min: 15 max:20]. We don't need to recompute these. Am I missing some points? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org