Re: [PR] Fix: after repartitioning, the `PartitionedFile` and `FileGroup` statistics should be inexact/recomputed [datafusion]

via GitHub Fri, 11 Apr 2025 02:59:52 -0700


berkaysynnada commented on code in PR #15539:
URL: https://github.com/apache/datafusion/pull/15539#discussion_r2039125932



##########
datafusion/datasource/src/statistics.rs:
##########
@@ -519,29 +526,30 @@ pub fn compute_all_files_statistics(
     file_schema: SchemaRef,
     collect_stats: bool,
     inexact_stats: bool,
-) -> Result<(Vec<FileGroup>, Statistics)> {
+) -> (Vec<FileGroup>, Statistics) {
+    if !collect_stats {

Review Comment:
   This is the other strange part. compute_summary_statistics() is called from 
2 places: one of them calls it as true, and the other place is here. But here 
we also know that it cannot be false



##########
datafusion/datasource/src/statistics.rs:
##########
@@ -410,23 +410,24 @@ pub async fn get_statistics_with_limit(
 }
 
 /// Generic function to compute statistics across multiple items that have 
statistics
-fn compute_summary_statistics<T, I>(
+/// If `items` is empty or all items don't have statistics, it returns `None`.
+pub fn compute_summary_statistics<T, I>(
     items: I,
-    file_schema: &SchemaRef,

Review Comment:
   The files in compute_summary_statistics() do all have the same schema?  Why 
do you say that sometimes we don't have the schema?



##########
datafusion/datasource/src/statistics.rs:
##########
@@ -479,22 +484,24 @@ where
 /// * `collect_stats` - Whether to collect statistics (if false, returns 
original file group)
 ///
 /// # Returns
-/// A new file group with summary statistics attached
+/// A new file group with summary statistics attached if there is statistics
 pub fn compute_file_group_statistics(
-    file_group: FileGroup,
-    file_schema: SchemaRef,
+    mut file_group: FileGroup,
     collect_stats: bool,

Review Comment:
   Having this collect_stats argument is strange, isn't it? We can check it 
from the caller side and not call the function if it is not intended at all?



##########
datafusion/datasource/src/statistics.rs:
##########
@@ -410,23 +410,24 @@ pub async fn get_statistics_with_limit(
 }
 
 /// Generic function to compute statistics across multiple items that have 
statistics
-fn compute_summary_statistics<T, I>(
+/// If `items` is empty or all items don't have statistics, it returns `None`.

Review Comment:
   My intention was this: when we call repartition_file_groups(), the file 
groups have the exact statistics if the sources provide so. While 
repartitioning them by preserving order or according to the size, we can know 
which PartitionedFile goes which group at the end. For example, lets say FileA 
has 2 chunks(min1: 0 max1: 10, min2: 15 max2: 20)and FileB has one(min1: 7 
max1: 17). During the repartitioning, we can observe how these files are 
repartitioned, and let's say the final state is Partition1: [FileA/chunk1 + 
FileB] Partition2: [FileA/chunk2]. Then, we can know that these partitions have 
these exact statistics:
   Partition1: [min:0 max:17] and Partition2: [min: 15 max:20]. We don't need 
to recompute these. 
   
   Am I missing some points?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Fix: after repartitioning, the `PartitionedFile` and `FileGroup` statistics should be inexact/recomputed [datafusion]

Reply via email to