berkaysynnada commented on code in PR #15539:
URL: https://github.com/apache/datafusion/pull/15539#discussion_r2039125932
##########
datafusion/datasource/src/statistics.rs:
##########
@@ -519,29 +526,30 @@ pub fn compute_all_files_statistics(
file_schema: SchemaRef,
collect_stats: bool,
inexact_stats: bool,
-) -> Result<(Vec<FileGroup>, Statistics)> {
+) -> (Vec<FileGroup>, Statistics) {
+ if !collect_stats {
Review Comment:
This is the other strange part. compute_summary_statistics() is called from
2 places: one of them calls it as true, and the other place is here. But here
we also know that it cannot be false
##########
datafusion/datasource/src/statistics.rs:
##########
@@ -410,23 +410,24 @@ pub async fn get_statistics_with_limit(
}
/// Generic function to compute statistics across multiple items that have
statistics
-fn compute_summary_statistics<T, I>(
+/// If `items` is empty or all items don't have statistics, it returns `None`.
+pub fn compute_summary_statistics<T, I>(
items: I,
- file_schema: &SchemaRef,
Review Comment:
The files in compute_summary_statistics() do all have the same schema? Why
do you say that sometimes we don't have the schema?
##########
datafusion/datasource/src/statistics.rs:
##########
@@ -479,22 +484,24 @@ where
/// * `collect_stats` - Whether to collect statistics (if false, returns
original file group)
///
/// # Returns
-/// A new file group with summary statistics attached
+/// A new file group with summary statistics attached if there is statistics
pub fn compute_file_group_statistics(
- file_group: FileGroup,
- file_schema: SchemaRef,
+ mut file_group: FileGroup,
collect_stats: bool,
Review Comment:
Having this collect_stats argument is strange, isn't it? We can check it
from the caller side and not call the function if it is not intended at all?
##########
datafusion/datasource/src/statistics.rs:
##########
@@ -410,23 +410,24 @@ pub async fn get_statistics_with_limit(
}
/// Generic function to compute statistics across multiple items that have
statistics
-fn compute_summary_statistics<T, I>(
+/// If `items` is empty or all items don't have statistics, it returns `None`.
Review Comment:
My intention was this: when we call repartition_file_groups(), the file
groups have the exact statistics if the sources provide so. While
repartitioning them by preserving order or according to the size, we can know
which PartitionedFile goes which group at the end. For example, lets say FileA
has 2 chunks(min1: 0 max1: 10, min2: 15 max2: 20)and FileB has one(min1: 7
max1: 17). During the repartitioning, we can observe how these files are
repartitioned, and let's say the final state is Partition1: [FileA/chunk1 +
FileB] Partition2: [FileA/chunk2]. Then, we can know that these partitions have
these exact statistics:
Partition1: [min:0 max:17] and Partition2: [min: 15 max:20]. We don't need
to recompute these.
Am I missing some points?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]