kosiew commented on code in PR #21149:
URL: https://github.com/apache/datafusion/pull/21149#discussion_r3025756006


##########
datafusion/datasource/src/statistics.rs:
##########
@@ -320,16 +320,19 @@ pub async fn get_statistics_with_limit(
         file.statistics = Some(Arc::clone(&file_stats));
         result_files.push(file);
 
-        // First file, we set them directly from the file statistics.
-        num_rows = file_stats.num_rows;
-        total_byte_size = file_stats.total_byte_size;
-        for (index, file_column) in
-            file_stats.column_statistics.clone().into_iter().enumerate()
-        {
-            col_stats_set[index].null_count = file_column.null_count;
-            col_stats_set[index].max_value = file_column.max_value;
-            col_stats_set[index].min_value = file_column.min_value;
-            col_stats_set[index].sum_value = file_column.sum_value.cast_to_sum_type();
+        if collect_stats {

Review Comment:
   Thanks for fixing the returned summary here. I think this introduces a 
separate issue in the `limit` path though.
   
   `get_statistics_with_limit` still relies on `num_rows` to decide whether it 
can stop early. With `collect_stats=false`, `num_rows` now stays `Absent`, so 
`conservative_num_rows` falls back to `usize::MIN` and we keep scanning the 
stream instead of stopping early.
   
   Before this change, a known first-file row count could still let us stop 
after the first file. With this patch, even a `limit=0` query can keep reading 
later files.
   
   Could we preserve a separate row-count accumulator, or at least keep the 
first-file row count around for the limit check, while still returning a bare 
`Statistics` value when `collect_stats` is false? I think we should also add a 
regression test for `collect_stats=false` with a limit so this behavior is 
locked in.
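   A rough sketch of the accumulator idea. Names here (`files_needed_for_limit`, this simplified `Precision` enum) are invented for illustration and are not DataFusion's actual types; the point is only that the row count is tracked independently of whether full stats are collected, so the conservative limit check can still short-circuit:

```rust
// Simplified stand-in for a row-count precision value; the real
// DataFusion type has more variants, this is illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Precision {
    Exact(usize),
    Absent,
}

impl Precision {
    // Conservative lower bound used for the early-stop check.
    // Absent falls back to usize::MIN, which never stops early.
    fn conservative(self) -> usize {
        match self {
            Precision::Exact(n) => n,
            Precision::Absent => usize::MIN,
        }
    }
}

/// Hypothetical helper: returns how many files must be scanned to
/// satisfy `limit`, accumulating row counts even when per-column
/// stats collection is disabled.
fn files_needed_for_limit(file_row_counts: &[usize], limit: usize) -> usize {
    let mut seen_rows = Precision::Absent;
    let mut files_read = 0;
    for &rows in file_row_counts {
        files_read += 1;
        seen_rows = match seen_rows {
            Precision::Exact(n) => Precision::Exact(n + rows),
            Precision::Absent => Precision::Exact(rows),
        };
        if seen_rows.conservative() >= limit {
            break; // enough known rows; stop scanning further files
        }
    }
    files_read
}
```

   With this shape, a known first-file count of 100 rows stops a `limit=0` (or `limit<=100`) scan after one file, which is the pre-patch behavior we'd want to preserve.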



##########
datafusion/datasource/src/statistics.rs:
##########
@@ -320,16 +320,19 @@ pub async fn get_statistics_with_limit(
         file.statistics = Some(Arc::clone(&file_stats));
         result_files.push(file);
 
-        // First file, we set them directly from the file statistics.
-        num_rows = file_stats.num_rows;
-        total_byte_size = file_stats.total_byte_size;
-        for (index, file_column) in
-            file_stats.column_statistics.clone().into_iter().enumerate()
-        {
-            col_stats_set[index].null_count = file_column.null_count;
-            col_stats_set[index].max_value = file_column.max_value;
-            col_stats_set[index].min_value = file_column.min_value;
-            col_stats_set[index].sum_value = file_column.sum_value.cast_to_sum_type();
+        if collect_stats {

Review Comment:
   Small refactor suggestion. It may be worth extracting the "seed stats from 
first file" logic into a helper that can optionally populate the summary 
accumulator.
   
   This block and the later merge loop now both need to stay in sync on fields 
like `total_byte_size` and the `sum_value` casting. Pulling that into one place 
would make future stats-field additions a bit less likely to drift.
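   Something along these lines, with heavily simplified stand-in types (`ColStats`, `FileStats`, `seed_from_first_file` are all invented for the sketch, not the real DataFusion structs):

```rust
// Simplified stand-ins for the real column/file statistics structs.
#[derive(Clone, Debug, PartialEq)]
struct ColStats {
    null_count: usize,
    min_value: Option<i64>,
    max_value: Option<i64>,
    sum_value: Option<i64>,
}

#[derive(Clone, Debug, PartialEq)]
struct FileStats {
    num_rows: usize,
    total_byte_size: usize,
    column_statistics: Vec<ColStats>,
}

/// Hypothetical helper: seed the summary accumulator from the first
/// file's statistics. `collect_stats` gates the per-column copy, but
/// the row count and byte size are always returned so the limit check
/// keeps working either way.
fn seed_from_first_file(
    file: &FileStats,
    collect_stats: bool,
) -> (usize, usize, Vec<ColStats>) {
    let cols = if collect_stats {
        file.column_statistics.clone()
    } else {
        Vec::new()
    };
    (file.num_rows, file.total_byte_size, cols)
}
```

   The merge loop could then call the same helper (or a sibling of it) for subsequent files, so a new stats field only has to be wired up once.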



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

