zhuqi-lucas commented on code in PR #21956:
URL: https://github.com/apache/datafusion/pull/21956#discussion_r3246109227


##########
datafusion/datasource-parquet/src/source.rs:
##########
@@ -482,6 +485,107 @@ impl ParquetSource {
     pub(crate) fn reverse_row_groups(&self) -> bool {
         self.reverse_row_groups
     }
+
+    /// Extract the (column name, descending) tuple used by file-level
+    /// reordering. Driven entirely from the sort-pushdown channel
+    /// (`sort_order_for_reorder` + `reverse_row_groups`) — set by
+    /// `try_pushdown_sort`. We do not consult any dynamic-filter
+    /// metadata here: `DynamicFilterPhysicalExpr` is for runtime
+    /// threshold pruning, not for telling the source how to schedule
+    /// reads.
+    fn extract_topk_sort_info(&self) -> Option<(String, bool)> {
+        let sort_order = self.sort_order_for_reorder.as_ref()?;
+        let first = sort_order.first();
+        let col = first
+            .expr
+            .downcast_ref::<datafusion_physical_expr::expressions::Column>()?;
+        Some((col.name().to_string(), self.reverse_row_groups))
+    }
+
+    /// Extract the sort key from a file's statistics for reordering.
+    fn sort_key_for_file(
+        file: &datafusion_datasource::PartitionedFile,
+        col_idx: usize,
+        descending: bool,
+    ) -> Option<datafusion_common::ScalarValue> {
+        let stats = file.statistics.as_ref()?;
+        let col_stats = stats.column_statistics.get(col_idx)?;
+        if descending {
+            col_stats.min_value.get_value().cloned()
+        } else {
+            col_stats.max_value.get_value().cloned()
+        }
+    }
+}
+
+/// Threshold (fraction in `[0, 1]`) for the overlap guard in
+/// [`ParquetSource::reorder_files`]. When at least this fraction of
+/// adjacent file pairs (in sorted-by-min order) have overlapping
+/// `[min, max]` ranges, file reorder is skipped — file-level pruning
+/// cannot help and the reorder cost would dominate.

Review Comment:
   I think it came from the benchmark data from the previous PR trigger. I
can remove it and trigger the benchmark again to check.
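
For reference, the overlap guard described in the doc comment above could be sketched roughly as follows. This is only an illustration under assumptions: `overlap_fraction` and `should_skip_reorder` are hypothetical names, ranges are simplified to `(i64, i64)` pairs instead of `ScalarValue` statistics, and the threshold is passed in as a parameter because its actual value is not shown in this hunk.

```rust
/// Hypothetical helper (not from the PR): fraction of adjacent file pairs,
/// in sorted-by-min order, whose `[min, max]` ranges overlap.
fn overlap_fraction(ranges: &[(i64, i64)]) -> f64 {
    // Fewer than two files means there are no adjacent pairs to compare.
    if ranges.len() < 2 {
        return 0.0;
    }
    // Sort a copy by the min value so neighbours are adjacent in key space.
    let mut sorted = ranges.to_vec();
    sorted.sort_by_key(|&(min, _)| min);
    // A pair overlaps when the next file's min is not past the previous max.
    let overlapping = sorted
        .windows(2)
        .filter(|pair| pair[1].0 <= pair[0].1)
        .count();
    overlapping as f64 / (sorted.len() - 1) as f64
}

/// Hypothetical guard: skip the reorder when at least `threshold`
/// (a fraction in `[0, 1]`) of adjacent pairs overlap.
fn should_skip_reorder(ranges: &[(i64, i64)], threshold: f64) -> bool {
    overlap_fraction(ranges) >= threshold
}
```

With a guard of this shape, the threshold is the only tunable, so rerunning the benchmark with and without it should show whether it actually matters.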



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


