alamb commented on code in PR #22300:
URL: https://github.com/apache/datafusion/pull/22300#discussion_r3305847354


##########
datafusion/catalog/src/table.rs:
##########
@@ -467,6 +469,27 @@ impl<'a> ScanArgs<'a> {
     pub fn limit(&self) -> Option<usize> {
         self.limit
     }
+
+    /// Set the statistics the caller would like the provider to answer for
+    /// this scan, if it can do so cheaply.

Review Comment:
   ```suggestion
       /// Specifies the statistics the caller may use when optimizing the 
query.
       ///
       /// This is intended for to allow the provider to cheaply provide 
statistics that may help
       /// such as those it has in an in memory catalog or from some other 
metadata source. 
   ```



##########
datafusion/catalog/src/table.rs:
##########
@@ -467,6 +469,27 @@ impl<'a> ScanArgs<'a> {
     pub fn limit(&self) -> Option<usize> {
         self.limit
     }
+
+    /// Set the statistics the caller would like the provider to answer for
+    /// this scan, if it can do so cheaply.
+    ///
+    /// Providers read these via [`Self::statistics_requests()`]; anything a

Review Comment:
   What i the Provider here? Do you mean "Table Provider"?



##########
datafusion/expr/src/logical_plan/plan.rs:
##########
@@ -2793,6 +2794,19 @@ pub struct TableScan {
     pub filters: Vec<Expr>,
     /// Optional number of rows to read
     pub fetch: Option<usize>,
+    /// Statistics the planner would like the provider to answer for this
+    /// scan, typically attached by a custom optimizer rule from the
+    /// surrounding plan shape (e.g. Min/Max for sort keys). Threaded into
+    /// the table provider via `ScanArgs::statistics_requests` at
+    /// physical-planning time. Advisory and empty by default; DataFusion's
+    /// own rules never populate it.
+    ///
+    /// A [`BTreeSet`], not a `Vec`: optimizer rules run to a fixpoint, so the
+    /// collection must be idempotent under re-derivation — a rule `insert`s
+    /// the requests it wants and re-running is a no-op (and composes with
+    /// other rules) without each rule having to dedupe. The ordering also
+    /// keeps the resulting plan deterministic.

Review Comment:
   I think this could be made significantly more concise and not lose any 
meaning
   ```suggestion
       /// Statistics the planner would like the provider to answer for this
       /// scan, typically attached by a custom optimizer rule from the
       /// surrounding plan (e.g. Min/Max for sort keys). 
       ///
       /// A [`BTreeSet`], not a `Vec` to keep the resulting plan deterministic.
   ```



##########
datafusion/expr/src/logical_plan/plan.rs:
##########
@@ -2867,21 +2881,100 @@ impl Hash for TableScan {
 impl TableScan {
     /// Initialize TableScan with appropriate schema from the given
     /// arguments.
+    #[deprecated(since = "54.0.0", note = "use `TableScanBuilder` instead")]
     pub fn try_new(
         table_name: impl Into<TableReference>,
         table_source: Arc<dyn TableSource>,
         projection: Option<Vec<usize>>,
         filters: Vec<Expr>,
         fetch: Option<usize>,
     ) -> Result<Self> {
-        let table_name = table_name.into();
+        TableScanBuilder::new(table_name, table_source)
+            .with_projection(projection)
+            .with_filters(filters)
+            .with_fetch(fetch)
+            .build()
+    }
+}
+
+/// Builder for [`TableScan`].

Review Comment:
   this is very nice



##########
datafusion/catalog/src/table.rs:
##########
@@ -467,6 +469,27 @@ impl<'a> ScanArgs<'a> {
     pub fn limit(&self) -> Option<usize> {
         self.limit
     }
+
+    /// Set the statistics the caller would like the provider to answer for
+    /// this scan, if it can do so cheaply.
+    ///
+    /// Providers read these via [`Self::statistics_requests()`]; anything a
+    /// provider cannot answer cheaply it simply ignores. DataFusion's own
+    /// `TableProvider`s ignore this field — it exists so a request can be
+    /// threaded from a custom optimizer rule (which annotates
+    /// `TableScan::statistics_requests`) through to a custom provider.
+    pub fn with_statistics_requests(
+        mut self,
+        statistics_requests: &'a [StatisticsRequest],
+    ) -> Self {
+        self.statistics_requests = statistics_requests;
+        self
+    }
+
+    /// Get the statistics requests for the scan. Empty if none were set.

Review Comment:
   ```suggestion
       /// Get the statistics requests for the scan. Empty if none were set.
       ///
       /// See [`Self::with_statistics_requests`] for more details
   ```



##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -1694,3 +1694,44 @@ mod tests {
         all_ops.into_iter().collect()
     }
 }
+
+// ---------------------------------------------------------------------------

Review Comment:
   I think this comment should be made into a real `///` comment and folded 
into the doc strings of StatisticsRequests
   
   By being here it is likely invisible / harder to find I think



##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -1694,3 +1694,44 @@ mod tests {
         all_ops.into_iter().collect()
     }
 }
+
+// ---------------------------------------------------------------------------
+// Query-aware statistics requests.
+//
+// A small extension to the existing `Statistics` model: instead of "give me
+// everything you have for every column", a caller can ask for a specific list
+// of stats by name. `StatisticsRequest` is just that vocabulary — DataFusion
+// itself does not populate or consume it. It exists so a request can be
+// threaded from a `TableScan` (see `TableScan::statistics_requests`) through
+// `ScanArgs::statistics_requests` to a `TableProvider`, which is enough for a
+// query-aware statistics feature to be implemented outside of DataFusion.
+// ---------------------------------------------------------------------------
+
+use datafusion_common::Column;
+
+/// A statistic a caller would like a provider to supply, if it can do so
+/// cheaply.
+///
+/// Each variant maps onto a field of [`datafusion_common::Statistics`] /
+/// [`datafusion_common::ColumnStatistics`], so a provider that already
+/// populates one can answer the request trivially.
+#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
+pub enum StatisticsRequest {
+    /// Smallest non-null value of `column`.
+    Min(Column),

Review Comment:
   A column has a bunch of owned strings: 
https://github.com/alamb/datafusion/blob/0181c7906b47f9b17fbde92bf30b2d939a80230c/datafusion/common/src/column.rs#L30-L37
   
   What do you think about making this cheaper by making it take `&Column` or 
`Arc<Column>` rather than `Column`?



##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -515,8 +515,11 @@ impl LogicalPlanBuilder {
         filters: Vec<Expr>,
         fetch: Option<usize>,
     ) -> Result<Self> {
-        let table_scan =
-            TableScan::try_new(table_name, table_source, projection, filters, 
fetch)?;
+        let table_scan = TableScanBuilder::new(table_name, table_source)

Review Comment:
   this is a nice new API



##########
datafusion/expr/src/logical_plan/tree_node.rs:
##########
@@ -608,23 +608,15 @@ impl LogicalPlan {
                     Transformed::new(plan, exprs.transformed, exprs.tnr)
                 }
             }
-            LogicalPlan::TableScan(TableScan {
-                table_name,
-                source,
-                projection,
-                projected_schema,
-                filters,
-                fetch,
-            }) => filters.map_elements(f)?.update_data(|filters| {
-                LogicalPlan::TableScan(TableScan {
-                    table_name,
-                    source,
-                    projection,
-                    projected_schema,
-                    filters,
-                    fetch,
+            LogicalPlan::TableScan(mut scan) => {

Review Comment:
   I feel like the older code was easier to verify that it didn't carry along 
any other fields with expressions -- why change it here?
   
   Specificially, I worry if we ever add another field to `Filter` then we 
would be more likely to miss this callsite



##########
datafusion/core/tests/user_defined/statistics_requests.rs:
##########
@@ -0,0 +1,216 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! End-to-end test that a *custom* optimizer rule can annotate a
+//! `TableScan` with `StatisticsRequest`s and have them reach a *custom*
+//! `TableProvider`'s `scan_with_args`.
+//!
+//! DataFusion ships no rule that populates `TableScan::statistics_requests`

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to