This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git


The following commit(s) were added to refs/heads/main by this push:
     new bec6714d8a docs: Document the TableProvider evaluation order for 
filter, limit and projection (#21091)
bec6714d8a is described below

commit bec6714d8a0251af3beb9967a21174290da0a318
Author: Andrew Lamb <[email protected]>
AuthorDate: Mon Mar 23 14:02:58 2026 -0400

    docs: Document the TableProvider evaluation order for filter, limit and 
projection (#21091)
    
    ## Which issue does this PR close?
    
    - Follow on to https://github.com/apache/datafusion/pull/21057
    
    ## Rationale for this change
    
    As mentioned by @hareshkh on
    https://github.com/apache/datafusion/pull/21057#discussion_r2966432536:
    
    It is not clear from the existing documentation that the (logical)
    evaluation order for push down operations is 'filter -> limit ->
    projection'
    
    this is the actual order implemented by the built in providers, but it
    wasn't documented anywhere explicitly
    
    ## What changes are included in this PR?
    
    1. Explicitly document the evaluation order on TableProvider
    2. Some drive by cleanups of the documentation
    
    ## Are these changes tested?
    
    By CI
    
    ## Are there any user-facing changes?
    
    <!--
    If there are user-facing changes then we may require documentation to be
    updated before approving the PR.
    -->
    
    <!--
    If there are any breaking changes to public APIs, please add the `api
    change` label.
    -->
---
 datafusion/catalog/src/table.rs | 80 ++++++++++++++++++++++++++---------------
 1 file changed, 51 insertions(+), 29 deletions(-)

diff --git a/datafusion/catalog/src/table.rs b/datafusion/catalog/src/table.rs
index c9b4e974c8..361589c5b6 100644
--- a/datafusion/catalog/src/table.rs
+++ b/datafusion/catalog/src/table.rs
@@ -84,10 +84,10 @@ pub trait TableProvider: Debug + Sync + Send {
         None
     }
 
-    /// Create an [`ExecutionPlan`] for scanning the table with optionally
-    /// specified `projection`, `filter` and `limit`, described below.
+    /// Create an [`ExecutionPlan`] for scanning the table with optional
+    /// `projection`, `filter`, and `limit`, described below.
     ///
-    /// The `ExecutionPlan` is responsible scanning the datasource's
+    /// The returned `ExecutionPlan` is responsible for scanning the 
datasource's
     /// partitions in a streaming, parallelized fashion.
     ///
     /// # Projection
@@ -96,33 +96,30 @@ pub trait TableProvider: Debug + Sync + Send {
     /// specified. The projection is a set of indexes of the fields in
     /// [`Self::schema`].
     ///
-    /// DataFusion provides the projection to scan only the columns actually
-    /// used in the query to improve performance, an optimization  called
-    /// "Projection Pushdown". Some datasources, such as Parquet, can use this
-    /// information to go significantly faster when only a subset of columns is
-    /// required.
+    /// DataFusion provides the projection so the scan reads only the columns
+    /// actually used in the query, an optimization called "Projection
+    /// Pushdown". Some datasources, such as Parquet, can use this information
+    /// to go significantly faster when only a subset of columns is required.
     ///
     /// # Filters
     ///
     /// A list of boolean filter [`Expr`]s to evaluate *during* the scan, in 
the
     /// manner specified by [`Self::supports_filters_pushdown`]. Only rows for
-    /// which *all* of the `Expr`s evaluate to `true` must be returned (aka the
-    /// expressions are `AND`ed together).
+    /// which *all* of the `Expr`s evaluate to `true` must be returned (that 
is,
+    /// the expressions are `AND`ed together).
     ///
-    /// To enable filter pushdown you must override
-    /// [`Self::supports_filters_pushdown`] as the default implementation does
-    /// not and `filters` will be empty.
+    /// To enable filter pushdown, override
+    /// [`Self::supports_filters_pushdown`]. The default implementation does 
not
+    /// push down filters, and `filters` will be empty.
     ///
-    /// DataFusion pushes filtering into the scans whenever possible
-    /// ("Filter Pushdown"), and depending on the format and the
-    /// implementation of the format, evaluating the predicate during the scan
-    /// can increase performance significantly.
+    /// DataFusion pushes filters into scans whenever possible ("Filter
+    /// Pushdown"). Depending on the data format and implementation, evaluating
+    /// predicates during the scan can significantly improve performance.
     ///
     /// ## Note: Some columns may appear *only* in Filters
     ///
-    /// In certain cases, a query may only use a certain column in a Filter 
that
-    /// has been completely pushed down to the scan. In this case, the
-    /// projection will not contain all the columns found in the filter
+    /// In some cases, a query may use a column only in a filter and the
+    /// projection will not contain all columns referenced by the filter
     /// expressions.
     ///
     /// For example, given the query `SELECT t.a FROM t WHERE t.b > 5`,
@@ -154,15 +151,40 @@ pub trait TableProvider: Debug + Sync + Send {
     ///
     /// # Limit
     ///
-    /// If `limit` is specified,  must only produce *at least* this many rows,
-    /// (though it may return more).  Like Projection Pushdown and Filter
-    /// Pushdown, DataFusion pushes `LIMIT`s  as far down in the plan as
-    /// possible, called "Limit Pushdown" as some sources can use this
-    /// information to improve their performance. Note that if there are any
-    /// Inexact filters pushed down, the LIMIT cannot be pushed down. This is
-    /// because inexact filters do not guarantee that every filtered row is
-    /// removed, so applying the limit could lead to too few rows being 
available
-    /// to return as a final result.
+    /// If `limit` is specified, the scan must produce *at least* this many
+    /// rows, though it may return more. Like Projection Pushdown and Filter
+    /// Pushdown, DataFusion pushes `LIMIT`s as far down in the plan as
+    /// possible. This is called "Limit Pushdown", and some sources can use the
+    /// information to improve performance.
+    ///
+    /// Note: If any pushed-down filters are `Inexact`, the `LIMIT` cannot be
+    /// pushed down. Inexact filters do not guarantee that every filtered row 
is
+    /// removed, so applying the limit could leave too few rows to return in 
the
+    /// final result.
+    ///
+    /// # Evaluation Order
+    ///
+    /// The logical evaluation order is `filters`, then `limit`, then
+    /// `projection`.
+    ///
+    /// Note that `limit` applies to the filtered result, not to the unfiltered
+    /// input, and `projection` affects only which columns are returned, not
+    /// which rows qualify.
+    ///
+    /// For example, if a scan receives:
+    ///
+    /// - `projection = [a]`
+    /// - `filters = [b > 5]`
+    /// - `limit = Some(3)`
+    ///
+    /// It must logically produce results equivalent to:
+    ///
+    /// ```text
+    /// PROJECTION a (LIMIT 3 (SCAN WHERE b > 5))
+    /// ```
+    ///
+    /// As noted above, columns referenced only by pushed-down filters may be
+    /// absent from `projection`.
     async fn scan(
         &self,
         state: &dyn Session,


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to