alamb commented on code in PR #8023:
URL: https://github.com/apache/arrow-datafusion/pull/8023#discussion_r1379877679
##########
datafusion/optimizer/src/push_down_filter.rs:
##########
@@ -33,31 +34,93 @@ use itertools::Itertools;
use std::collections::{HashMap, HashSet};
use std::sync::Arc;
-/// Push Down Filter optimizer rule pushes filter clauses down the plan
+/// Optimizer rule for pushing (moving) filter expressions down in a plan so
+/// they are applied as early as possible.
+///
/// # Introduction
-/// A filter-commutative operation is an operation whose result of
filter(op(data)) = op(filter(data)).
-/// An example of a filter-commutative operation is a projection; a
counter-example is `limit`.
///
-/// The filter-commutative property is column-specific. An aggregate grouped
by A on SUM(B)
-/// can commute with a filter that depends on A only, but does not commute
with a filter that depends
-/// on SUM(B).
+/// The goal of this rule is to improve query performance by eliminating
+/// redundant work.
+///
+/// For example, given a plan that sorts all values where `a > 10`:
+///
+/// ```text
+/// Filter (a > 10)
+/// Sort (a, b)
+/// ```
+///
+/// A better plan is to filter the data *before* the Sort, which sorts fewer
+/// rows and therefore does less work overall:
+///
+/// ```text
+/// Sort (a, b)
+/// Filter (a > 10) <-- filter is moved before the sort
Review Comment:
```suggestion
/// Filter (a > 10) <-- Filter is moved before the sort
```
##########
datafusion/optimizer/src/push_down_filter.rs:
##########
@@ -33,31 +34,93 @@ use itertools::Itertools;
use std::collections::{HashMap, HashSet};
use std::sync::Arc;
-/// Push Down Filter optimizer rule pushes filter clauses down the plan
+/// Optimizer rule for pushing (moving) filter expressions down in a plan so
+/// they are applied as early as possible.
+///
/// # Introduction
-/// A filter-commutative operation is an operation whose result of
filter(op(data)) = op(filter(data)).
-/// An example of a filter-commutative operation is a projection; a
counter-example is `limit`.
///
-/// The filter-commutative property is column-specific. An aggregate grouped
by A on SUM(B)
-/// can commute with a filter that depends on A only, but does not commute
with a filter that depends
-/// on SUM(B).
+/// The goal of this rule is to improve query performance by eliminating
+/// redundant work.
+///
+/// For example, given a plan that sorts all values where `a > 10`:
+///
+/// ```text
+/// Filter (a > 10)
+/// Sort (a, b)
+/// ```
+///
+/// A better plan is to filter the data *before* the Sort, which sorts fewer
+/// rows and therefore does less work overall:
+///
+/// ```text
+/// Sort (a, b)
+/// Filter (a > 10) <-- filter is moved before the sort
+/// ```
+///
+/// However it is not always possible to push filters down. For example, given
a
+/// plan that finds the top 3 values and then keeps only those that are greater
+/// than 10, if the filter is pushed below the limit it would produce a
+/// different result.
+///
+/// ```text
+/// Filter (a > 10) <-- can not move this Filter before the limit
+/// Limit (fetch=3)
+/// Sort (a, b)
+/// ```
+///
+///
+/// More formally, a filter-commutative operation is an operation `op` that
+/// satisfies `filter(op(data)) = op(filter(data))`.
+///
+/// The filter-commutative property is plan and column-specific. A filter on
`a`
+/// can be pushed through a `Aggregate(group_by = [a], agg=[SUM(b))`. However,
a
+/// a filter on `SUM(b)` can not be pushed through the same aggregate.
Review Comment:
```suggestion
/// filter on `SUM(b)` can not be pushed through the same aggregate.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]