alamb commented on code in PR #19722:
URL: https://github.com/apache/datafusion/pull/19722#discussion_r2710111871


##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -1969,12 +1972,93 @@ impl TreeNodeRewriter for Simplifier<'_> {
                 }))
             }
 
+            // =======================================
+            // preimage_in_comparison
+            // =======================================
+            //
+            // For case:
+            // date_part('YEAR', expr) op literal
+            //
+            // Background:
+            // Datasources such as Parquet can prune partitions using simple 
predicates,
+            // but they cannot do so for complex expressions.
+            // For a complex predicate like `date_part('YEAR', c1) < 2000`, 
pruning is not possible.
+            // After rewriting it to `c1 < 2000-01-01`, pruning becomes 
feasible.
+            // Rewrites use inclusive lower and exclusive upper bounds when
+            // translating an equality into a range.
+            // NOTE: we only consider immutable UDFs with non Null literal RHS 
values and
+            // UDFs that provide both `preimage` and `column_expr`.

Review Comment:
   I think the reference to column_expr may be out of date



##########
datafusion/expr/src/preimage.rs:
##########
@@ -0,0 +1,28 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion_expr_common::interval_arithmetic::Interval;
+
+use crate::Expr;
+
+pub enum PreimageResult {
+    /// No preimage exists for the specified value
+    None,
+    /// The expression always evaluates to the specified constant
+    /// given that `expr` is within the interval
+    Range { expr: Expr, interval: Box<Interval> },

Review Comment:
   Is there any reason to `Box` this? I think it might be simpler if it was 
`Interval`



##########
datafusion/optimizer/src/simplify_expressions/udf_preimage.rs:
##########
@@ -0,0 +1,368 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion_common::{Result, internal_err, tree_node::Transformed};
+use datafusion_expr::{Expr, Operator, and, lit, or, simplify::SimplifyContext};
+use datafusion_expr_common::interval_arithmetic::Interval;
+
+/// Rewrites a binary expression using its "preimage"
+///
+/// Specifically it rewrites expressions of the form `<expr> OP x` (e.g. 
`<expr> =
+/// x`) where `<expr>` is known to have a pre-image (aka the entire single
+/// range for which it is valid) and `x` is not `NULL`
+///
+/// This rewrite is described in the [ClickHouse Paper] and is particularly

Review Comment:
   We probably don't need to repeat this description to so many times -- we can 
probably consolidate everything into `ScalarUDFImpl::preimge` and leave a link 
here



##########
datafusion/optimizer/src/simplify_expressions/udf_preimage.rs:
##########
@@ -0,0 +1,368 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion_common::{Result, internal_err, tree_node::Transformed};
+use datafusion_expr::{Expr, Operator, and, lit, or, simplify::SimplifyContext};
+use datafusion_expr_common::interval_arithmetic::Interval;
+
+/// Rewrites a binary expression using its "preimage"
+///
+/// Specifically it rewrites expressions of the form `<expr> OP x` (e.g. 
`<expr> =
+/// x`) where `<expr>` is known to have a pre-image (aka the entire single
+/// range for which it is valid) and `x` is not `NULL`
+///
+/// This rewrite is described in the [ClickHouse Paper] and is particularly
+/// useful for simplifying expressions `date_part` or equivalent functions. The
+/// idea is that if you have an expression like `date_part(YEAR, k) = 2024` 
and you
+/// can find a [preimage] for `date_part(YEAR, k)`, which is the range of dates
+/// covering the entire year of 2024. Thus, you can rewrite the expression to
+/// `k >= '2024-01-01' AND k < '2025-01-01'`, which uses an inclusive lower 
bound
+/// and exclusive upper bound and is often more optimizable.
+///
+/// [ClickHouse Paper]:  https://www.vldb.org/pvldb/vol17/p3731-schulze.pdf
+/// [preimage]: https://en.wikipedia.org/wiki/Image_(mathematics)#Inverse_image
+///
+pub(super) fn rewrite_with_preimage(
+    _info: &SimplifyContext,
+    preimage_interval: Interval,
+    op: Operator,
+    expr: Expr,
+) -> Result<Transformed<Expr>> {
+    let (lower, upper) = preimage_interval.into_bounds();
+    let (lower, upper) = (lit(lower), lit(upper));
+
+    let rewritten_expr = match op {
+        // <expr> < x   ==>  <expr> < lower
+        // <expr> >= x  ==>  <expr> >= lower
+        Operator::Lt => expr.lt(lower),
+        Operator::GtEq => expr.gt_eq(lower),
+        // <expr> > x ==> <expr> >= upper
+        Operator::Gt => expr.gt_eq(upper),
+        // <expr> <= x ==> <expr> < upper
+        Operator::LtEq => expr.lt(upper),
+        // <expr> = x ==> (<expr> >= lower) and (<expr> < upper)
+        //
+        // <expr> is not distinct from x ==> (<expr> is NULL and x is NULL) or 
((<expr> >= lower) and (<expr> < upper))

Review Comment:
   ```suggestion
           // <expr> is not distinct from x ==> (<expr> is NULL) or ((<expr> >= 
lower) and (<expr> < upper))
   ```



##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -1969,12 +1972,93 @@ impl TreeNodeRewriter for Simplifier<'_> {
                 }))
             }
 
+            // =======================================
+            // preimage_in_comparison
+            // =======================================
+            //
+            // For case:
+            // date_part('YEAR', expr) op literal
+            //
+            // Background:
+            // Datasources such as Parquet can prune partitions using simple 
predicates,
+            // but they cannot do so for complex expressions.
+            // For a complex predicate like `date_part('YEAR', c1) < 2000`, 
pruning is not possible.
+            // After rewriting it to `c1 < 2000-01-01`, pruning becomes 
feasible.
+            // Rewrites use inclusive lower and exclusive upper bounds when
+            // translating an equality into a range.
+            // NOTE: we only consider immutable UDFs with non Null literal RHS 
values and
+            // UDFs that provide both `preimage` and `column_expr`.
+            Expr::BinaryExpr(BinaryExpr { left, op, right }) => {
+                use datafusion_expr::Operator::*;
+                let is_preimage_op = matches!(
+                    op,
+                    Eq | NotEq
+                        | Lt
+                        | LtEq
+                        | Gt
+                        | GtEq
+                        | IsDistinctFrom
+                        | IsNotDistinctFrom
+                );
+                if !is_preimage_op || is_null(&right) {
+                    return Ok(Transformed::no(Expr::BinaryExpr(BinaryExpr {
+                        left,
+                        op,
+                        right,
+                    })));
+                }
+
+                if let PreimageResult::Range { interval, expr } =
+                    get_preimage(left.as_ref(), right.as_ref(), info)?
+                {
+                    rewrite_with_preimage(info, *interval, op, expr)?
+                } else if let Some(swapped) = op.swap() {
+                    if let PreimageResult::Range { interval, expr } =
+                        get_preimage(right.as_ref(), left.as_ref(), info)?
+                    {
+                        rewrite_with_preimage(info, *interval, swapped, expr)?
+                    } else {
+                        Transformed::no(Expr::BinaryExpr(BinaryExpr { left, 
op, right }))
+                    }
+                } else {
+                    Transformed::no(Expr::BinaryExpr(BinaryExpr { left, op, 
right }))
+                }
+            }
+
             // no additional rewrites possible
             expr => Transformed::no(expr),
         })
     }
 }
 
+fn get_preimage(
+    left_expr: &Expr,
+    right_expr: &Expr,
+    info: &SimplifyContext,
+) -> Result<PreimageResult> {
+    let Expr::ScalarFunction(ScalarFunction { func, args }) = left_expr else {
+        return Ok(PreimageResult::None);
+    };
+    if !is_literal_or_literal_cast(right_expr) {
+        return Ok(PreimageResult::None);
+    }
+    if func.signature().volatility != Volatility::Immutable {

Review Comment:
   Also for a follow on PR, I think it would be safe to rewrite stable 
functions (whose values don't change during the statement)



##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -1969,12 +1972,101 @@ impl TreeNodeRewriter for Simplifier<'_> {
                 }))
             }
 
+            // =======================================
+            // preimage_in_comparison
+            // =======================================
+            //
+            // For case:
+            // date_part('YEAR', expr) op literal
+            //
+            // Background:
+            // Datasources such as Parquet can prune partitions using simple 
predicates,
+            // but they cannot do so for complex expressions.
+            // For a complex predicate like `date_part('YEAR', c1) < 2000`, 
pruning is not possible.
+            // After rewriting it to `c1 < 2000-01-01`, pruning becomes 
feasible.
+            // Rewrites use inclusive lower and exclusive upper bounds when
+            // translating an equality into a range.
+            // NOTE: we only consider immutable UDFs with literal RHS values 
and
+            // UDFs that provide both `preimage` and `column_expr`.
+            Expr::BinaryExpr(BinaryExpr { left, op, right }) => {
+                use datafusion_expr::Operator::*;
+                let is_preimage_op = matches!(
+                    op,
+                    Eq | NotEq
+                        | Lt
+                        | LtEq
+                        | Gt
+                        | GtEq
+                        | IsDistinctFrom
+                        | IsNotDistinctFrom
+                );
+                if !is_preimage_op {
+                    return Ok(Transformed::no(Expr::BinaryExpr(BinaryExpr {
+                        left,
+                        op,
+                        right,
+                    })));
+                }
+
+                if let (Some(interval), Some(col_expr)) =
+                    get_preimage(left.as_ref(), right.as_ref(), info)?
+                {
+                    rewrite_with_preimage(info, interval, op, 
Box::new(col_expr))?
+                } else if let Some(swapped) = op.swap() {
+                    if let (Some(interval), Some(col_expr)) =
+                        get_preimage(right.as_ref(), left.as_ref(), info)?
+                    {
+                        rewrite_with_preimage(
+                            info,
+                            interval,
+                            swapped,
+                            Box::new(col_expr),
+                        )?
+                    } else {
+                        Transformed::no(Expr::BinaryExpr(BinaryExpr { left, 
op, right }))
+                    }
+                } else {
+                    Transformed::no(Expr::BinaryExpr(BinaryExpr { left, op, 
right }))
+                }
+            }
+
             // no additional rewrites possible
             expr => Transformed::no(expr),
         })
     }
 }
 
+fn get_preimage(
+    left_expr: &Expr,
+    right_expr: &Expr,
+    info: &SimplifyContext,
+) -> Result<(Option<Interval>, Option<Expr>)> {
+    let Expr::ScalarFunction(ScalarFunction { func, args }) = left_expr else {
+        return Ok((None, None));
+    };
+    if !is_literal_or_literal_cast(right_expr) {

Review Comment:
   I think this is still an open question, but it is ok to handle as a follow 
on PR (aka widen the expressions)



##########
datafusion/expr/src/preimage.rs:
##########
@@ -0,0 +1,28 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion_expr_common::interval_arithmetic::Interval;
+
+use crate::Expr;
+
+pub enum PreimageResult {

Review Comment:
   I recommend some small comment with context here
   
   Like
   ```suggestion
   /// Return from [`ScalarUDFImpl::preimage`]
   pub enum PreimageResult {
   ```



##########
datafusion/optimizer/src/simplify_expressions/udf_preimage.rs:
##########
@@ -0,0 +1,368 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion_common::{Result, internal_err, tree_node::Transformed};
+use datafusion_expr::{Expr, Operator, and, lit, or, simplify::SimplifyContext};
+use datafusion_expr_common::interval_arithmetic::Interval;
+
+/// Rewrites a binary expression using its "preimage"
+///
+/// Specifically it rewrites expressions of the form `<expr> OP x` (e.g. 
`<expr> =
+/// x`) where `<expr>` is known to have a pre-image (aka the entire single
+/// range for which it is valid) and `x` is not `NULL`
+///
+/// This rewrite is described in the [ClickHouse Paper] and is particularly
+/// useful for simplifying expressions `date_part` or equivalent functions. The
+/// idea is that if you have an expression like `date_part(YEAR, k) = 2024` 
and you
+/// can find a [preimage] for `date_part(YEAR, k)`, which is the range of dates
+/// covering the entire year of 2024. Thus, you can rewrite the expression to
+/// `k >= '2024-01-01' AND k < '2025-01-01'`, which uses an inclusive lower 
bound
+/// and exclusive upper bound and is often more optimizable.
+///
+/// [ClickHouse Paper]:  https://www.vldb.org/pvldb/vol17/p3731-schulze.pdf
+/// [preimage]: https://en.wikipedia.org/wiki/Image_(mathematics)#Inverse_image
+///
+pub(super) fn rewrite_with_preimage(
+    _info: &SimplifyContext,
+    preimage_interval: Interval,
+    op: Operator,
+    expr: Expr,
+) -> Result<Transformed<Expr>> {
+    let (lower, upper) = preimage_interval.into_bounds();
+    let (lower, upper) = (lit(lower), lit(upper));
+
+    let rewritten_expr = match op {
+        // <expr> < x   ==>  <expr> < lower
+        // <expr> >= x  ==>  <expr> >= lower

Review Comment:
   minor nit is that this comment I think is on the wrong long



##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -1969,12 +1972,93 @@ impl TreeNodeRewriter for Simplifier<'_> {
                 }))
             }
 
+            // =======================================

Review Comment:
   This code is now really looking great



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to