alamb commented on code in PR #7467:
URL: https://github.com/apache/arrow-datafusion/pull/7467#discussion_r1321905232
##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -149,6 +153,76 @@ impl<S: SimplifyInfo> ExprSimplifier<S> {
expr.rewrite(&mut expr_rewrite)
}
+
+ /// Input guarantees and simplify the expression.
+ ///
+ /// The guarantees can simplify expressions. For example, if a column `x`
is
+ /// guaranteed to be `3`, then the expression `x > 1` can be replaced by
the
+ /// literal `true`.
+ ///
+ /// The guarantees are provided as an iterator of `(Expr,
NullableInterval)`
+ /// pairs, where the [Expr] is a column reference and the
[NullableInterval]
+ /// is an interval representing the known possible values of that column.
+ ///
+ /// ```rust
+ /// use arrow::datatypes::{DataType, Field, Schema};
+ /// use datafusion_expr::{col, lit, Expr};
+ /// use datafusion_common::{Result, ScalarValue, ToDFSchema};
+ /// use datafusion_physical_expr::execution_props::ExecutionProps;
+ /// use datafusion_physical_expr::intervals::{Interval, NullableInterval};
+ /// use datafusion_optimizer::simplify_expressions::{
+ /// ExprSimplifier, SimplifyContext};
+ ///
+ /// let schema = Schema::new(vec![
+ /// Field::new("x", DataType::Int64, false),
+ /// Field::new("y", DataType::UInt32, false),
+ /// Field::new("z", DataType::Int64, false),
+ /// ])
+ /// .to_dfschema_ref().unwrap();
+ ///
+ /// // Create the simplifier
+ /// let props = ExecutionProps::new();
+ /// let context = SimplifyContext::new(&props)
+ /// .with_schema(schema);
+ /// let simplifier = ExprSimplifier::new(context);
+ ///
+ /// // Expression: (x >= 3) AND (y + 2 < 10) AND (z > 5)
+ /// let expr_x = col("x").gt_eq(lit(3_i64));
+ /// let expr_y = (col("y") + lit(2_u32)).lt(lit(10_u32));
+ /// let expr_z = col("z").gt(lit(5_i64));
+ /// let expr = expr_x.and(expr_y).and(expr_z.clone());
+ ///
+ /// let guarantees = vec![
+ /// // x ∈ [3, 5]
+ /// (
+ /// col("x"),
+ /// NullableInterval {
+ /// values: Interval::make(Some(3_i64), Some(5_i64), (false,
false)),
+ /// is_valid: Interval::CERTAINLY_TRUE,
+ /// }
+ /// ),
+ /// // y = 3
+ /// (col("y"), NullableInterval::from(&ScalarValue::UInt32(Some(3)))),
+ /// ];
+ /// let output = simplifier.simplify_with_guarantees(expr,
&guarantees).unwrap();
+ /// // Expression becomes: true AND true AND (z > 5), which simplifies to
+ /// // z > 5.
+ /// assert_eq!(output, expr_z);
+ /// ```
+ pub fn simplify_with_guarantees<'a>(
Review Comment:
Another potential API for this might be to store the guarantees on the
simplifier -- like
```rust
let expr = ExprSimplifier::new(context)
.with_guarantees(guarantees)
.simplify()?
```
The downside is that the guarantees would have to be owned (aka a `Vec`)
So I think this API is fine, I just wanted to mention the possibility
##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -149,6 +153,76 @@ impl<S: SimplifyInfo> ExprSimplifier<S> {
expr.rewrite(&mut expr_rewrite)
}
+
+ /// Input guarantees and simplify the expression.
Review Comment:
❤️
##########
datafusion/physical-expr/src/intervals/interval_aritmetic.rs:
##########
@@ -686,6 +715,231 @@ fn calculate_cardinality_based_on_bounds(
}
}
+/// The null status of an [NullableInterval].
+///
+/// This is an internal convenience that can be used in match statements
+/// (unlike Interval).
+#[derive(Debug, Clone, PartialEq, Eq)]
+enum NullStatus {
+ /// The interval is guaranteed to be non-null.
+ Never,
+ /// The interval is guaranteed to be null.
+ Always,
+ /// The interval isn't guaranteed to never be null or always be null.
+ Maybe,
+}
+
+/// An [Interval] that also tracks null status using a boolean interval.
+///
+/// This represents values that may be in a particular range or be null.
+///
+/// # Examples
+///
+/// ```
+/// use datafusion_physical_expr::intervals::{Interval, NullableInterval};
+/// use datafusion_common::ScalarValue;
+///
+/// // [1, 2) U {NULL}
+/// NullableInterval {
+/// values: Interval::make(Some(1), Some(2), (false, true)),
+/// is_valid: Interval::UNCERTAIN,
+/// };
+///
+/// // (0, ∞)
+/// NullableInterval {
+/// values: Interval::make(Some(0), None, (true, true)),
+/// is_valid: Interval::CERTAINLY_TRUE,
+/// };
+///
+/// // {NULL}
+/// NullableInterval::from(&ScalarValue::Int32(None));
+///
+/// // {4}
+/// NullableInterval::from(&ScalarValue::Int32(Some(4)));
+/// ```
+#[derive(Debug, Clone, PartialEq, Eq, Default)]
+pub struct NullableInterval {
+ /// The interval for the values
+ pub values: Interval,
+ /// A boolean interval representing whether the value is certainly valid
+ /// (not null), certainly null, or has an unknown validity. This takes
+ /// precedence over the values in the interval: if this field is equal to
+ /// [Interval::CERTAINLY_FALSE], then the interval is certainly null
+ /// regardless of what `values` is.
+ pub is_valid: Interval,
Review Comment:
I got confused about this enumeration for a while as it seems to use a
boolean interval to represent one of three states. It took me a while to grok
that `CERTAINLY_TRUE` mean that the the interval was not null, though the
concept is very well documented ❤️
I think this representation also might be more error prone as an invalid
state can be encoded (for example, what happens if `is_valid` contains
`ScalarValue::Float32`, or what if the code uses `values` when is_valid is
`CERTAINLY_FALSE`)
Perhaps we can use an `enum` and let the rust compiler and type system
ensure we always have valid states for `NullableInterval` with something like :
```rust
pub enum NullableInterval {
/// The value is always null in this interval
Null,
/// The value may or may not be null in this interval. If it is non null
its value is within
/// the specified values interval
MaybeNull { values : Interval },
/// The value is definitely not null in this interval and is within values
NotNull { vaules: Interval },
}
```
I think that might make the intent of some of the code above clearer as
well, rather than checking for null ness using tests for `CERTAINLY_FALSE`
##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -149,6 +153,76 @@ impl<S: SimplifyInfo> ExprSimplifier<S> {
expr.rewrite(&mut expr_rewrite)
}
+
+ /// Input guarantees and simplify the expression.
+ ///
+ /// The guarantees can simplify expressions. For example, if a column `x`
is
+ /// guaranteed to be `3`, then the expression `x > 1` can be replaced by
the
+ /// literal `true`.
+ ///
+ /// The guarantees are provided as an iterator of `(Expr,
NullableInterval)`
+ /// pairs, where the [Expr] is a column reference and the
[NullableInterval]
+ /// is an interval representing the known possible values of that column.
+ ///
+ /// ```rust
+ /// use arrow::datatypes::{DataType, Field, Schema};
+ /// use datafusion_expr::{col, lit, Expr};
+ /// use datafusion_common::{Result, ScalarValue, ToDFSchema};
+ /// use datafusion_physical_expr::execution_props::ExecutionProps;
+ /// use datafusion_physical_expr::intervals::{Interval, NullableInterval};
+ /// use datafusion_optimizer::simplify_expressions::{
+ /// ExprSimplifier, SimplifyContext};
+ ///
+ /// let schema = Schema::new(vec![
+ /// Field::new("x", DataType::Int64, false),
+ /// Field::new("y", DataType::UInt32, false),
+ /// Field::new("z", DataType::Int64, false),
+ /// ])
+ /// .to_dfschema_ref().unwrap();
+ ///
+ /// // Create the simplifier
+ /// let props = ExecutionProps::new();
+ /// let context = SimplifyContext::new(&props)
+ /// .with_schema(schema);
+ /// let simplifier = ExprSimplifier::new(context);
+ ///
+ /// // Expression: (x >= 3) AND (y + 2 < 10) AND (z > 5)
Review Comment:
this is really cool @wjones127
##########
datafusion/optimizer/src/simplify_expressions/guarantees.rs:
##########
@@ -0,0 +1,541 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Simplifier implementation for
[ExprSimplifier::simplify_with_guarantees()][crate::simplify_expressions::expr_simplifier::ExprSimplifier::simplify_with_guarantees].
+use datafusion_common::{tree_node::TreeNodeRewriter, DataFusionError, Result};
+use datafusion_expr::{expr::InList, lit, Between, BinaryExpr, Expr, Operator};
+use std::collections::HashMap;
+
+use datafusion_physical_expr::intervals::{Interval, IntervalBound,
NullableInterval};
+
+/// Rewrite expressions to incorporate guarantees.
+pub(crate) struct GuaranteeRewriter<'a> {
+ intervals: HashMap<&'a Expr, &'a NullableInterval>,
+}
+
+impl<'a> GuaranteeRewriter<'a> {
+ pub fn new(
+ guarantees: impl IntoIterator<Item = &'a (Expr, NullableInterval)>,
+ ) -> Self {
+ Self {
+ intervals: guarantees.into_iter().map(|(k, v)| (k, v)).collect(),
+ }
+ }
+}
+
+impl<'a> TreeNodeRewriter for GuaranteeRewriter<'a> {
+ type N = Expr;
+
+ fn mutate(&mut self, expr: Expr) -> Result<Expr> {
+ match &expr {
+ Expr::IsNull(inner) => {
+ if let Some(interval) = self.intervals.get(inner.as_ref()) {
+ if interval.is_valid == Interval::CERTAINLY_FALSE {
Review Comment:
stylistically I think I would find this easier to read if it was in a method
like
```rust
if interval.always_null() {
...
} else if interval.alway_not_null() {
..
} else {
...
}
```
If you use the `enum` proposal below for `NullableInteval` this would
naturally become a match statement like
```
match self.intervals.get(inner.as_ref()) {
Some(NullableInterval::Null) => Ok(lit(true)),
Some(NullableInterval::NotNull{..}) => Ok(lit(false)),
_ => Ok(expr)
}
```
Which I think may express the intent of the code more concisely and clearly
##########
datafusion/physical-expr/src/intervals/interval_aritmetic.rs:
##########
@@ -686,6 +715,231 @@ fn calculate_cardinality_based_on_bounds(
}
}
+/// The null status of an [NullableInterval].
+///
+/// This is an internal convenience that can be used in match statements
+/// (unlike Interval).
+#[derive(Debug, Clone, PartialEq, Eq)]
+enum NullStatus {
+ /// The interval is guaranteed to be non-null.
+ Never,
+ /// The interval is guaranteed to be null.
+ Always,
+ /// The interval isn't guaranteed to never be null or always be null.
+ Maybe,
+}
+
+/// An [Interval] that also tracks null status using a boolean interval.
+///
+/// This represents values that may be in a particular range or be null.
+///
+/// # Examples
+///
+/// ```
+/// use datafusion_physical_expr::intervals::{Interval, NullableInterval};
+/// use datafusion_common::ScalarValue;
+///
+/// // [1, 2) U {NULL}
+/// NullableInterval {
+/// values: Interval::make(Some(1), Some(2), (false, true)),
+/// is_valid: Interval::UNCERTAIN,
+/// };
+///
+/// // (0, ∞)
+/// NullableInterval {
+/// values: Interval::make(Some(0), None, (true, true)),
+/// is_valid: Interval::CERTAINLY_TRUE,
+/// };
+///
+/// // {NULL}
+/// NullableInterval::from(&ScalarValue::Int32(None));
+///
+/// // {4}
+/// NullableInterval::from(&ScalarValue::Int32(Some(4)));
+/// ```
+#[derive(Debug, Clone, PartialEq, Eq, Default)]
+pub struct NullableInterval {
+ /// The interval for the values
+ pub values: Interval,
+ /// A boolean interval representing whether the value is certainly valid
+ /// (not null), certainly null, or has an unknown validity. This takes
+ /// precedence over the values in the interval: if this field is equal to
+ /// [Interval::CERTAINLY_FALSE], then the interval is certainly null
+ /// regardless of what `values` is.
+ pub is_valid: Interval,
Review Comment:
I got confused about this enumeration for a while as it seems to use a
boolean interval to represent one of three states. It took me a while to grok
that `CERTAINLY_TRUE` mean that the the interval was not null, though the
concept is very well documented ❤️
I think this representation also might be more error prone as an invalid
state can be encoded (for example, what happens if `is_valid` contains
`ScalarValue::Float32`, or what if the code uses `values` when is_valid is
`CERTAINLY_FALSE`)
Perhaps we can use an `enum` and let the rust compiler and type system
ensure we always have valid states for `NullableInterval` with something like :
```rust
pub enum NullableInterval {
/// The value is always null in this interval
Null,
/// The value may or may not be null in this interval. If it is non null
its value is within
/// the specified values interval
MaybeNull { values : Interval },
/// The value is definitely not null in this interval and is within values
NotNull { vaules: Interval },
}
```
I think that might make the intent of some of the code above clearer as
well, rather than checking for null ness using tests for `CERTAINLY_FALSE`
##########
datafusion/physical-expr/src/intervals/interval_aritmetic.rs:
##########
@@ -383,6 +383,17 @@ impl Interval {
}
}
+ /// Compute the logical negation of this (boolean) interval.
+ pub(crate) fn not(&self) -> Self {
Review Comment:
it seems like this could be misused if the interval did not contain boolean
values -- see my comment below about `NullableInterval` for an alternate
encoding
##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -149,6 +153,76 @@ impl<S: SimplifyInfo> ExprSimplifier<S> {
expr.rewrite(&mut expr_rewrite)
}
+
+ /// Input guarantees and simplify the expression.
Review Comment:
❤️
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]