Re: [PR] Simplify comparisons and binary operations involving NULL [datafusion]

via GitHub Sun, 10 Aug 2025 03:14:40 -0700


alamb commented on code in PR #17088:
URL: https://github.com/apache/datafusion/pull/17088#discussion_r2265213324



##########
datafusion/sqllogictest/test_files/errors.slt:
##########
@@ -168,8 +168,9 @@ CREATE TABLE tab0(col0 INTEGER, col1 INTEGER, col2 INTEGER);
 statement ok
 INSERT INTO tab0 VALUES(83,0,38);
 
-query error DataFusion error: Arrow error: Divide by zero error
+query I

Review Comment:
   This is an interesting change -- we short-circuit the evaluation, so the 
divide-by-zero error no longer happens
   
   This happens in other areas too, so I think this change is consistent with 
other parts of DataFusion as well



##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -1947,6 +1858,53 @@ fn has_common_conjunction(lhs: &Expr, rhs: &Expr) -> bool {
     iter_conjunction(rhs).any(|e| lhs_set.contains(&e) && !e.is_volatile())
 }
 
+fn binary_op_null_on_null(op: Operator) -> bool {

Review Comment:
   I recommend making this a function on `Operator` so it is more discoverable 
-- similar to 
https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Operator.html#method.is_numerical_operators
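
A minimal sketch of what such a method could look like. The `Op` enum below is a hypothetical stand-in for DataFusion's `Operator` with only a handful of variants, and the method name `returns_null_on_null` is an assumption, not the name used in the PR. The classification follows standard SQL semantics (comparisons and arithmetic propagate NULL; `AND`, `OR`, and the `IS [NOT] DISTINCT FROM` family do not) and may differ from the PR's actual list.

```rust
/// Hypothetical stand-in for DataFusion's `Operator` (assumption: the real
/// enum has many more variants; only a few are modelled here).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Eq,
    Lt,
    Gt,
    Plus,
    Multiply,
    And,
    Or,
    IsDistinctFrom,
    IsNotDistinctFrom,
}

impl Op {
    /// Returns true if the operator yields NULL whenever either input is NULL.
    /// The method name is illustrative only.
    pub fn returns_null_on_null(self) -> bool {
        match self {
            // Comparisons and arithmetic propagate NULL.
            Op::Eq | Op::Lt | Op::Gt | Op::Plus | Op::Multiply => true,
            // AND / OR follow three-valued logic (FALSE AND NULL is FALSE),
            // and IS [NOT] DISTINCT FROM never returns NULL.
            Op::And | Op::Or | Op::IsDistinctFrom | Op::IsNotDistinctFrom => false,
        }
    }
}
```

With a method like this, the simplifier could ask `op.returns_null_on_null()` at the call site instead of calling a free function that is private to `expr_simplifier.rs`.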



##########
datafusion/sqllogictest/test_files/predicates.slt:
##########
@@ -777,6 +777,52 @@ physical_plan
 16)--------------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
 17)----------------------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/tpch-csv/part.csv]]}, projection=[p_partkey, p_brand], file_type=csv, has_header=true
 
+# Simplification of a binary operator with a NULL value
+
+statement ok
+create table t(x int) as values (1), (2), (3);
+
+query TT
+EXPLAIN FORMAT INDENT SELECT x > NULL FROM t;
+----
+logical_plan
+01)Projection: Boolean(NULL) AS t.x > NULL
+02)--TableScan: t projection=[]
+physical_plan
+01)ProjectionExec: expr=[NULL as t.x > NULL]
+02)--DataSourceExec: partitions=1, partition_sizes=[1]
+
+query TT
+EXPLAIN FORMAT INDENT SELECT * FROM t WHERE x > NULL;
+----
+logical_plan EmptyRelation
+physical_plan EmptyExec
+
+query TT
+EXPLAIN FORMAT INDENT SELECT * FROM t WHERE x < 5 AND (10 * NULL < x);
+----
+logical_plan
+01)Filter: t.x < Int32(5) AND Boolean(NULL)
+02)--TableScan: t projection=[x]
+physical_plan
+01)CoalesceBatchesExec: target_batch_size=8192
+02)--FilterExec: x@0 < 5 AND NULL

Review Comment:
   Isn't `<expr> AND NULL` always `NULL` too? Maybe this is a potential future 
optimization, or I am forgetting the boolean tristate logic rules in this case
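
For reference, a small sketch of SQL's three-valued `AND` and `OR`, modelling NULL as `None`. The helper names `and3` and `or3` are made up for illustration and are not DataFusion APIs. Under these rules `<expr> AND NULL` is `NULL` when `<expr>` is true or NULL, but `FALSE` when `<expr>` is false, so it cannot be folded to `NULL` in general.

```rust
/// SQL three-valued AND; `None` stands for NULL.
/// (Hypothetical helper, not part of DataFusion.)
pub fn and3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        // FALSE dominates: FALSE AND anything is FALSE, even NULL.
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        // Every remaining combination involves NULL and no FALSE.
        _ => None,
    }
}

/// SQL three-valued OR; `None` stands for NULL.
pub fn or3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        // TRUE dominates: TRUE OR anything is TRUE, even NULL.
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        // Every remaining combination involves NULL and no TRUE.
        _ => None,
    }
}
```

For example, `and3(Some(false), None)` is `Some(false)` while `and3(Some(true), None)` is `None`, which is why the plan above keeps `x@0 < 5 AND NULL` as written.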



##########
datafusion/sqllogictest/test_files/predicates.slt:
##########
@@ -777,6 +777,52 @@ physical_plan
 16)--------------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
 17)----------------------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/tpch-csv/part.csv]]}, projection=[p_partkey, p_brand], file_type=csv, has_header=true
 
+# Simplification of a binary operator with a NULL value
+
+statement ok
+create table t(x int) as values (1), (2), (3);
+
+query TT
+EXPLAIN FORMAT INDENT SELECT x > NULL FROM t;
+----
+logical_plan
+01)Projection: Boolean(NULL) AS t.x > NULL
+02)--TableScan: t projection=[]
+physical_plan
+01)ProjectionExec: expr=[NULL as t.x > NULL]
+02)--DataSourceExec: partitions=1, partition_sizes=[1]
+
+query TT
+EXPLAIN FORMAT INDENT SELECT * FROM t WHERE x > NULL;
+----
+logical_plan EmptyRelation
+physical_plan EmptyExec
+
+query TT
+EXPLAIN FORMAT INDENT SELECT * FROM t WHERE x < 5 AND (10 * NULL < x);
+----
+logical_plan
+01)Filter: t.x < Int32(5) AND Boolean(NULL)
+02)--TableScan: t projection=[x]
+physical_plan
+01)CoalesceBatchesExec: target_batch_size=8192
+02)--FilterExec: x@0 < 5 AND NULL
+03)----DataSourceExec: partitions=1, partition_sizes=[1]
+
+query TT
+EXPLAIN FORMAT INDENT SELECT * FROM t WHERE x < 5 OR (10 * NULL < x);
+----
+logical_plan
+01)Filter: t.x < Int32(5) OR Boolean(NULL)
+02)--TableScan: t projection=[x]
+physical_plan
+01)CoalesceBatchesExec: target_batch_size=8192
+02)--FilterExec: x@0 < 5 OR NULL

Review Comment:
   Yes, the difference between Filters and Projections for null semantics has 
come up before. I think there is currently no way to differentiate the two.
   
   There is a ticket that tracks this idea too:
   - https://github.com/apache/datafusion/issues/6179
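
To illustrate why the distinction matters, here is a rough sketch (not DataFusion code; all names are hypothetical). A filter keeps a row only when the predicate is TRUE, so a NULL result drops the row just as FALSE does. In a filter, `x < 5 AND NULL` therefore keeps nothing and `x < 5 OR NULL` keeps the same rows as `x < 5`, whereas a projection must still report NULL and FALSE as different values.

```rust
/// `b AND NULL` under three-valued logic; `None` stands for NULL.
pub fn and_null(b: bool) -> Option<bool> {
    if b { None } else { Some(false) }
}

/// `b OR NULL` under three-valued logic.
pub fn or_null(b: bool) -> Option<bool> {
    if b { Some(true) } else { None }
}

/// Filter semantics: a row survives only if the predicate is TRUE.
/// NULL (`None`) and FALSE both discard the row.
pub fn keep_rows(rows: &[i32], pred: impl Fn(i32) -> Option<bool>) -> Vec<i32> {
    rows.iter()
        .copied()
        .filter(|&x| pred(x) == Some(true))
        .collect()
}

/// Projection semantics: the three-valued result is emitted unchanged,
/// so NULL and FALSE remain distinguishable.
pub fn project(rows: &[i32], pred: impl Fn(i32) -> Option<bool>) -> Vec<Option<bool>> {
    rows.iter().copied().map(pred).collect()
}
```

With the rows `1, 2, 3` from the test table, `keep_rows(&[1, 2, 3], |x| and_null(x < 5))` returns no rows, while `project` over the same input yields three NULLs rather than three FALSE values.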


