m09526 opened a new issue, #18214: URL: https://github.com/apache/datafusion/issues/18214
### Describe the bug

As of DataFusion v49.0.0, the PushDownFilter optimizer rule attempts to simplify redundant filter conditions on a column. For example, filtering on `a > 10 AND a > 20` can be simplified to `a > 20`. This was introduced in [#16362](https://github.com/apache/datafusion/pull/16362).

However, the case where two filters have the same bounds but different relational operators is handled incorrectly. In the example below, the conditions `a > 1 AND a < 10 AND a >= 1 AND a <= 10` are incorrectly reduced to `a >= 1 AND a <= 10`. Replacing `>` and `<` with `>=` and `<=` makes the filter match more rows than it should.

Note: We found that the order of the filter lines affects whether the bug occurs. If the two filter lines are swapped, the correct filter `a > 1 AND a < 10` is produced.

### To Reproduce

With DataFusion v50.2.0, the following code:

```
use datafusion::{error::DataFusionError, prelude::*};

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let df = dataframe!["a" => [1,2,3,4], "b" => ["1","2","3","4"]]?;
    // Strict bounds first: 1 < a < 10
    let df = df.filter(col("a").gt(lit(1)).and(col("a").lt(lit(10))))?;
    // Then inclusive bounds on the same values: 1 <= a <= 10
    let df = df.filter(col("a").gt_eq(lit(1)).and(col("a").lt_eq(lit(10))))?;
    let t = df.explain(false, false)?;
    t.show().await?;
    Ok(())
}
```

produces the following incorrect output:

```
+---------------+----------------------------------------------------------+
| plan_type     | plan                                                     |
+---------------+----------------------------------------------------------+
| logical_plan  | Filter: ?table?.a >= Int32(1) AND ?table?.a <= Int32(10) |
|               |   TableScan: ?table? projection=[a, b]                   |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192              |
|               |   FilterExec: a@0 >= 1 AND a@0 <= 10                     |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]    |
|               |                                                          |
+---------------+----------------------------------------------------------+
```

### Expected behavior

The correct filter conditions would be:

```
+---------------+----------------------------------------------------------+
| plan_type     | plan                                                     |
+---------------+----------------------------------------------------------+
| logical_plan  | Filter: ?table?.a > Int32(1) AND ?table?.a < Int32(10)   |
|               |   TableScan: ?table? projection=[a, b]                   |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192              |
|               |   FilterExec: a@0 > 1 AND a@0 < 10                       |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]    |
|               |                                                          |
+---------------+----------------------------------------------------------+
```

### Additional context

In DataFusion v48 and below, we see the following output:

```
+---------------+-------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                        |
+---------------+-------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: ?table?.a >= Int32(1) AND ?table?.a <= Int32(10) AND ?table?.a > Int32(1) AND ?table?.a < Int32(10) |
|               |   TableScan: ?table? projection=[a, b]                                                                      |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                 |
|               |   FilterExec: a@0 >= 1 AND a@0 <= 10 AND a@0 > 1 AND a@0 < 10                                               |
|               |     DataSourceExec: partitions=1, partition_sizes=[1]                                                       |
|               |                                                                                                             |
+---------------+-------------------------------------------------------------------------------------------------------------+
```

The filter conditions are correct, if redundant.
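For illustration only, the tie-breaking behavior this report expects can be sketched in plain Rust, independent of DataFusion. When two bounds on the same column share a value, the merged bound should be inclusive only if both inputs are inclusive; otherwise the strict operator wins. The `Bound` struct and the `tighter_lower` / `tighter_upper` helpers are hypothetical names made up for this sketch and are not DataFusion's actual implementation.

```rust
/// One side of a range predicate on a single column (hypothetical type).
/// `a > 1` is `Bound { value: 1, inclusive: false }`,
/// `a >= 1` is `Bound { value: 1, inclusive: true }`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Bound {
    value: i64,
    inclusive: bool,
}

/// Combine two lower bounds joined by AND: keep the larger value.
/// On equal values the result is inclusive only if both inputs are.
fn tighter_lower(l: Bound, r: Bound) -> Bound {
    if l.value != r.value {
        if l.value > r.value { l } else { r }
    } else {
        Bound { value: l.value, inclusive: l.inclusive && r.inclusive }
    }
}

/// Combine two upper bounds joined by AND: keep the smaller value.
/// On equal values the result is inclusive only if both inputs are.
fn tighter_upper(l: Bound, r: Bound) -> Bound {
    if l.value != r.value {
        if l.value < r.value { l } else { r }
    } else {
        Bound { value: l.value, inclusive: l.inclusive && r.inclusive }
    }
}

fn main() {
    // a > 1 AND a >= 1 should reduce to a > 1, in either order.
    let gt_1 = Bound { value: 1, inclusive: false };
    let ge_1 = Bound { value: 1, inclusive: true };
    println!("{:?}", tighter_lower(gt_1, ge_1));
    println!("{:?}", tighter_lower(ge_1, gt_1));

    // a < 10 AND a <= 10 should reduce to a < 10, in either order.
    let lt_10 = Bound { value: 10, inclusive: false };
    let le_10 = Bound { value: 10, inclusive: true };
    println!("{:?}", tighter_upper(lt_10, le_10));
    println!("{:?}", tighter_upper(le_10, lt_10));
}
```

Because the equal-value case combines the two `inclusive` flags with `&&`, the result does not depend on argument order, which is the property the reproduction above shows to be violated.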
