m09526 opened a new issue, #18214:
URL: https://github.com/apache/datafusion/issues/18214

   ### Describe the bug
   
   As of DataFusion v49.0.0, the PushDownFilter optimizer rule attempts to 
simplify and optimize redundant filter conditions on a column. For example 
filtering on "a > 10 AND a > 20" can be simpified to "a > 20". This was 
introduced in [#16362](https://github.com/apache/datafusion/pull/16362).
   
   However, the case when two filters with the same bounds but different 
relational operators is handled incorrectly. 
   
   In the example below, the conditions `a > 1 AND a < 10 AND a >=1 AND a <= 
10` are reduced incorrectly to `a >= 1 AND a <= 10`. Converting the < and > to 
<= and >= incorrectly match more rows than they should.
   
   Note: We found that changing the ordering of the filter lines affects 
whether the bug occurs. If the two filtering lines are swapped, the correct 
filter of `a > 1 AND a < 10` is produced.
   
   ### To Reproduce
   
   With DataFusion V50.2.0, the following code:
   ```
   use datafusion::{error::DataFusionError, prelude::*};
   
   #[tokio::main]
   async fn main() -> Result<(), DataFusionError> {
       let df = dataframe!["a" => [1,2,3,4], "b" => ["1","2","3","4"]]?;
       let df = df.filter(col("a").gt(lit(1)).and(col("a").lt(lit(10))))?;
       let df = df.filter(col("a").gt_eq(lit(1)).and(col("a").lt_eq(lit(10))))?;
       let t = df.explain(false, false)?;
       t.show().await?;
       Ok(())
   }
   ```
   
   produces the following incorrect output:
   ```
   +---------------+----------------------------------------------------------+
   | plan_type     | plan                                                     |
   +---------------+----------------------------------------------------------+
   | logical_plan  | Filter: ?table?.a >= Int32(1) AND ?table?.a <= Int32(10) |
   |               |   TableScan: ?table? projection=[a, b]                   |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192              |
   |               |   FilterExec: a@0 >= 1 AND a@0 <= 10                     |
   |               |     DataSourceExec: partitions=1, partition_sizes=[1]    |
   |               |                                                          |
   +---------------+----------------------------------------------------------+
   ```
   
   ### Expected behavior
   
   The correct filter conditions would be:
   ```
   +---------------+----------------------------------------------------------+
   | plan_type     | plan                                                     |
   +---------------+----------------------------------------------------------+
   | logical_plan  | Filter: ?table?.a > Int32(1) AND ?table?.a < Int32(10) |
   |               |   TableScan: ?table? projection=[a, b]                   |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192              |
   |               |   FilterExec: a@0 >=1 AND a@0 < 10                     |
   |               |     DataSourceExec: partitions=1, partition_sizes=[1]    |
   |               |                                                          |
   +---------------+----------------------------------------------------------+
   ```
   
   ### Additional context
   
   In DataFusion V48 and below, we see the following output:
   ```
   
+---------------+-------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                 |
   
+---------------+-------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Filter: ?table?.a >= Int32(1) AND ?table?.a <= Int32(10) 
AND ?table?.a > Int32(1) AND ?table?.a < Int32(10) |
   |               |   TableScan: ?table? projection=[a, b]                     
                                                 |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192                
                                                 |
   |               |   FilterExec: a@0 >= 1 AND a@0 <= 10 AND a@0 > 1 AND a@0 < 
10                                               |
   |               |     DataSourceExec: partitions=1, partition_sizes=[1]      
                                                 |
   |               |                                                            
                                                 |
   
+---------------+-------------------------------------------------------------------------------------------------------------+
   ```
   
   The filtering conditions are correct, if redundant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to