etseidl opened a new issue, #15742:
URL: https://github.com/apache/datafusion/issues/15742

   ### Is your feature request related to a problem or challenge?
   
   I've been working on implementing a new `ColumnOrder` for floating point 
columns in Parquet (https://github.com/apache/arrow-rs/pull/7408), and while 
investigating how to use the new statistics in Datafusion, I found an 
interesting quirk.
   
   I'm looking at explain plans for some queries that should be able to use 
statistics to prune pages. For example:
   ```sql
   > explain select int_col from 
'parquet-testing/data/alltypes_tiny_pages.parquet' where int_col > 10;
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                           |
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Filter: 
parquet-testing/data/alltypes_tiny_pages.parquet.int_col > Int32(10)            
                                                                                
                                                                                
                                              |
   |               |   TableScan: 
parquet-testing/data/alltypes_tiny_pages.parquet projection=[int_col], 
partial_filters=[parquet-testing/data/alltypes_tiny_pages.parquet.int_col > 
Int32(10)]                                                                      
                                                      |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192                
                                                                                
                                                                                
                                                                           |
   |               |   FilterExec: int_col@0 > 10                               
                                                                                
                                                                                
                                                                           |
   |               |     RepartitionExec: partitioning=RoundRobinBatch(8), 
input_partitions=1                                                              
                                                                                
                                                                                
|
   |               |       DataSourceExec: file_groups={1 group: 
[[Users/seidl/src/datafusion/parquet-testing/data/alltypes_tiny_pages.parquet]]},
 projection=[int_col], file_type=parquet, predicate=int_col@4 > 10, 
pruning_predicate=int_col_null_count@1 != row_count@2 AND int_col_max@0 > 10, 
required_guarantees=[] |
   ```
   will use the column and page statistics (`pruning_predicate` is populated). 
However, when I tried a similar plan for a `float` column, I was surprised to 
see that no pruning is done.
   ```sql
   > explain select float_col from 
'parquet-testing/data/alltypes_tiny_pages.parquet' where float_col > 10.0;
   
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                          |
   
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Filter: 
CAST(parquet-testing/data/alltypes_tiny_pages.parquet.float_col AS Float64) > 
Float64(10)                                                                     
                                               |
   |               |   TableScan: 
parquet-testing/data/alltypes_tiny_pages.parquet projection=[float_col], 
partial_filters=[CAST(parquet-testing/data/alltypes_tiny_pages.parquet.float_col
 AS Float64) > Float64(10)]                    |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192                
                                                                                
                                                                          |
   |               |   FilterExec: CAST(float_col@0 AS Float64) > 10            
                                                                                
                                                                          |
   |               |     RepartitionExec: partitioning=RoundRobinBatch(8), 
input_partitions=1                                                              
                                                                               |
   |               |       DataSourceExec: file_groups={1 group: 
[[Users/seidl/src/datafusion/parquet-testing/data/alltypes_tiny_pages.parquet]]},
 projection=[float_col], file_type=parquet, predicate=CAST(float_col@6 AS 
Float64) > 10 |
   |               |      
   ```
   But, using a `double` column will:
   ```sql
   > explain select double_col from 
'parquet-testing/data/alltypes_tiny_pages.parquet' where double_col > 10.0;
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
       |
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Filter: 
parquet-testing/data/alltypes_tiny_pages.parquet.double_col > Float64(10)       
                                                                                
                                                                                
                                                          |
   |               |   TableScan: 
parquet-testing/data/alltypes_tiny_pages.parquet projection=[double_col], 
partial_filters=[parquet-testing/data/alltypes_tiny_pages.parquet.double_col > 
Float64(10)]                                                                    
                                                            |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192                
                                                                                
                                                                                
                                                                                
       |
   |               |   FilterExec: double_col@0 > 10                            
                                                                                
                                                                                
                                                                                
       |
   |               |     RepartitionExec: partitioning=RoundRobinBatch(8), 
input_partitions=1                                                              
                                                                                
                                                                                
            |
   |               |       DataSourceExec: file_groups={1 group: 
[[Users/seidl/src/datafusion/parquet-testing/data/alltypes_tiny_pages.parquet]]},
 projection=[double_col], file_type=parquet, predicate=double_col@7 > 10, 
pruning_predicate=double_col_null_count@1 != row_count@2 AND double_col_max@0 > 
10, required_guarantees=[] |
   ```
   
   Digging into the code, it seems the problem with `float` lies in a) the 
literal `10.0` is treated as a `double` so b) `float_col` is cast to `Float64`, 
which is disallowed in 
`datafusion_physical_optimizer::pruning::verify_support_type_for_prune()`. 
   
   Interestingly, using an `int` literal works:
   ```sql
   > explain select float_col from 
'parquet-testing/data/alltypes_tiny_pages.parquet' where float_col > 10;
   
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
   |
   
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Filter: 
parquet-testing/data/alltypes_tiny_pages.parquet.float_col > Float32(10)        
                                                                                
                                                                                
                                                      |
   |               |   TableScan: 
parquet-testing/data/alltypes_tiny_pages.parquet projection=[float_col], 
partial_filters=[parquet-testing/data/alltypes_tiny_pages.parquet.float_col > 
Float32(10)]                                                                    
                                                          |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192                
                                                                                
                                                                                
                                                                                
   |
   |               |   FilterExec: float_col@0 > 10                             
                                                                                
                                                                                
                                                                                
   |
   |               |     RepartitionExec: partitioning=RoundRobinBatch(8), 
input_partitions=1                                                              
                                                                                
                                                                                
        |
   |               |       DataSourceExec: file_groups={1 group: 
[[Users/seidl/src/datafusion/parquet-testing/data/alltypes_tiny_pages.parquet]]},
 projection=[float_col], file_type=parquet, predicate=float_col@6 > 10, 
pruning_predicate=float_col_null_count@1 != row_count@2 AND float_col_max@0 > 
10, required_guarantees=[] |
   ```
   as does an explicit cast
   ```sql
   > explain select float_col from 
'parquet-testing/data/alltypes_tiny_pages.parquet' where float_col > cast(10.0 
as float);
   
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
   |
   
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Filter: 
parquet-testing/data/alltypes_tiny_pages.parquet.float_col > Float32(10)        
                                                                                
                                                                                
                                                      |
   |               |   TableScan: 
parquet-testing/data/alltypes_tiny_pages.parquet projection=[float_col], 
partial_filters=[parquet-testing/data/alltypes_tiny_pages.parquet.float_col > 
Float32(10)]                                                                    
                                                          |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192                
                                                                                
                                                                                
                                                                                
   |
   |               |   FilterExec: float_col@0 > 10                             
                                                                                
                                                                                
                                                                                
   |
   |               |     RepartitionExec: partitioning=RoundRobinBatch(8), 
input_partitions=1                                                              
                                                                                
                                                                                
        |
   |               |       DataSourceExec: file_groups={1 group: 
[[Users/seidl/src/datafusion/parquet-testing/data/alltypes_tiny_pages.parquet]]},
 projection=[float_col], file_type=parquet, predicate=float_col@6 > 10, 
pruning_predicate=float_col_null_count@1 != row_count@2 AND float_col_max@0 > 
10, required_guarantees=[] |
   ```
   
   ### Describe the solution you'd like
   
   It would be nice if Datafusion always used statistics for floating point 
columns if they are available. One potential fix is to add more cases to 
`verify_support_type_for_prune` 
(https://github.com/apache/datafusion/blob/42a45d1ff1e6d7d5d33f0b9b858e6d0d51ce8208/datafusion/physical-optimizer/src/pruning.rs#L1217-L1229).
 Another would be to cast floating point literals to a more appropriate type 
(as it appears is done when casting `int` literals to `float`). I'd be happy to 
attempt the former, but I'd need pointing to where to attempt the latter 😅.
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to