alamb opened a new issue, #15689:
URL: https://github.com/apache/datafusion/issues/15689

   ### Describe the bug
   
   As @xudong963 mentions in
   - https://github.com/xudong963/arrow-datafusion/pull/5#discussion_r2034641672
   
   And also brought up again in 
   - https://github.com/apache/datafusion/pull/15661
   
   When `table_schema` differs from `file_schema`, the current statistics merging code merges statistics incorrectly.
   
   Specifically, it merges column statistics based on their ordinal position (their order in the file) rather than by logical column.
   
   Currently this isn't a huge problem, as the statistics are only used in a limited way for some optimizations. However, as we start to rely on statistics for correctness, such as in https://github.com/apache/datafusion/issues/6672, it becomes more important.
   
   ### To Reproduce
   
   If we have two files:
   * File 1: `(a int32, b int32)`
   * File 2: `(b int32, a int32)`
   
   I think the code on main will combine the statistics for column `a` in File 1 and column `b` in File 2 together.
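
To make the failure mode concrete, here is a minimal, self-contained sketch of a position-based merge. It does not use DataFusion's real types: `ColStats`, `merge`, and `merge_by_position` are hypothetical stand-ins invented for illustration, and the min/max values are made up.

```rust
/// Hypothetical, simplified stand-in for per-column statistics
/// (not DataFusion's actual `ColumnStatistics`).
#[derive(Debug, Clone, PartialEq)]
struct ColStats {
    min: Option<i32>,
    max: Option<i32>,
}

/// Combine two statistics by widening the min/max range.
/// If either side is unknown (`None`), the result is unknown.
fn merge(a: &ColStats, b: &ColStats) -> ColStats {
    ColStats {
        min: match (a.min, b.min) {
            (Some(x), Some(y)) => Some(x.min(y)),
            _ => None,
        },
        max: match (a.max, b.max) {
            (Some(x), Some(y)) => Some(x.max(y)),
            _ => None,
        },
    }
}

/// Merge per-file statistics purely by position, ignoring column names.
/// This mimics the behavior described above and is intentionally wrong
/// when the files list their columns in different orders.
fn merge_by_position(
    file1: &[(&str, ColStats)],
    file2: &[(&str, ColStats)],
) -> Vec<ColStats> {
    file1
        .iter()
        .zip(file2.iter())
        .map(|((_, a), (_, b))| merge(a, b))
        .collect()
}
```

With File 1 as `(a: [1, 10], b: [100, 200])` and File 2 as `(b: [300, 400], a: [5, 20])`, the first merged entry in this sketch covers `[1, 400]`, which mixes `a` from File 1 with `b` from File 2 instead of producing `[1, 20]` for `a`.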
   
   
   
   ### Expected behavior
   
   I expect that only statistics from the same logical column are merged 
together. 
   
   
   
   
   ### Additional context
   
   After https://github.com/apache/datafusion/pull/15661 is merged, I suggest:
   1. Adding some function that knows how to map columns from a file schema --> table schema (filling in any missing columns with `ColumnStatistics::new_unknown`) before combining them
   2. Adding tests
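
The mapping step suggested in item 1 could look roughly like the sketch below. This is only an illustration under simplifying assumptions: `ColStats` is a hypothetical stand-in for DataFusion's column statistics, `map_to_table_schema` and `merge` are invented names, and columns are matched by name only (no type casting or nested fields).

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for per-column statistics.
#[derive(Debug, Clone, PartialEq)]
struct ColStats {
    min: Option<i32>,
    max: Option<i32>,
}

impl ColStats {
    /// Statistics for a column the file knows nothing about.
    fn new_unknown() -> Self {
        ColStats { min: None, max: None }
    }
}

/// Reorder a file's statistics into table-schema order, matching by
/// column name and filling columns missing from the file with unknown.
fn map_to_table_schema(
    table_columns: &[&str],
    file_stats: &[(&str, ColStats)],
) -> Vec<ColStats> {
    let by_name: HashMap<&str, &ColStats> =
        file_stats.iter().map(|(name, stats)| (*name, stats)).collect();
    table_columns
        .iter()
        .map(|name| {
            by_name
                .get(name)
                .map(|stats| (*stats).clone())
                .unwrap_or_else(ColStats::new_unknown)
        })
        .collect()
}

/// Combine two statistics; an unknown side makes the result unknown.
fn merge(a: &ColStats, b: &ColStats) -> ColStats {
    ColStats {
        min: match (a.min, b.min) {
            (Some(x), Some(y)) => Some(x.min(y)),
            _ => None,
        },
        max: match (a.max, b.max) {
            (Some(x), Some(y)) => Some(x.max(y)),
            _ => None,
        },
    }
}
```

Once every file's statistics are in table-schema order, a positional merge is safe again because position and logical column coincide.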
   
   Maybe we can simply reuse the existing [`SchemaMapper`](https://docs.rs/datafusion/latest/datafusion/datasource/schema_adapter/trait.SchemaMapper.html) / factory 🤔 so we are sure the statistics merging is consistent with the runtime behavior


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

