friendlymatthew opened a new issue, #20871:
URL: https://github.com/apache/datafusion/issues/20871

   Related to the work on struct array handling:
   - #20854 
   - #20822 
   - #20829 
   
   When filtering on struct fields (e.g. `WHERE s['value'] > 5`), DataFusion 
currently cannot prune row groups using Parquet column statistics, even though 
the underlying leaf columns have valid min/max statistics stored in the Parquet 
metadata.
   
   The issue is in the pruning predicate system. When it encounters a 
`GetField` expression like `GetField(Column("s"), "value")`, the column 
extraction logic only sees the parent struct `Column("s")` and doesn't resolve 
through to the nested field.
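
   To illustrate the missing step, here is a minimal sketch of resolving a 
`GetField` chain down to a leaf column path. The `Expr` enum and the 
`leaf_column_path` helper are hypothetical and heavily simplified; they are not 
DataFusion's actual types or API.

```rust
/// Hypothetical, simplified expression tree (NOT DataFusion's real `Expr`).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A top-level column reference, e.g. `s`.
    Column(String),
    /// Access to a named field of a struct-valued expression, e.g. `s['value']`.
    GetField(Box<Expr>, String),
    /// An integer literal, included only to show a non-resolvable case.
    Literal(i64),
}

/// Walks a chain of `GetField` nodes down to the root column and returns the
/// full path (root column first, leaf field last). Returns `None` when the
/// expression is not a plain column or nested field access.
#[allow(dead_code)]
fn leaf_column_path(expr: &Expr) -> Option<Vec<String>> {
    match expr {
        Expr::Column(name) => Some(vec![name.clone()]),
        Expr::GetField(parent, field) => {
            // Resolve the parent first, then append this field name.
            let mut path = leaf_column_path(parent)?;
            path.push(field.clone());
            Some(path)
        }
        Expr::Literal(_) => None,
    }
}
```

   In this sketch, `GetField(Column("s"), "value")` resolves to the path 
`["s", "value"]` rather than stopping at the parent column `s`.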
   
   Fixing this would mean teaching the pruning system to resolve `GetField` 
expressions down to their leaf columns, then look up the corresponding Parquet 
column stats. Note that the stats themselves are already present in the Parquet 
metadata; they're just never consulted for nested field access.
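
   As a rough sketch of what consulting those stats could look like for a 
predicate such as `s['value'] > 5`, the example below keeps only the row groups 
whose leaf-column maximum exceeds the threshold. The `MinMax` and 
`RowGroupStats` types, the dotted key `"s.value"`, and both helper functions 
are assumptions made for illustration; this is not DataFusion's pruning API, 
and it ignores nulls and non-integer types.

```rust
use std::collections::HashMap;

/// Hypothetical min/max statistics for one leaf column in one row group.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct MinMax {
    min: i64,
    max: i64,
}

/// Hypothetical per-row-group statistics, keyed by dotted leaf column path
/// (e.g. "s.value"). A missing key means no statistics are available.
type RowGroupStats = HashMap<String, MinMax>;

/// Returns true if the row group may contain a row where `leaf > threshold`.
/// Without statistics the row group must be kept, since nothing can be proven.
#[allow(dead_code)]
fn may_match_gt(stats: &RowGroupStats, leaf: &str, threshold: i64) -> bool {
    match stats.get(leaf) {
        Some(s) => s.max > threshold,
        None => true,
    }
}

/// Returns the indices of the row groups that still need to be read.
#[allow(dead_code)]
fn row_groups_to_read(groups: &[RowGroupStats], leaf: &str, threshold: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, stats)| may_match_gt(stats, leaf, threshold))
        .map(|(idx, _)| idx)
        .collect()
}
```

   The important design point is the conservative default: a row group is 
skipped only when its statistics prove that no row can match.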
   
   On tables with many row groups, this could significantly reduce the amount 
of data read for struct field predicates.



