friendlymatthew opened a new issue, #20871:
URL: https://github.com/apache/datafusion/issues/20871
Related to the work on struct array handling:
- #20854
- #20822
- #20829
When filtering on struct fields (e.g. `WHERE s['value'] > 5`), DataFusion
currently cannot prune row groups using Parquet column statistics, even though
the underlying leaf columns have valid min/max statistics stored in the Parquet
metadata.
The issue is in the pruning predicate system. When it encounters a
`GetField` expression like `GetField(Column("s"), "value")`, the column
extraction logic only sees the parent struct `Column("s")` and doesn't resolve
through to the nested field.
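
To illustrate the behavior described above, here is a minimal Python sketch of a column collector that drops the field path. It is not DataFusion code: `Column`, `GetField`, and `referenced_columns` are hypothetical stand-ins for the real expression types and extraction logic.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a top-level column reference."""
    name: str


@dataclass(frozen=True)
class GetField:
    """Hypothetical stand-in for a struct field access such as s['value']."""
    base: "Expr"
    field: str


Expr = Union[Column, GetField]


def referenced_columns(expr):
    """Collect top-level column names only; the field path is discarded."""
    if isinstance(expr, Column):
        return {expr.name}
    if isinstance(expr, GetField):
        # Recurse into the base expression and ignore expr.field entirely.
        return referenced_columns(expr.base)
    raise TypeError(f"unsupported expression: {expr!r}")


expr = GetField(Column("s"), "value")
columns = referenced_columns(expr)
print(columns)
```

In this sketch only the parent struct `s` is reported, so a statistics lookup keyed on the result has no way to reach the leaf column for `value`.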
Fixing this would mean teaching the pruning system to resolve `GetField`
expressions down to their leaf columns, then look up the corresponding Parquet
column stats. Note that the stats themselves are already there in the Parquet
metadata; they're just never consulted for nested field access.
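
The proposed direction can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the DataFusion implementation: the expression classes, `leaf_path`, `may_match_gt`, and the `row_group_stats` values are all hypothetical, leaf columns are assumed to be addressable by a dotted path such as `s.value`, and null handling is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a top-level column reference."""
    name: str


@dataclass(frozen=True)
class GetField:
    """Hypothetical stand-in for a struct field access such as s['value']."""
    base: object
    field: str


def leaf_path(expr) -> Optional[str]:
    """Resolve nested GetField expressions to a dotted leaf path, if possible."""
    if isinstance(expr, Column):
        return expr.name
    if isinstance(expr, GetField):
        parent = leaf_path(expr.base)
        return None if parent is None else f"{parent}.{expr.field}"
    return None  # anything else cannot be mapped to a leaf column


# Made-up per-row-group statistics: leaf path -> (min, max).
row_group_stats = [
    {"s.value": (0, 4)},
    {"s.value": (3, 9)},
    {"s.value": (6, 12)},
    {},  # statistics missing for this row group
]


def may_match_gt(stats, path, literal):
    """Return False only when the stats prove no row satisfies `path > literal`."""
    if path is None or path not in stats:
        return True  # no usable stats: keep the row group to stay correct
    _min_value, max_value = stats[path]
    return max_value > literal


# WHERE s['value'] > 5
path = leaf_path(GetField(Column("s"), "value"))
kept = [i for i, stats in enumerate(row_group_stats) if may_match_gt(stats, path, 5)]
print(path, kept)
```

The first row group is skipped because its maximum is below the literal. The row group without statistics is kept, since pruning may only drop data that provably cannot match.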
On tables with many row groups, this could significantly reduce the amount
of data read for queries with struct field predicates.