kosiew opened a new issue, #23156: URL: https://github.com/apache/datafusion/issues/23156
## Summary Parquet row-filter planning currently handles struct field access paths in multiple independent shapes. `build_filter_schema` / `prune_struct_type` now group struct paths while constructing the projected Arrow schema, but `resolve_struct_field_leaves` still handles each `StructFieldAccess` separately by building a prefix and scanning every Parquet leaf column. Introduce a reusable struct-access path tree (trie) for row-filter planning, and use it to drive both: 1. Parquet projection-mask leaf selection. 2. Arrow projected-schema pruning. This would encode the core invariant once: the set of struct paths required by a row-filter must produce a projection mask and projected schema that describe the same pruned struct shape. ## Current behavior Relevant file: `datafusion/datasource-parquet/src/row_filter.rs` [Current flow](https://github.com/apache/datafusion/tree/994b926168033e06c4b379b7bac7667c958cfc1b): - `PushdownChecker` collects flat `StructFieldAccess { root_index, field_path }` entries. - `resolve_struct_field_leaves` iterates every access path, builds a string prefix, then scans all Parquet leaf columns to find matching paths. - `build_filter_schema` groups paths by root column. - `prune_struct_type` groups paths by next struct-field name at each schema level. After #22960, schema pruning avoids repeated scans, but leaf resolution still has the older flat-path scan pattern. ## Problem The same logical information — required struct access paths — is interpreted in separate places with separate algorithms: - Projection-mask construction scans Parquet leaves per access path. - Projected-schema construction recursively groups paths by Arrow struct fields. This causes avoidable repeated work for predicates with many struct accesses, especially with shared nested prefixes, for example: ```sql WHERE s['outer']['a'] > 10 AND s['outer']['b'] < 20 AND s['outer']['inner']['c'] IS NOT NULL ``` It also increases maintenance risk: projection-mask selection and schema pruning can drift, even though they must stay contract-compatible for row-filter evaluation. ## Proposed refactor Create a private local representation such as: ```rust struct StructAccessTree { roots: BTreeMap<usize, StructAccessNode>, } struct StructAccessNode { children: BTreeMap<String, StructAccessNode>, selected_here: bool, } ``` Exact shape can differ; key goal is to represent shared prefixes once. Use the tree to: 1. Build grouped paths once from `Vec<StructFieldAccess>`. 2. Resolve Parquet leaves by walking/comparing against the tree instead of scanning all leaves for each access path. 3. Prune Arrow `DataType::Struct` fields by walking the same tree. 4. Keep whole-column references as an override: if a root appears in `regular_indices`, preserve the full field type and include all root leaves. ## Expected benefits - Less repeated iterator work and fewer temporary path/prefix allocations. - One shared abstraction for the row-filter struct-access invariant. - Lower risk that projection mask and projected schema disagree. - Easier tests for nested shared-prefix access patterns. - Cleaner future extensions for struct/map/list pruning behavior. ## Correctness invariant For any pushed-down filter expression with struct field accesses: - The Parquet projection mask must include exactly the leaf columns needed to evaluate those accesses. - The projected Arrow schema passed to row-filter evaluation must match the schema produced by the Parquet reader for that projection. - If a struct root is also referenced as a whole column, pruning must be disabled for that root and the full struct must be preserved. ## Suggested implementation steps 1. Add a private helper in `row_filter.rs` to build the access tree from `&[StructFieldAccess]`. 2. Replace `paths_by_root` / `paths_by_field` ad hoc grouping with tree traversal in `build_filter_schema` and `prune_struct_type`. 3. Update `resolve_struct_field_leaves` to use the same tree when matching Parquet column paths. 4. Keep behavior unchanged for non-struct columns, missing columns, map columns, and list predicates. 5. Add focused tests for shared-prefix nested struct paths. ## Test coverage to add Add unit tests in `datafusion/datasource-parquet/src/row_filter.rs` covering: 1. Multiple fields under the same struct root: - `s['a']` - `s['b']` - projection mask includes only `s.a` and `s.b` leaves. - projected schema is `Struct{a, b}`. 2. Multiple nested paths sharing a prefix: - `s['outer']['a']` - `s['outer']['b']` - projection mask includes only those nested leaves. - projected schema is `Struct{outer: Struct{a, b}}`. 3. Mixed whole-root and nested access: - `s` plus `s['outer']['a']`. - full `s` schema and all `s` leaves are preserved. 4. End-to-end row-filter evaluation for nested shared-prefix predicates, asserting returned row count and pushdown metrics. ## Scope Local to `datafusion/datasource-parquet/src/row_filter.rs` unless a more general nested-path helper already exists and is clearly reusable. Avoid widening public API. Keep helper private unless another module has an immediate, concrete need. ## Related PR #22960 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
