EshwarCVS opened a new pull request, #22059:
URL: https://github.com/apache/datafusion/pull/22059
## Which issue does this PR close?
Closes #<issue>
## Rationale for this change
When a query accesses only a single field of a wide struct column (e.g.
`SELECT event['user_id'] FROM logs`), DataFusion today reads **all** leaf
columns of `event` from Parquet. For structs with many fields, this causes
significant unnecessary I/O.
## What changes are included in this PR?
### Logical optimizer (2-pass pipeline)
* **`ExtractLeafExpressions`** (pass 1): detects `MoveTowardsLeafNodes`
sub-expressions (including `get_field`) inside Filter, Sort, Limit,
Aggregate, and Join nodes and lifts them into named extraction
projections (`__datafusion_extracted_N`) inserted below those nodes.
* **`PushDownLeafProjections`** (pass 2): pushes those extraction
projections further down toward leaf/datasource nodes, merging into
existing projections where possible and routing each expression to the
correct input side of multi-input nodes (Join, Union).
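As a rough illustration of what pass 1 does, the sketch below lifts `get_field` chains out of a predicate and replaces them with references to aliased columns. It is not the PR's code: the `Expr` enum and the `Extractor` type are hypothetical stand-ins for DataFusion's logical expressions, and the assumption that aliases are numbered from 1 and deduplicated by structural equality is made for illustration only.

```rust
use std::collections::HashMap;

/// Toy expression tree: a hypothetical stand-in for DataFusion's `Expr`.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Expr {
    Column(String),
    Literal(i64),
    /// Models `get_field(base, "name")`.
    GetField(Box<Expr>, String),
    Gt(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Collects extracted sub-expressions and hands out one alias per distinct expression.
#[derive(Default)]
struct Extractor {
    aliases: HashMap<Expr, String>,
    /// (alias, expression) pairs in first-seen order: the "extraction projection".
    projection: Vec<(String, Expr)>,
}

impl Extractor {
    /// True when `expr` is a chain of `get_field` calls rooted at a plain column.
    fn is_leaf_access(expr: &Expr) -> bool {
        match expr {
            Expr::GetField(base, _) => match base.as_ref() {
                Expr::Column(_) => true,
                inner => Self::is_leaf_access(inner),
            },
            _ => false,
        }
    }

    /// Rewrites `expr`, replacing each outermost leaf access with a column
    /// reference to its alias. Identical accesses share a single alias.
    fn rewrite(&mut self, expr: &Expr) -> Expr {
        if Self::is_leaf_access(expr) {
            if let Some(alias) = self.aliases.get(expr) {
                return Expr::Column(alias.clone());
            }
            // Assumption: N starts at 1 and increases in first-seen order.
            let alias = format!("__datafusion_extracted_{}", self.projection.len() + 1);
            self.aliases.insert(expr.clone(), alias.clone());
            self.projection.push((alias.clone(), expr.clone()));
            return Expr::Column(alias);
        }
        match expr {
            Expr::Gt(l, r) => Expr::Gt(Box::new(self.rewrite(l)), Box::new(self.rewrite(r))),
            Expr::And(l, r) => Expr::And(Box::new(self.rewrite(l)), Box::new(self.rewrite(r))),
            Expr::GetField(base, name) => {
                Expr::GetField(Box::new(self.rewrite(base)), name.clone())
            }
            other => other.clone(),
        }
    }
}
```

After rewriting a filter predicate this way, `projection` holds the expressions that would go into a projection inserted below the filter, and the rewritten predicate refers only to plain columns. Pass 2 would then be free to move that projection toward the data source.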
### Physical layer — Parquet leaf-column projection
* **`PushdownChecker` / `StructFieldAccess`**: extended the filter
pushdown visitor to recognise `get_field(Column, "field1", "field2", …)`
patterns and record a `StructFieldAccess { root_index, field_path }`
instead of requiring the entire struct column.
* **`resolve_struct_field_leaves`**: maps each `StructFieldAccess` to the
exact Parquet leaf column indices by prefix-matching against
`SchemaDescriptor`, enabling `ProjectionMask::leaves()` instead of
`ProjectionMask::roots()`.
* **`build_filter_schema` / `prune_struct_type`**: constructs a narrowed
Arrow schema that matches what the Parquet reader actually produces when
projecting specific struct leaves, so `reassign_expr_columns` can
correctly remap filter expressions.
* **`build_projection_read_plan`**: unified entry point (used by
`opener.rs`) that builds a leaf-level `ProjectionMask` from the physical
projection expressions, with fast paths for all-plain-column and
no-struct-column schemas.
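The leaf-resolution step can be pictured with a small, dependency-free sketch. It is an assumption-laden model rather than the PR's implementation: `Leaf` is a hypothetical stand-in for the per-leaf path information that `SchemaDescriptor` exposes, `StructFieldAccess` mirrors only the two fields named above, and `resolve_leaves` is an illustrative name. The returned indices are the kind of input that `ProjectionMask::leaves()` accepts.

```rust
/// Mirrors the two fields the PR names on `StructFieldAccess`.
struct StructFieldAccess {
    root_index: usize,
    field_path: Vec<String>,
}

/// Hypothetical model of one Parquet leaf column: the root column it belongs
/// to and its path components below that root.
struct Leaf {
    root_index: usize,
    path: Vec<String>,
}

/// Convenience constructor used to build example schemas.
fn leaf(root_index: usize, path: &[&str]) -> Leaf {
    Leaf {
        root_index,
        path: path.iter().map(|s| s.to_string()).collect(),
    }
}

/// Returns the indices of all leaves covered by at least one access.
/// A leaf is covered when it has the same root and its path starts with the
/// access's field path, compared component by component.
fn resolve_leaves(leaves: &[Leaf], accesses: &[StructFieldAccess]) -> Vec<usize> {
    leaves
        .iter()
        .enumerate()
        .filter(|(_, l)| {
            accesses
                .iter()
                .any(|a| a.root_index == l.root_index && l.path.starts_with(&a.field_path))
        })
        .map(|(i, _)| i)
        .collect() // ascending and duplicate-free, since each leaf is visited once
}
```

Matching on path components rather than on a dotted string avoids false positives such as a `geo` access selecting a sibling field named `geography`. An access with an empty `field_path` selects every leaf under its root, which corresponds to reading the whole struct.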
### Physical optimizer wiring
`remove_unnecessary_projections` / `try_swapping_with_projection` already
merge `ProjectionExec` nodes (including those containing `get_field`)
into `DataSourceExec` via `ParquetSource::try_pushdown_projection` →
`try_merge`. The new physical-layer code ensures the merged expressions
are then used to build a narrow `ProjectionMask`.
## How are these changes tested?
* **Unit tests** in `row_filter.rs`: 10 new tests covering
`get_field` pushdown allowance/denial, correct Parquet leaf index
selection for simple and deeply-nested structs, end-to-end row
filtering, and the projection-preserves-full-struct invariant.
* **Optimizer unit tests** in `extract_leaf_expressions.rs`: 47 tests
covering extraction from Filter/Sort/Limit/Aggregate/Join/Union/
SubqueryAlias, deduplication, CSE interaction, and recovery projection
correctness.
* **SQL logic tests** in `projection_pushdown.slt`: end-to-end queries
covering basic field access, filter pushdown, sort/TopK, multi-partition,
joins, aggregation, nullable structs, `SELECT *` with struct field
filters, and edge cases (Map columns, non-literal field names).