EshwarCVS opened a new pull request, #22059:
URL: https://github.com/apache/datafusion/pull/22059

   ## Which issue does this PR close?
   
   Closes #<issue>
   
   ## Rationale for this change
   
   When a query accesses only a single field of a wide struct column (e.g.
   `SELECT event['user_id'] FROM logs`), DataFusion today reads **all** leaf
   columns of `event` from Parquet. For structs with many fields this incurs
   significant unnecessary I/O.
   
   ## What changes are included in this PR?
   
   ### Logical optimizer (2-pass pipeline)
   
   * **`ExtractLeafExpressions`** (pass 1): detects `MoveTowardsLeafNodes`
     sub-expressions (including `get_field`) inside Filter, Sort, Limit,
     Aggregate, and Join nodes and lifts them into named extraction
     projections (`__datafusion_extracted_N`) inserted below those nodes.
   * **`PushDownLeafProjections`** (pass 2): pushes those extraction
     projections further down toward leaf/datasource nodes, merging into
     existing projections where possible and routing each expression to the
     correct input side of multi-input nodes (Join, Union).
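   To illustrate the two passes, here are hypothetical plan shapes
   (hand-written for illustration, not actual `EXPLAIN` output) for a query
   like `SELECT count(*) FROM logs WHERE event['user_id'] > 5`:

   ```text
   -- Original plan: the Filter references the whole struct column
   Filter: get_field(logs.event, 'user_id') > 5
     TableScan: logs

   -- After pass 1 (ExtractLeafExpressions): the get_field is lifted into a
   -- named extraction projection inserted below the Filter
   Filter: __datafusion_extracted_0 > 5
     Projection: get_field(logs.event, 'user_id') AS __datafusion_extracted_0, ...
       TableScan: logs

   -- After pass 2 (PushDownLeafProjections): the extraction projection is
   -- pushed to the datasource, so the scan only needs the one leaf column
   Filter: __datafusion_extracted_0 > 5
     TableScan: logs, projection=[get_field(event, 'user_id') AS __datafusion_extracted_0]
   ```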
   
   ### Physical layer — Parquet leaf-column projection
   
   * **`PushdownChecker` / `StructFieldAccess`**: extended the filter
     pushdown visitor to recognise `get_field(Column, "field1", "field2", …)`
     patterns and record a `StructFieldAccess { root_index, field_path }`
     instead of requiring the entire struct column.
   * **`resolve_struct_field_leaves`**: maps each `StructFieldAccess` to the
     exact Parquet leaf column indices by prefix-matching against
     `SchemaDescriptor`, enabling `ProjectionMask::leaves()` instead of
     `ProjectionMask::roots()`.
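   As a rough illustration of the prefix-matching step, here is a
   self-contained sketch (toy names and signatures, not DataFusion's actual
   `resolve_struct_field_leaves` API) that maps a `root.field1.field2…`
   access path to the matching Parquet leaf indices by comparing it against
   dotted leaf paths in schema order:

   ```rust
   /// Given the dotted paths of all Parquet leaf columns in schema order
   /// (as one could flatten from a `SchemaDescriptor`), return the leaf
   /// indices selected by a struct field access path.
   fn resolve_leaves(leaf_paths: &[&str], field_path: &[&str]) -> Vec<usize> {
       let prefix = field_path.join(".");
       let mut out = Vec::new();
       for (i, p) in leaf_paths.iter().enumerate() {
           // A leaf matches if it *is* the requested path, or is nested
           // somewhere underneath it (prefix followed by another segment).
           if *p == prefix || p.starts_with(&format!("{prefix}.")) {
               out.push(i);
           }
       }
       out
   }

   fn main() {
       // Leaf columns of: event { user_id, geo { lat, lon } }, ts
       let leaves = ["event.user_id", "event.geo.lat", "event.geo.lon", "ts"];
       // event['user_id'] selects a single leaf
       assert_eq!(resolve_leaves(&leaves, &["event", "user_id"]), vec![0]);
       // event['geo'] selects every leaf under the nested struct
       assert_eq!(resolve_leaves(&leaves, &["event", "geo"]), vec![1, 2]);
       println!("ok");
   }
   ```

   The resulting indices are exactly what a leaf-level projection mask
   (e.g. `ProjectionMask::leaves()`) needs, whereas a root-level mask could
   only select all of `event` or none of it.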
   * **`build_filter_schema` / `prune_struct_type`**: constructs a narrowed
     Arrow schema that matches what the Parquet reader actually produces when
     projecting specific struct leaves, so `reassign_expr_columns` can
     correctly remap filter expressions.
   * **`build_projection_read_plan`**: unified entry point (used by
     `opener.rs`) that builds a leaf-level `ProjectionMask` from the physical
     projection expressions, with fast paths for all-plain-column and
     no-struct-column schemas.
   
   ### Physical optimizer wiring
   
   The existing `remove_unnecessary_projections` / `try_swapping_with_projection`
   pass already merges `ProjectionExec` nodes (including those containing
   `get_field`) into `DataSourceExec` via `ParquetSource::try_pushdown_projection`
   → `try_merge`. The new physical-layer code ensures that the merged
   expressions are then used to build a narrow `ProjectionMask`.
   
   ## How are these changes tested?
   
   * **Unit tests** in `row_filter.rs`: 10 new tests covering
     `get_field` pushdown allowance/denial, correct Parquet leaf index
     selection for simple and deeply-nested structs, end-to-end row
     filtering, and the projection-preserves-full-struct invariant.
   * **Optimizer unit tests** in `extract_leaf_expressions.rs`: 47 tests
     covering extraction from Filter/Sort/Limit/Aggregate/Join/Union/
     SubqueryAlias, deduplication, CSE interaction, and recovery projection
     correctness.
   * **SQL logic tests** in `projection_pushdown.slt`: end-to-end queries
     covering basic field access, filter pushdown, sort/TopK, multi-partition,
     joins, aggregation, nullable structs, `SELECT *` with struct field
     filters, and edge cases (Map columns, non-literal field names).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
