kosiew commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994947013
@adriangb Thanks for the ping on this.

> Would it be possible to implement the nested struct imputation work you're doing with this approach?

Do you mean reusing the `PhysicalExprSchemaRewriter` machinery to drive nested-struct imputation? Here are some disadvantages of the rewrite-centric approach versus the more data-centric adapter approach:

- **Expression-only, not data-level**: it never actually transforms the underlying `RecordBatch` columns. If downstream logic (or the user) inspects a struct column directly, they won't see the new null fields injected, so we'd still need array-level imputation for correctness in the result batches.
- **Limited to predicate contexts**: the rewriter hooks into filtering and pruning, but our broader schema-evolution needs (e.g. reading all columns, `SELECT *`, writing out evolved nested structs) fall outside its scope.
- **Duplication risk**: we end up reinventing part of the schema adapter's compatibility logic (matching fields by name, casting types) inside the rewriter, which can drift from the adapter's rules over time.
- **Complexity with deep nesting**: recursively handling deeply nested structs inside an expression tree, and ensuring every nested-field access gets rewritten with the right shape, quickly becomes more intricate than a simple tree visitor.
- **Performance implications**: repeatedly rewriting and reconstructing expression trees (and then evaluating those casts/literals) may be less efficient than bulk array-level casts plus struct rebuilds, especially on wide tables.

So, could we bolt nested-struct imputation onto this rewriter? Technically yes: we could extend `rewrite_column` so that, whenever we see a `Column` referring to `foo.bar.baz` that is missing in the physical schema, we generate a null `Literal` of the full nested type (constructing the proper struct scalar); see the first sketch at the end of this comment. But we'd still need an array-level counterpart to actually materialize those null nested fields in the `RecordBatch` when we call `map_batch`.

In practice, the two approaches complement each other:

- Use the rewriter to handle predicate and projection expressions, so filters and column references don't blow up.
- Continue to rely on `cast_struct_column` + `NestedStructSchemaAdapter` to adapt the actual batch data, filling in null arrays and doing recursive casts.

That way we get the best of both worlds: clean, centralized expression rewriting for pushdown, and robust array-level marshalling for the final batches. 😊

## Why the two-pronged approach makes sense

1. **Pushdown and data adaptation are different concerns.** The `PhysicalExprSchemaRewriter` is perfect for rewriting predicates and projections so they don't blow up when the on-disk schema diverges. But once that Parquet row group has been read into memory, the `StructArray` itself still needs reshaping: filling in null arrays for new nested fields, dropping old ones, recursively casting types.
2. **Keeping pruning code lean.** Swapping out the old `SchemaMapper` for the rewriter in the pruning path is a great win: much less boilerplate, better separation of concerns, and no more "fake record batches" just to evaluate a filter. All of that pruning-specific adapter code can go, leaning instead on the rewriter's tree visitor.
3. **Deep schema evolution still lives in the adapter.** Handling a top-level missing column is easy in the rewriter (just rewrite `col("foo")` to `lit(NULL)`), but handling `col("a.b.c")`, where `b` itself is a new struct and `c` is a new field inside it, is far more natural in a recursive `cast_struct_column` that operates on `StructArray` children (see the second sketch below).
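For concreteness, here is a minimal, hypothetical sketch of the expression-side idea: during a bottom-up rewrite, any column that exists in the table schema but is missing from the physical (file) schema gets replaced with a typed null literal, which may itself be a nested struct null. The helper name `rewrite_missing_columns` and the exact traversal are assumptions for illustration, not the actual `PhysicalExprSchemaRewriter` API, and it only covers top-level columns; rewriting genuine nested-field accesses like `foo.bar.baz` is exactly the part that gets intricate.

```rust
use std::sync::Arc;

use arrow::datatypes::Schema;
use datafusion::common::tree_node::{Transformed, TreeNode};
use datafusion::common::{Result, ScalarValue};
use datafusion::physical_expr::expressions::{Column, Literal};
use datafusion::physical_expr::PhysicalExpr;

/// Illustrative only: replace references to columns that the table schema has
/// but the file schema lacks with a NULL literal of the table-side type.
fn rewrite_missing_columns(
    expr: Arc<dyn PhysicalExpr>,
    table_schema: &Schema,
    physical_schema: &Schema,
) -> Result<Arc<dyn PhysicalExpr>> {
    let rewritten = expr.transform(|e| {
        if let Some(col) = e.as_any().downcast_ref::<Column>() {
            if physical_schema.field_with_name(col.name()).is_err() {
                // The file does not have this column: build a NULL ScalarValue
                // of the *table* type, which may be a (deeply nested) struct.
                let table_type = table_schema.field_with_name(col.name())?.data_type();
                let null = ScalarValue::try_from(table_type)?;
                return Ok(Transformed::yes(
                    Arc::new(Literal::new(null)) as Arc<dyn PhysicalExpr>,
                ));
            }
        }
        Ok(Transformed::no(e))
    })?;
    Ok(rewritten.data)
}
```

This keeps filters evaluable against old files, but note that it never touches the arrays themselves, which is the limitation called out in the list above.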
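And the array-level counterpart: roughly the kind of recursive reshaping that `cast_struct_column` / `NestedStructSchemaAdapter` are responsible for. The helper name `cast_struct_to` and its signature are made up for illustration, assuming standard arrow-rs APIs (`new_null_array`, `compute::cast`, `StructArray::new`):

```rust
use std::sync::Arc;

use arrow::array::{new_null_array, Array, ArrayRef, StructArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Fields};
use arrow::error::ArrowError;

/// Illustrative only: reshape `source` to match `target_fields`, filling
/// brand-new (possibly nested) fields with null arrays and recursing into
/// child structs.
fn cast_struct_to(
    source: &StructArray,
    target_fields: &Fields,
) -> Result<ArrayRef, ArrowError> {
    let len = source.len();
    let mut children: Vec<ArrayRef> = Vec::with_capacity(target_fields.len());

    for field in target_fields.iter() {
        let child = match source.column_by_name(field.name()) {
            // Field added by schema evolution: materialize an all-null child.
            None => new_null_array(field.data_type(), len),
            Some(existing) => match (existing.data_type(), field.data_type()) {
                // Struct on both sides: recurse so newly added grandchildren
                // (the `a.b.c` case) are filled in as well.
                (DataType::Struct(_), DataType::Struct(nested)) => {
                    let nested_src = existing
                        .as_any()
                        .downcast_ref::<StructArray>()
                        .expect("struct child");
                    cast_struct_to(nested_src, nested)?
                }
                // Otherwise fall back to a plain arrow cast.
                _ => cast(existing.as_ref(), field.data_type())?,
            },
        };
        children.push(child);
    }

    Ok(Arc::new(StructArray::new(
        target_fields.clone(),
        children,
        source.nulls().cloned(),
    )))
}
```

Because this rebuilds the actual child arrays, a later `SELECT *` (or anything else that inspects the batch) sees the imputed null fields, which the expression rewrite alone can't provide.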