kosiew commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994947013
@adriangb Thanks for the ping on this.

> Would it be possible to implement the nested struct imputation work you're doing with this approach?

Do you mean reusing the `PhysicalExprSchemaRewriter` machinery to drive nested-struct imputation? Here are some disadvantages of the rewrite-centric approach versus the more data-centric adapter approach:

- **Expression-only, not data-level**: it never actually transforms the underlying `RecordBatch` columns. If downstream logic (or the user) inspects a struct column directly, they won't see the new null fields injected, so we'd still need array-level imputation for correctness in the result batches.
- **Limited to predicate contexts**: the rewriter hooks into filtering and pruning, but our broader schema-evolution needs (e.g. reading all columns, `SELECT *`, writing out evolved nested structs) fall outside its scope.
- **Duplication risk**: we end up reinventing part of the schema adapter's compatibility logic (matching fields by name, casting types) inside the rewriter, which can drift from the adapter's rules over time.
- **Complexity with deep nesting**: recursively handling deeply nested structs inside an expression tree, and ensuring every nested-field access gets rewritten with the right shape, quickly becomes more intricate than a simple tree visitor.
- **Performance implications**: repeatedly rewriting and reconstructing expression trees (and then evaluating those casts/literals) may be less efficient than bulk array-level casts plus struct rebuilds, especially on wide tables.

So, could we bolt nested-struct imputation onto this rewriter? Technically yes: we could extend `rewrite_column` so that, whenever we see a `Column` referring to `foo.bar.baz` that is missing in the physical schema, we generate a null `Literal` of the full nested type (constructing the proper struct scalar); see the first sketch at the end of this comment. But we'd still need an array-level counterpart to actually materialize those null nested fields in the `RecordBatch` when we call `map_batch`.

In practice, the two approaches complement each other:

- Use the rewriter to handle predicate and projection expressions, so filters and column references don't blow up.
- Continue to rely on `cast_struct_column` + `NestedStructSchemaAdapter` to adapt the actual batch data, filling in null arrays and doing recursive casts.

That way we get the best of both worlds: clean, centralized expression rewriting for pushdown, and robust array-level marshalling for the final batches. 😊

## Why the two-pronged approach makes sense

1. **Pushdown and data adaptation are different concerns.** The `PhysicalExprSchemaRewriter` is perfect for rewriting predicates and projections so they don't blow up when the on-disk schema diverges. But once that Parquet row group has been read into memory, the `StructArray` itself still needs reshaping: filling in null arrays for new nested fields, dropping old ones, recursively casting types.
2. **Keeping pruning code lean.** Swapping out the old `SchemaMapper` for the rewriter in the pruning path is a great win: much less boilerplate, better separation of concerns, and no more "fake record batches" just to evaluate a filter. All of that pruning-specific adapter code can go, leaning instead on the rewriter's tree visitor.
3. **Deep schema evolution still lives in the adapter.** Handling a top-level missing column is easy in the rewriter (just rewrite `col("foo")` to `lit(NULL)`), but handling `col("a.b.c")`, where `b` itself is a new struct and `c` is a new field inside it, is far more natural in a recursive `cast_struct_column` that operates on `StructArray` children (see the second sketch below).
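For concreteness, here is a minimal, hypothetical sketch of the expression-side idea: during a bottom-up rewrite, any column that exists in the table schema but is missing from the physical (file) schema gets replaced with a typed null literal, which may itself be a nested struct null. The helper name `rewrite_missing_columns` and the exact traversal are assumptions for illustration, not the actual `PhysicalExprSchemaRewriter` API, and it only covers top-level columns; rewriting genuine nested-field accesses like `foo.bar.baz` is exactly the part that gets intricate.

```rust
use std::sync::Arc;

use arrow::datatypes::Schema;
use datafusion::common::tree_node::{Transformed, TreeNode};
use datafusion::common::{Result, ScalarValue};
use datafusion::physical_expr::expressions::{Column, Literal};
use datafusion::physical_expr::PhysicalExpr;

/// Illustrative only: replace references to columns that the table schema has
/// but the file schema lacks with a NULL literal of the table-side type.
fn rewrite_missing_columns(
    expr: Arc<dyn PhysicalExpr>,
    table_schema: &Schema,
    physical_schema: &Schema,
) -> Result<Arc<dyn PhysicalExpr>> {
    let rewritten = expr.transform(|e| {
        if let Some(col) = e.as_any().downcast_ref::<Column>() {
            if physical_schema.field_with_name(col.name()).is_err() {
                // The file does not have this column: build a NULL ScalarValue
                // of the *table* type, which may be a (deeply nested) struct.
                let table_type = table_schema.field_with_name(col.name())?.data_type();
                let null = ScalarValue::try_from(table_type)?;
                return Ok(Transformed::yes(
                    Arc::new(Literal::new(null)) as Arc<dyn PhysicalExpr>,
                ));
            }
        }
        Ok(Transformed::no(e))
    })?;
    Ok(rewritten.data)
}
```

This keeps filters evaluable against old files, but note that it never touches the arrays themselves, which is the limitation called out in the list above.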
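And the array-level counterpart: roughly the kind of recursive reshaping that `cast_struct_column` / `NestedStructSchemaAdapter` are responsible for. The helper name `cast_struct_to` and its signature are made up for illustration, assuming standard arrow-rs APIs (`new_null_array`, `compute::cast`, `StructArray::new`):

```rust
use std::sync::Arc;

use arrow::array::{new_null_array, Array, ArrayRef, StructArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Fields};
use arrow::error::ArrowError;

/// Illustrative only: reshape `source` to match `target_fields`, filling
/// brand-new (possibly nested) fields with null arrays and recursing into
/// child structs.
fn cast_struct_to(
    source: &StructArray,
    target_fields: &Fields,
) -> Result<ArrayRef, ArrowError> {
    let len = source.len();
    let mut children: Vec<ArrayRef> = Vec::with_capacity(target_fields.len());

    for field in target_fields.iter() {
        let child = match source.column_by_name(field.name()) {
            // Field added by schema evolution: materialize an all-null child.
            None => new_null_array(field.data_type(), len),
            Some(existing) => match (existing.data_type(), field.data_type()) {
                // Struct on both sides: recurse so newly added grandchildren
                // (the `a.b.c` case) are filled in as well.
                (DataType::Struct(_), DataType::Struct(nested)) => {
                    let nested_src = existing
                        .as_any()
                        .downcast_ref::<StructArray>()
                        .expect("struct child");
                    cast_struct_to(nested_src, nested)?
                }
                // Otherwise fall back to a plain arrow cast.
                _ => cast(existing.as_ref(), field.data_type())?,
            },
        };
        children.push(child);
    }

    Ok(Arc::new(StructArray::new(
        target_fields.clone(),
        children,
        source.nulls().cloned(),
    )))
}
```

Because this rebuilds the actual child arrays, a later `SELECT *` (or anything else that inspects the batch) sees the imputed null fields, which the expression rewrite alone can't provide.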