adriangb commented on issue #6735: URL: https://github.com/apache/arrow-rs/issues/6735#issuecomment-3029106131
> > > Often computing the transformation may be non-trivial (e.g. matching columns by name), so it would be nice to do the mapping calculation once per schema rather than once per batch / StructArray. For example, DF's SchemaAdapter computes the mapping once and can then apply that to multiple batches.
> > >
> > > I'm not sure how this would happen in practice: there's no state in `UDFs` in DataFusion. So if we e.g. wanted to implement `cast(...)` in terms of a SchemaAdapter we have nowhere to store the pre-computed value. I think we'd have to introduce some sort of build step that goes around the expression tree and optimizes expressions for the given input / output schemas.
>
> I wonder if we could use the new snapshot machinery 🤔
>
> https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.snapshot

`snapshot` doesn't know anything about its input data types. I did propose something similar in https://github.com/apache/datafusion/pull/15057#issuecomment-2800002196, which is actually part of what led to the idea of `PhysicalExpr::snapshot`.

I think it could be a performance win to "specialize" the whole expression tree to the work it's actually going to do, but that will likely depend on the balance of upfront cost vs. execution gains, which would require benchmarks, etc.

My intuition is that for now we could start by doing something naive: build a SchemaAdapter every time `CastExpr::evaluate` gets called and then throw it away. I think that will be more expensive than the current cast kernel, but maybe not by much. If that is too expensive, then we can think of some system to specialize PhysicalExprs.
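
To make the trade-off concrete, here is a minimal std-only sketch of the two options: computing a by-name column mapping once per schema and reusing it, versus rebuilding it on every call and throwing it away. `FieldMapping`, `build`, `apply` and `evaluate_naive` are hypothetical names made up for illustration; they are not DataFusion's `SchemaAdapter` or any arrow-rs API, and plain slices stand in for schemas and batches.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a precomputed schema mapping
/// (NOT DataFusion's SchemaAdapter API).
/// For each target column: the index of the same-named source column, if any.
#[derive(Debug, PartialEq)]
pub struct FieldMapping {
    source_index: Vec<Option<usize>>,
}

impl FieldMapping {
    /// The potentially non-trivial step: match columns by name.
    /// If the source has duplicate names, the last one wins.
    pub fn build(source: &[&str], target: &[&str]) -> Self {
        let by_name: HashMap<&str, usize> = source
            .iter()
            .enumerate()
            .map(|(i, name)| (*name, i))
            .collect();
        let source_index = target
            .iter()
            .map(|name| by_name.get(*name).copied())
            .collect();
        FieldMapping { source_index }
    }

    /// The cheap per-batch step: reorder columns according to the mapping.
    /// Columns missing from the source become `None` (a stand-in for a
    /// null-filled column). Panics if `batch` has fewer columns than the
    /// source schema the mapping was built from.
    pub fn apply<T: Clone>(&self, batch: &[T]) -> Vec<Option<T>> {
        self.source_index
            .iter()
            .map(|&idx| idx.map(|i| batch[i].clone()))
            .collect()
    }
}

/// The naive approach: rebuild the mapping on every call, use it once,
/// and drop it. No state has to be stored anywhere between calls.
pub fn evaluate_naive<T: Clone>(
    source: &[&str],
    target: &[&str],
    batch: &[T],
) -> Vec<Option<T>> {
    FieldMapping::build(source, target).apply(batch)
}
```

Both paths produce the same output; the only difference is whether `build` runs once per schema or once per batch. Whether that repeated cost matters in practice is exactly what benchmarks would need to show.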