adriangb commented on issue #6735:
URL: https://github.com/apache/arrow-rs/issues/6735#issuecomment-3029106131

   > > > Often computing the transformation may be non trivial (e.g. matching 
columns by name) so it would be nice to do the mapping calculation once per 
schema rather than once per batch / StructArrayschema. For example DF's 
SchemaAdapter computes the mapping once and can then apply that to multiple 
batches.
   > > 
   > > 
   > > I'm not sure how this would happen in practice: there's no state in 
`UDFs` in DataFusion. So if we e.g. wanted to implement `cast(...)` in terms of 
a SchemaAdapter we have nowhere to store the pre-computed value. I think we'd 
have to introduce some sort of build step that goes around the expression tree 
and optimizes expressions for the given input / output schemas.
   > 
   > I wonder if we could use the new snapshot machinery 🤔
   > 
   > 
https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.snapshot
   
   `snapshot` doesn't know anything about its input data types. I did propose 
something similar in 
https://github.com/apache/datafusion/pull/15057#issuecomment-2800002196 which 
is actually part of what led to the idea of `PhysicalExpr::snapshot`.
   
   I think it could be a win in performance to "specialize" the whole 
expression tree to the work it's actually going to do but it will likely depend 
on the balance of upfront cost vs. execution gains, which would require 
benchmarks, etc.
   
   My intuition is that for now we could start by doing something naive: build 
a SchemaAdapter every time `CastExpr::evaluate` gets called and then throw it 
away. I think that will be more expensive than the current cast kernel but 
maybe not by much. If that is too expensive then we can think of some system to 
specialize `PhysicalExpr`s.
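
To make the trade-off concrete, here is a rough std-only sketch of the two options. It is not DataFusion or arrow-rs code: `Batch`, `ColumnMapping` and `adapt_naive` are hypothetical stand-ins I made up for illustration, with `Option<i64>` columns standing in for nullable arrays. `ColumnMapping::build` plays the role of the by-name matching step, and `adapt_naive` is the "build it on every call and throw it away" variant.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a record batch: named, nullable i64 columns.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub names: Vec<String>,
    pub columns: Vec<Vec<Option<i64>>>,
}

/// Hypothetical precomputed mapping: for each target column, the index of
/// the matching source column, or `None` if the source lacks that column.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnMapping {
    pub target_names: Vec<String>,
    pub source_indices: Vec<Option<usize>>,
}

impl ColumnMapping {
    /// The potentially non-trivial step: match columns by name.
    pub fn build(source_names: &[String], target_names: &[String]) -> Self {
        let by_name: HashMap<&str, usize> = source_names
            .iter()
            .enumerate()
            .map(|(i, n)| (n.as_str(), i))
            .collect();
        let source_indices = target_names
            .iter()
            .map(|n| by_name.get(n.as_str()).copied())
            .collect();
        Self {
            target_names: target_names.to_vec(),
            source_indices,
        }
    }

    /// The per-batch step: reorder columns, filling missing ones with nulls.
    pub fn apply(&self, batch: &Batch) -> Batch {
        let rows = batch.columns.first().map_or(0, |c| c.len());
        let columns = self
            .source_indices
            .iter()
            .map(|idx| match idx {
                Some(i) => batch.columns[*i].clone(),
                None => vec![None; rows],
            })
            .collect();
        Batch {
            names: self.target_names.clone(),
            columns,
        }
    }
}

/// Naive variant: rebuild the mapping on every call and discard it.
pub fn adapt_naive(batch: &Batch, target_names: &[String]) -> Batch {
    ColumnMapping::build(&batch.names, target_names).apply(batch)
}
```

Both paths produce the same output; the only difference is whether `build` runs once per schema or once per batch, which is the cost that benchmarks would have to weigh.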


