rtpsw commented on PR #14043: URL: https://github.com/apache/arrow/pull/14043#issuecomment-1257178331
> I think it would be more valuable to have an "expression executor" and not merely a "function executor".

While I don't know enough about expression use cases, this sounds right to me in the wider context, i.e., outside of just UDFs/UDTs. Does this mean the expression executor should be built on top of the function executor proposed here?

> Also, removing the lookup / argument resolution time is nice but the biggest win would be removing allocations for temporaries / outputs but that can be deferred for a future PR :).

I agree that removing allocations etc. would be a significant win. What do you think is missing from this PR to do so? At least at the function executor API level, I believe repeated invocations of `FunctionExecutor::Execute(const std::vector<Datum>& args, int64_t passed_length)` should be able to avoid repeated allocations.

> I think the alternative (and this may be a misunderstanding of your goal) is that a UDT not be put into the function registry.

This is an interesting alternative to compare to. I'll try to explain below the differences between it and the one proposed in this PR. I think each of the two alternatives has its merits, and we'll just need to choose whether we want one or both.

The source-node (Weston's) proposal for UDTs has several pros that I can see. It requires fewer changes to Arrow in an end-to-end solution, probably just in the Substrait engine component. It bypasses the need to manage nested registries for UDTs (though these are still needed for UDFs). And its PyArrow part builds on fewer Arrow APIs, probably just the source-node related APIs.

OTOH, it also has some cons. It requires a separate source node per UDT. It does not directly support composing UDFs (say, from a library) with a UDT. And it does not directly support ordering of UDTs within one execution.

The function-executor (my) proposal for UDTs, besides supporting the expression executor feature Weston mentioned, has pretty much the reverse pros and cons.
I'll elaborate on the less trivial points about UDT composition and ordering. In the prototype, from Arrow's point of view, a UDT is defined as a function that returns a generator of tabular data. However, from Substrait's point of view, a UDT is modeled like a [monad](https://en.wikipedia.org/wiki/Monad_(functional_programming)), which abstracts out side effects. This enables composition of UDTs and UDFs in a single expression (rather than placing each UDT in a separate node), much as functions and monads can be composed in a functional programming language.

For example, consider an expression like `a * prng(1) + b * prng(2)`, where `prng(seed)` generates a pseudorandom sequence using `seed`. This expression yields a different value on each evaluation, but the same sequence of values when restarted. While this example is slightly contrived, and it should (and can easily) be extended to tabular data, it is intended to demonstrate composition and ordering with UDTs.
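To make the `prng` example a bit more concrete, here is a minimal pure-Python sketch of the idea. It deliberately uses no Arrow or Substrait API: the names `prng`, `make_expression`, and `take` are hypothetical, a plain generator of floats stands in for a generator of tabular data, and Python's `random.Random` is an assumed stand-in for whatever pseudorandom source a real UDT would use.

```python
import itertools
import random
from typing import Callable, Iterator, List


def prng(seed: int) -> Iterator[float]:
    """Hypothetical stand-in for a UDT: an endless, seeded pseudorandom stream."""
    rng = random.Random(seed)  # private state, so streams do not interfere
    while True:
        yield rng.random()


def make_expression(a: float, b: float) -> Callable[[], Iterator[float]]:
    """Compose two stateful streams into `a * prng(1) + b * prng(2)`.

    Returns a factory: each call restarts the whole expression from scratch.
    """
    def start() -> Iterator[float]:
        left, right = prng(1), prng(2)
        # zip advances `left` before `right` on every step, so the order in
        # which the two streams are pulled is fixed within one execution.
        for x, y in zip(left, right):
            yield a * x + b * y
    return start


def take(stream: Iterator[float], n: int) -> List[float]:
    """Collect the first n values of a stream."""
    return list(itertools.islice(stream, n))


if __name__ == "__main__":
    start = make_expression(2.0, 3.0)
    first_run = take(start(), 3)
    second_run = take(start(), 3)
    print(first_run)                # successive evaluations give different values
    print(first_run == second_run)  # a restart replays the same sequence
```

The side effects (advancing each generator's state) stay hidden behind the iterator interface, which is roughly the role the monad-like modeling plays in the description above: the expression composes like ordinary arithmetic, while the state is threaded through in a fixed order.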
