rtpsw commented on PR #14043:
URL: https://github.com/apache/arrow/pull/14043#issuecomment-1257178331

   > I think it would be more valuable to have an "expression executor" and not 
merely a "function executor".
   
   While I don't know enough about expression use cases, this sounds right to 
me in the wider context, i.e., outside of just UDFs/UDTs. Does this mean the 
expression executor should be built on top of the function executor proposed 
here?
   
   > Also, removing the lookup / argument resolution time is nice but the 
biggest win would be removing allocations for temporaries / outputs but that 
can be deferred for a future PR :).
   
   I agree that removing allocations etc. would be a significant win. What do 
you think is missing from this PR to do so? At least at the function executor 
API level, I believe repeated invocations of `FunctionExecutor::Execute(const 
std::vector<Datum>& args, int64_t passed_length)` should be able to avoid 
repeated allocations.
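
   As a rough illustration of the reuse idea only, here is a toy sketch in plain Python. It is not Arrow's API: `ToyFunctionExecutor`, `add_kernel`, and the dict-based `registry` are hypothetical names, and the assumption is simply that an executor can resolve its kernel once and keep an output buffer across calls.

```python
class ToyFunctionExecutor:
    """Hypothetical stand-in for an executor; not Arrow's FunctionExecutor."""

    def __init__(self, registry, name):
        # Function lookup happens once, at construction time.
        self._kernel = registry[name]
        self._out = []

    def execute(self, args, length):
        # Allocate the output buffer only when the batch length changes.
        if len(self._out) != length:
            self._out = [0] * length
        self._kernel(args, self._out)
        return self._out


def add_kernel(args, out):
    """Toy kernel: element-wise addition written into a caller-owned buffer."""
    left, right = args
    for i in range(len(out)):
        out[i] = left[i] + right[i]


registry = {"add": add_kernel}  # hypothetical registry, a plain dict here
executor = ToyFunctionExecutor(registry, "add")

first = executor.execute(([1, 2, 3], [10, 20, 30]), 3)
first_snapshot = list(first)  # copy, because the buffer is reused below
second = executor.execute(([4, 5, 6], [1, 1, 1]), 3)
```

   The trade-off in this sketch is that the returned buffer is overwritten by the next call, so a caller has to consume or copy each result before invoking the executor again.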
   
   > I think the alternative (and this may be a misunderstanding of your goal) 
is that a UDT not be put into the function registry.
   
   This is an interesting alternative to compare to. I'll try to explain below 
the differences between this and the one proposed in this PR. I think each of 
the two alternatives has its merits, and we'll just need to choose whether we 
want one or both.
   
   The source-node (Weston's) proposal for UDTs has several pros that I can 
see. It requires fewer changes to Arrow in an end-to-end solution, probably just 
in the Substrait engine component. It bypasses the need to manage nested 
registries for UDTs (though these are still needed for UDFs). And its PyArrow 
part builds on fewer Arrow APIs, probably just the source-node related APIs. 
OTOH, it also has some cons. It requires a separate source-node per UDT. It 
does not directly support composing UDFs (say, from a library) with a UDT. And 
it does not directly support ordering of UDTs within one execution.
   
   The function-executor (my) proposal for UDTs, besides supporting the 
expression executor feature Weston mentioned, has pretty much the reverse pros 
and cons. I'll elaborate on the less trivial points about UDT composition and 
ordering. In the prototype, from Arrow's point of view, a UDT is defined as a 
function that returns a generator of tabular data. However, from Substrait's 
point of view, a UDT is modeled like a 
[monad](https://en.wikipedia.org/wiki/Monad_(functional_programming)), which 
abstracts out side-effects. This enables composition of UDTs and UDFs in a 
single expression (rather than placing each UDT in a separate node), kind of 
like functions and monads can be composed in a functional programming language. 
For example, one can consider an expression like `a * prng(1) + b * prng(2)`, 
where `prng(seed)` generates a pseudorandom sequence using `seed`. This 
expression yields a different value on each evaluation, but the same sequence 
of values when restarted. While this example is slightly contrived, and it 
should (and can easily) be extended to tabular data, it is intended for 
demonstrating composition and ordering with UDTs.
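
   To make the `prng` example concrete, here is a minimal sketch in plain Python. It is not the prototype's code: `prng` is modeled as a hypothetical generator function standing in for a UDT, `compose` is an illustrative helper, and scalars are used instead of tabular data.

```python
import random
from itertools import islice


def prng(seed):
    """Hypothetical UDT stand-in: an endless stream determined by `seed`."""
    rng = random.Random(seed)
    while True:
        yield rng.random()


def compose(a, b):
    """Lazily evaluate a * prng(1) + b * prng(2), one value per step."""
    for x, y in zip(prng(1), prng(2)):
        yield a * x + b * y


# Each evaluation step yields a new value, but restarting the expression
# replays the same sequence because both streams are seeded.
first_run = list(islice(compose(2.0, 3.0), 5))
second_run = list(islice(compose(2.0, 3.0), 5))
```

   In this sketch the two streams advance in lockstep inside a single expression, which is the composition and ordering property described above, without placing each stream in its own node.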


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
