westonpace commented on PR #14043: URL: https://github.com/apache/arrow/pull/14043#issuecomment-1256999253
A few thoughts:

> I understand, but when do users touch the function execution API? I think that'd primarily be through the python or R bindings to handle ad-hoc cases like adding two arrays together... and in that case, constructing a FunctionExecutor would not be useful since the user input time delay will greatly outweigh kernel lookup.

I'm pretty sure there are existing cases where users have interacted directly with functions and not via an execution plan (I think @marsupialtail does this with Take and I seem to recall @drin using the compute API directly as well). I'm not sure those cases couldn't be converted to an execution plan, but they do exist. IIRC these are cases where the user already has an engine / execution plan of their own and they are simply trying to integrate Arrow compute.

That being said, if we were going to go down this road, I think it would be more valuable to have an "expression executor" and not merely a "function executor". Also, removing the lookup / argument resolution time is nice, but the biggest win would be removing allocations for temporaries / outputs. That can be deferred to a future PR :).

> @bkietz, before I answer your points, I should note that the code here is extracted from a working end-to-end (Ibis/Substrait/PyArrow) prototype for UDFs and UDTs, which are UDFs that provide a stream of tabular data, that I developed. While this doesn't mean its design would be accepted as is (and I do welcome feedback on it, or parts of it that I extract), there is currently no alternative working design. I'd expect to see a comparable alternative put forward, so I could evaluate pros and cons in the context of end-to-end support for UDFs and UDTs. In my mind, just evolving expression binding is not a comparable alternative.

I think the alternative (and this may be a misunderstanding of your goal) is that a UDT not be put into the function registry, even if it looks like a UDF elsewhere (e.g. Substrait).

As an example, consider an embedded Python Substrait UDT (which does very much look like a UDF). When we consume that plan we would convert that embedded UDT into a function. Let's say it is a Python function that returns an iterator of tabular data. Instead of creating a stateful function to poll that iterator, we could put that iterator into a source node, probably one of the source nodes you just created in #14207. The `it_maker` would be a wrapper around your Python function that returns a wrapper around your Python iterable (I am fairly certain we wrap Python iterators in either `RecordBatchReader` or `AsyncGenerator<RecordBatch>` somewhere else too). This removes the overhead of the function registry entirely.
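
To make the source-node idea concrete, here is a rough pure-Python sketch. It is not Arrow's API: `user_udt`, `make_it_maker`, and `IteratorSourceNode` are hypothetical names, and a dict of lists stands in for a record batch. The only point it illustrates is that the node holds an `it_maker` and pulls batches from a fresh iterator, so nothing has to be registered as a stateful function.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Stand-in for a record batch: column name -> column values (assumption, not Arrow).
Batch = Dict[str, List[int]]


def user_udt(start: int, stop: int, batch_size: int) -> Iterator[Batch]:
    """Hypothetical user-defined table function that yields tabular data in batches."""
    for lo in range(start, stop, batch_size):
        hi = min(lo + batch_size, stop)
        xs = list(range(lo, hi))
        yield {"x": xs, "x_squared": [v * v for v in xs]}


def make_it_maker(udt: Callable[..., Iterable[Batch]], *args) -> Callable[[], Iterator[Batch]]:
    """Wrap the UDT and its arguments so that every call returns a fresh iterator."""
    def it_maker() -> Iterator[Batch]:
        return iter(udt(*args))
    return it_maker


class IteratorSourceNode:
    """Toy source node: it owns the iterator factory, so no registry lookup is involved."""

    def __init__(self, it_maker: Callable[[], Iterator[Batch]]) -> None:
        self._it_maker = it_maker

    def run(self) -> Iterator[Batch]:
        # Each run starts from a fresh iterator and forwards batches downstream.
        yield from self._it_maker()


source = IteratorSourceNode(make_it_maker(user_udt, 0, 5, 2))
batches = list(source.run())
```

In the real thing the iterator wrapper would presumably be one of the existing reader/generator types mentioned above instead of a plain Python generator, and the node would be one of the source nodes from #14207.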
