[GitHub] [arrow] westonpace commented on pull request #14043: ARROW-17613: [C++] Add function execution API for a preconfigured kernel

GitBox Sat, 24 Sep 2022 08:53:46 -0700


westonpace commented on PR #14043:
URL: https://github.com/apache/arrow/pull/14043#issuecomment-1256999253


   A few thoughts:
   
   > I understand, but when do users touch the function execution API? I think 
that'd primarily be through the python or R bindings to handle ad-hoc cases 
like adding two arrays together... and in that case, constructing a 
FunctionExecutor would not be useful since the user input time delay will 
greatly outweigh kernel lookup.
   
   I'm pretty sure there are existing cases where users have interacted 
directly with functions and not via an execution plan (I think @marsupialtail 
does this with Take and I seem to recall @drin using the compute API directly 
as well).  I'm not sure those cases couldn't be converted to an execution plan 
but they do exist.  IIRC these are cases where the user already has an engine / 
execution plan of their own and they are simply trying to integrate Arrow 
compute.
   
   That being said, if we were going to go down this road, I think it would be 
more valuable to have an "expression executor" and not merely a "function 
executor".  Also, removing the lookup / argument resolution time is nice, but 
the biggest win would be removing allocations for temporaries / outputs.  That 
can be deferred to a future PR :).
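   
   To make the lookup / argument resolution point concrete, here is a minimal 
toy sketch in plain Python.  It is not Arrow's API: `ToyRegistry`, 
`call_function` and this `FunctionExecutor` are hypothetical stand-ins that 
only model "resolve the kernel on every call" versus "resolve once, then 
reuse".

```python
from typing import Callable, Dict, List, Sequence, Tuple

Kernel = Callable[[List, List], List]


def _add(left: List, right: List) -> List:
    """Element-wise addition over two equal-length lists."""
    return [x + y for x, y in zip(left, right)]


class ToyRegistry:
    """Hypothetical registry mapping (name, argument types) to a kernel."""

    def __init__(self) -> None:
        self.kernels: Dict[Tuple[str, Tuple[type, ...]], Kernel] = {
            ("add", (int, int)): _add,
            ("add", (float, float)): _add,
        }
        self.lookups = 0  # counts how many times dispatch ran

    def resolve(self, name: str, arg_types: Sequence[type]) -> Kernel:
        self.lookups += 1
        return self.kernels[(name, tuple(arg_types))]


def call_function(registry: ToyRegistry, name: str, left: List, right: List) -> List:
    """Ad-hoc path: inspect the arguments and resolve the kernel on every call."""
    kernel = registry.resolve(name, (type(left[0]), type(right[0])))
    return kernel(left, right)


class FunctionExecutor:
    """Preconfigured path: resolve once at construction, reuse for every batch."""

    def __init__(self, registry: ToyRegistry, name: str, arg_types: Sequence[type]) -> None:
        self._kernel = registry.resolve(name, arg_types)

    def execute(self, left: List, right: List) -> List:
        return self._kernel(left, right)
```

   In this model, calling `call_function` once per batch pays for one lookup 
per batch, while a `FunctionExecutor` pays for exactly one lookup no matter how 
many batches it processes.  The sketch does not address allocations for 
temporaries or outputs.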
   
   > @bkietz, before I answer your points, I should note that the code here is 
extracted from a working end-to-end (Ibis/Substrait/PyArrow) prototype for UDFs 
and UDTs, which are UDFs that provide a stream of tabular data, that I 
developed. While this doesn't mean its design would be accepted as is (and I do 
welcome feedback on it, or parts of it that I extract), there is currently no 
alternative working design. I'd expect to see a comparable alternative put 
forward, so I could evaluate pros and cons in the context of end-to-end support 
for UDFs and UDTs. In my mind, just evolving expression binding is not a 
comparable alternative.
   
   I think the alternative (and this may be a misunderstanding of your goal) is 
that a UDT not be put into the function registry, even if it looks like a UDF 
elsewhere (e.g. in Substrait).  As an example, consider an embedded Python 
Substrait UDT (which does very much look like a UDF).  When we consume that 
plan we would convert that embedded UDT into a function.  Let's say it is a 
Python function that returns an iterator of tabular data.  Instead of creating 
a stateful function to poll that iterator, we could put that iterator into a 
source node, probably one of the source nodes you just created in #14207.  The 
`it_maker` would be a wrapper around your Python function that returns a 
wrapper around your Python iterable (I am fairly certain we wrap Python 
iterators in either `RecordBatchReader` or `AsyncGenerator<RecordBatch>` 
somewhere else too).  This removes the overhead of the function registry 
entirely.
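   
   As a rough illustration of the source-node idea, here is a toy sketch in 
plain Python.  Everything except the name `it_maker` is hypothetical 
(`user_udt`, `make_it_maker`, `ToySourceNode`), a dict of lists stands in for a 
record batch, and none of this is the actual Arrow source node interface.

```python
from typing import Callable, Dict, Iterator, List

Batch = Dict[str, List[int]]  # stand-in for a record batch


def user_udt(n_batches: int) -> Iterator[Batch]:
    """Hypothetical user-defined table function: yields tabular batches."""
    for i in range(n_batches):
        yield {"x": [i, i + 1]}


def make_it_maker(udt: Callable[..., Iterator[Batch]], *args) -> Callable[[], Iterator[Batch]]:
    """Wrap the UDT so that each call to it_maker() yields a fresh iterator."""

    def it_maker() -> Iterator[Batch]:
        return iter(udt(*args))

    return it_maker


class ToySourceNode:
    """Toy source node: pulls batches from the iterator and pushes them downstream."""

    def __init__(self, it_maker: Callable[[], Iterator[Batch]]) -> None:
        self._it_maker = it_maker

    def run(self, downstream: Callable[[Batch], None]) -> None:
        # The iteration state lives in the iterator itself, so no stateful
        # function and no function-registry entry is needed.
        for batch in self._it_maker():
            downstream(batch)
```

   Because the node asks `it_maker` for a new iterator on each run, the same 
node can be run more than once without sharing state between runs.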


