[GitHub] [arrow] rtpsw commented on pull request #14043: ARROW-17613: [C++] Add function execution API for a preconfigured kernel

GitBox Thu, 22 Sep 2022 10:28:25 -0700


rtpsw commented on PR #14043:
URL: https://github.com/apache/arrow/pull/14043#issuecomment-1255332996


   @bkietz, before I answer your points, I should note that the code here is 
extracted from a working end-to-end (Ibis/Substrait/PyArrow) prototype that I 
developed for UDFs and UDTs, where a UDT is a UDF that provides a stream of 
tabular data. While this doesn't mean its design would be accepted as is (and I 
do welcome feedback on it, or on the parts of it that I extract), there is 
currently no alternative working design. I'd expect to see a comparable 
alternative put forward, so that I could evaluate the pros and cons in the 
context of end-to-end support for UDFs and UDTs. In my mind, just evolving 
expression binding is not a comparable alternative.
   
   > I understand, but when do users touch the function execution API?
   
   I said "user" but didn't necessarily mean "end-user"; I should have said 
"caller" for clarity. Still, the pre-PR function execution API is public, so we 
should assume it is used by end-users, and the burden of proof is on anyone 
claiming the opposite (e.g., for deprecation purposes). The fact that a 
higher-level API exists, which may be convenient for many use cases (like 
streaming), does not change this.
   
   Granted, there is also a burden of showing the proposed API is useful. I 
could point you to how the prototype uses this new function execution API, if 
that would be helpful. The general idea is that the end-user is driving from 
PyArrow and registers a UDT. The UDT is a Python-implemented function that may 
be invoked multiple times, each at a different source node in the execution 
plan. Each such invocation returns a stream object implemented in Python that 
is managed in a kernel state. Invoking the kernel returns tabular data that is 
part of the dynamically generated stream. The new function execution API is 
designed to enable this setup.
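   
   To make the setup above concrete, here is a minimal pure-Python sketch of 
the idea only. It does not use Arrow or the PR's API; `make_counter_udt`, 
`StreamKernelState`, `invoke`, and the dict-of-lists `Table` are hypothetical 
names chosen for illustration, standing in for a registered UDT, the kernel 
state that owns its stream, a kernel invocation, and tabular data.

```python
from typing import Callable, Dict, Iterator, List, Optional

# Stand-in for tabular data (column name -> values); NOT an Arrow type.
Table = Dict[str, List[int]]


def make_counter_udt(batch_size: int, num_batches: int) -> Callable[[], Iterator[Table]]:
    """Hypothetical UDT: every call returns a fresh stream of tabular batches."""

    def udt() -> Iterator[Table]:
        for b in range(num_batches):
            start = b * batch_size
            yield {"x": list(range(start, start + batch_size))}

    return udt


class StreamKernelState:
    """Hypothetical kernel state owning the stream object of one source node."""

    def __init__(self, udt: Callable[[], Iterator[Table]]) -> None:
        # The UDT is invoked once per source node and its stream is kept here.
        self.stream = udt()

    def invoke(self) -> Optional[Table]:
        """One kernel invocation: the next batch, or None once exhausted."""
        return next(self.stream, None)


udt = make_counter_udt(batch_size=2, num_batches=2)
source_a = StreamKernelState(udt)  # first source node in the plan
source_b = StreamKernelState(udt)  # second source node, with its own stream

first_a = source_a.invoke()
second_a = source_a.invoke()
first_b = source_b.invoke()  # unaffected by the calls made on source_a
```

   The point of the sketch is that the state is mutated by each invocation and 
is private to one source node, which is why a stateless, scalar-only execution 
path does not cover it.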
   
   > A FunctionExecutor would only be useful when executing the same function multiple times
   
   Yes, on arguments of the same types.
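   
   As a rough illustration of "same function, same argument types", the 
following pure-Python sketch resolves a kernel once and then reuses it. It is 
not the C++ API from this PR; `PreconfiguredExecutor`, `ADD_KERNELS`, and 
`execute` are hypothetical names, and Python types stand in for Arrow data 
types.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

# Hypothetical kernel table for an "add" function, keyed by argument types.
ADD_KERNELS: Dict[Tuple[type, ...], Callable[..., Any]] = {
    (int, int): lambda a, b: a + b,
    (float, float): lambda a, b: a + b,
}


class PreconfiguredExecutor:
    """Hypothetical stand-in: dispatch once, then execute many times."""

    def __init__(self, kernels: Dict[Tuple[type, ...], Callable[..., Any]],
                 arg_types: Sequence[type]) -> None:
        self.arg_types = tuple(arg_types)
        # Kernel lookup happens only here, not on every call.
        self.kernel = kernels[self.arg_types]
        self.calls = 0

    def execute(self, *args: Any) -> Any:
        actual = tuple(type(a) for a in args)
        if actual != self.arg_types:
            raise TypeError(f"expected {self.arg_types}, got {actual}")
        self.calls += 1
        return self.kernel(*args)


executor = PreconfiguredExecutor(ADD_KERNELS, [int, int])
results = [executor.execute(i, i) for i in range(3)]
```

   Calling `executor.execute(1.0, 2.0)` raises `TypeError` in this sketch, 
mirroring the restriction that the preconfigured kernel is only valid for the 
argument types it was set up with.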
   
   > What I'd like to hear is when that's beneficial and isn't served by construction of an ExecPlan.
   
   AFAICS, the UDT functionality described above can be served neither by an 
ExecPlan nor by expression binding.
   
   > Kernel preconfiguration is precisely the function of Expression::Bind, among other things:
   > 
   >     * [invokes Function::DispatchBest](https://github.com/apache/arrow/blob/40ec95646962cccdcd62032c80e8506d4c275bc6/cpp/src/arrow/compute/exec/expression.cc#L372) to acquire a kernel and types for implicit casts
   > 
   >     * [caches the Kernel and its state](https://github.com/apache/arrow/blob/40ec95646962cccdcd62032c80e8506d4c275bc6/cpp/src/arrow/compute/exec/expression.h#L54-L58) for later use in execution
   > 
   >     * note that currently Expression execution assumes only scalar functions are referenced and that KernelState is not mutated
   
   IIUC, C++ code (for expression binding) is the driver here. In the 
prototype's design, it is end-user code, via PyArrow, that is the driver. It's 
not clear how to reconcile the two in a proposal based on expression binding. 
At the least, this would need to be explained in the context of a more complete 
description of an alternative.


