rtpsw commented on PR #14043: URL: https://github.com/apache/arrow/pull/14043#issuecomment-1255332996
@bkietz, before I answer your points, I should note that the code here is extracted from a working end-to-end (Ibis/Substrait/PyArrow) prototype that I developed for UDFs and UDTs, where a UDT is a UDF that provides a stream of tabular data. While this doesn't mean its design would be accepted as is (and I do welcome feedback on it, or on parts of it that I extract), there is currently no alternative working design. I'd expect to see a comparable alternative put forward, so I could evaluate pros and cons in the context of end-to-end support for UDFs and UDTs. In my mind, just evolving expression binding is not a comparable alternative.

> I understand, but when do users touch the function execution API?

I said "user" but didn't necessarily mean "end-user"; I should have said "caller" for clarity. Still, the pre-PR function execution API is public, so we should assume it is used by end-users, and the burden is on whoever claims the opposite (e.g., for deprecation purposes). The fact that there exists a higher-level API, which may be convenient for a lot of use cases (like streaming), does not change this.

Granted, there is also a burden of showing that the proposed API is useful. I could point you to how the prototype uses this new function execution API, if that would be helpful. The general idea is that the end-user is driving from PyArrow and registers a UDT. The UDT is a Python-implemented function that may be invoked multiple times, each time at a different source node in the execution plan. Each such invocation returns a stream object, implemented in Python, that is managed in a kernel state. Invoking the kernel returns tabular data that is part of the dynamically generated stream. The new function execution API is designed to enable this setup.

> A FunctionExecutor would only be useful when executing the same function multiple times

Yes, on arguments of the same types.

> What I'd like to hear is when that's beneficial and isn't served by construction of an ExecPlan.

AFAICS, the UDT functionality described above cannot be served by an ExecPlan or by expression binding.

> Kernel preconfiguration is precisely the function of Expression::Bind, among other things:
>
> * [invokes Function::DispatchBest](https://github.com/apache/arrow/blob/40ec95646962cccdcd62032c80e8506d4c275bc6/cpp/src/arrow/compute/exec/expression.cc#L372) to acquire a kernel and types for implicit casts
>
> * [caches the Kernel and its state](https://github.com/apache/arrow/blob/40ec95646962cccdcd62032c80e8506d4c275bc6/cpp/src/arrow/compute/exec/expression.h#L54-L58) for later use in execution
>
> * note that currently Expression execution assumes only scalar functions are referenced and that KernelState is not mutated

IIUC, C++ code (for expression binding) is the driver here. In the prototype's design, it is end-user code, via PyArrow, that is the driver. It's not clear how to reconcile the two in a proposal based on expression binding. At the least, this would need to be explained in the context of a more complete description of an alternative.
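
To make the UDT setup described earlier in this comment more concrete, here is a minimal pure-Python sketch of the idea. It is not the prototype's code and uses no PyArrow or Arrow C++ API: the names `StreamKernelState`, `make_source_kernel`, and `counting_udt` are hypothetical, and plain dicts stand in for tabular data. It only illustrates that each invocation of the UDT (one per source node) yields its own stream object, which is held in a kernel state and advanced each time the kernel is invoked.

```python
from typing import Callable, Dict, Iterator, List, Optional

# Stand-in for tabular data: column name -> list of values (assumption, not Arrow).
Batch = Dict[str, List[int]]


class StreamKernelState:
    """Hypothetical kernel state holding the stream returned by one UDT invocation."""

    def __init__(self, stream: Iterator[Batch]) -> None:
        self.stream = stream


def make_source_kernel(udt: Callable[[], Iterator[Batch]]) -> Callable[[], Optional[Batch]]:
    """Invoke the UDT once, as a single source node would, and return a kernel."""
    state = StreamKernelState(udt())

    def kernel() -> Optional[Batch]:
        # Each kernel invocation returns the next batch, or None at end of stream.
        return next(state.stream, None)

    return kernel


def counting_udt() -> Iterator[Batch]:
    """Hypothetical user-defined UDT: a generator producing two small batches."""
    for i in range(2):
        yield {"x": list(range(3 * i, 3 * i + 3))}


# Two source nodes invoke the same UDT; each gets an independent stream.
kernel_a = make_source_kernel(counting_udt)
kernel_b = make_source_kernel(counting_udt)

first_a = kernel_a()   # first batch of stream A
second_a = kernel_a()  # second batch of stream A
first_b = kernel_b()   # first batch of stream B, unaffected by A's progress
end_a = kernel_a()     # stream A is exhausted
```

In this sketch the state is mutated on every kernel invocation, which is the property that the quoted note about Expression execution says is currently assumed not to happen.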
