Hi,
I'm working on a Python UDF PoC and would like to get the community's feedback
on its design.
The goal of this PoC is to enable a user to integrate Python UDFs in an
Ibis/Substrait/Arrow workflow. The basic idea is that the user would create an
Ibis expression that includes Python UDFs implemented by the user, then use a
single invocation to:
1. produce a Substrait plan for the Ibis expression;
2. deliver the Substrait plan to Arrow;
3. consume the Substrait plan in Arrow to obtain an execution plan;
4. run the execution plan;
5. deliver back the resulting table outputs of the plan.
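For concreteness, the single invocation covering steps 1-5 could look roughly
like the sketch below. The names here are hypothetical: `run_expression` and
the three stage callables are placeholders, not existing Ibis, Ibis-Substrait
or PyArrow APIs, and the toy stages use JSON in place of a real Substrait plan.

```python
import json
from typing import Any, Callable, List


def run_expression(
    expr: Any,
    compile_to_substrait: Callable[[Any], bytes],  # hypothetical stage
    consume_plan: Callable[[bytes], Any],          # hypothetical stage
    execute: Callable[[Any], List[Any]],           # hypothetical stage
) -> List[Any]:
    """Drive steps 1-5 with a single call (illustrative only)."""
    plan_bytes = compile_to_substrait(expr)  # 1. produce the plan
    # 2. deliver the plan: here just an in-process hand-off of bytes
    exec_plan = consume_plan(plan_bytes)     # 3. obtain an execution plan
    tables = execute(exec_plan)              # 4. run the execution plan
    return tables                            # 5. deliver the outputs back


# Toy stand-ins so the sketch runs; a JSON blob plays the role of the plan.
def toy_compile(expr):
    return json.dumps(expr).encode("utf-8")


def toy_consume(plan_bytes):
    return json.loads(plan_bytes.decode("utf-8"))


def toy_execute(plan):
    return [[v * plan["scale"] for v in plan["values"]]]


tables = run_expression(
    {"values": [1, 2, 3], "scale": 2}, toy_compile, toy_consume, toy_execute
)
```

The point of the shape is that the user only sees `run_expression`; the plan
bytes never have to be handled by user code.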
I already have the above steps implemented, with support for some existing
execution nodes and compute functions in Arrow, but without support for
Python UDFs, which is the focus of this discussion. To add this support, I am
thinking about a design where:
* Ibis is extended to support an expression of type Python UDF, which
  carries (text-form) serialized UDF code (likely using cloudpickle)
* Ibis-Substrait is extended to support an extension function of type
  Python UDF, which also carries (text-within-protobuf-form) serialized
  UDF code
* PyArrow is extended to process a Substrait plan that includes serialized
  Python UDFs using these steps:
    * Deserialize the Python UDFs
    * Register the Python UDFs (preferably, in a local scope that is cleared
      after processing ends)
    * Run the execution plan corresponding to the Substrait plan
    * Return the resulting table outputs of the plan
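To make the PyArrow-side steps more concrete, here is a minimal sketch of
serializing a UDF to text, deserializing it, and registering it in a scope
that is cleared once processing ends. It rests on several assumptions: stdlib
`marshal` plus `base64` stand in for cloudpickle (so only closure-free
functions that rely on nothing beyond builtins will work), a plain dict stands
in for Arrow's function registry, and `serialize_udf`, `deserialize_udf` and
`scoped_udfs` are hypothetical names.

```python
import base64
import builtins
import marshal
import types
from contextlib import contextmanager


def serialize_udf(func):
    """Encode a closure-free function's code object as ASCII text.

    Stand-in for cloudpickle: marshal only captures the code object,
    not globals or closures.
    """
    return base64.b64encode(marshal.dumps(func.__code__)).decode("ascii")


def deserialize_udf(text, name):
    """Rebuild a callable from the text form produced by serialize_udf."""
    code = marshal.loads(base64.b64decode(text))
    # Give the function a fresh globals dict that exposes only builtins.
    return types.FunctionType(code, {"__builtins__": builtins}, name)


@contextmanager
def scoped_udfs(serialized):
    """Register deserialized UDFs in a local registry, cleared on exit."""
    registry = {}  # assumption: a dict stands in for a function registry
    try:
        for name, text in serialized.items():
            registry[name] = deserialize_udf(text, name)
        yield registry
    finally:
        registry.clear()


def add_one(x):
    return x + 1


# Text-form UDFs, as they might be carried inside a plan.
plan_udfs = {"add_one": serialize_udf(add_one)}

with scoped_udfs(plan_udfs) as udfs:
    # "Run the plan": apply the registered UDF to a column of values.
    result = [udfs["add_one"](v) for v in [1, 2, 3]]
```

After the `with` block exits, the registry is empty again, which is the
"local scope" behaviour described above. Note that deserializing a UDF means
executing code taken from the plan, so plans should only be accepted from
trusted sources, and that `marshal` output is specific to the Python version.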
Note that this design requires the user to drive PyArrow. It does not support
driving Arrow C++ directly to process such a Substrait plan, because it does
not start an embedded Python interpreter.
Cheers,
Yaron.