For context, this is a continuation of a previous email discussion: "RPC: Out of Process Python UDFs in Arrow Compute" where we identified reasonable steps to solve are: ostef Process Python UDFs (1) A way to serialize the user function (and function parameters) (in the doc I proposed to use cloudpickle based approach similar to how Spark does it) (2) Add a Substrait relation/expression that captures the the UDF (3) Deserialize the Substrait relation/expression in Arrow compute and execute the UDF (either using the approach in the current Scalar UDF prototype or do sth else) (Same as the Yaron layout above).
Now I think we have reasonable solutions are (1) and (2) (at least for PoC purpose), but we are not sure how to proceed with (3), so wondering how to register/execute the deserialize the UDF from substrait and execute it using existing UDF code. Any thoughts? Li On Sun, May 15, 2022 at 8:02 AM Yaron Gvili <rt...@hotmail.com> wrote: > Hi, > > I'm working on a Python UDFs PoC and would like to get the community's > feedback on its design. > > The goal of this PoC is to enable a user to integrate Python UDFs in an > Ibis/Substrait/Arrow workflow. The basic idea is that the user would create > an Ibis expression that includes Python UDFs implemented by the user, then > use a single invocation to: > > 1. produce a Substrait plan for the Ibis expression; > 2. deliver the Substrait plan to Arrow; > 3. consume the Substrait plan in Arrow to obtain an execution plan; > 4. run the execution plan; > 5. deliver back the resulting table outputs of the plan. > > I already have the above steps implemented, with support for some existing > execution nodes and compute functions in Arrow, but without the support for > Python UDFs, which is the focus of this discussion. To add this support I > am thinking about a design where: > > * Ibis is extended to support an expression of type Python UDF, which > carries (text-form) serialized UDF code (likely using cloudpickle) > * Ibis-Substrait is extended to support an extension function of type > Python UDF, which also carries (text-within-protobuf-form) serialized UDF > code > * PyArrow is extended to process a Substrait plan that includes > serialized Python UDFs using these steps: > * Deserialize the Python UDFs > * Register the Python UDFs (preferably, in a local scope that is > cleared after processing ends) > * Run the execution plan corresponding to the Substrait plan > * Return the resulting table outputs of the plan > > Note that this design requires the user to drive PyArrow. It does not > support driving Arrow C++ to process such a Substrait plan (it does not > start an embedded Python interpreter). > > > Cheers, > Yaron. >