For context, this is a continuation of a previous email discussion: "RPC:
Out of Process Python UDFs in Arrow Compute" where we identified reasonable
steps to solve are:
ostef Process Python UDFs
(1) A way to serialize the user function (and function parameters) (in the
doc I proposed to use cloudpickle based approach similar to how Spark does
it) (2) Add a Substrait relation/expression that captures the the UDF (3)
Deserialize the Substrait relation/expression in Arrow compute and execute
the UDF (either using the approach in the current Scalar UDF prototype or
do sth else)
(Same as the Yaron layout above).

Now I think we have reasonable solutions are (1) and (2) (at least for PoC
purpose), but we are not sure how to proceed with (3), so wondering how to
register/execute the deserialize the UDF from substrait and execute it
using existing UDF code. Any thoughts?

Li


On Sun, May 15, 2022 at 8:02 AM Yaron Gvili <rt...@hotmail.com> wrote:

> Hi,
>
> I'm working on a Python UDFs PoC and would like to get the community's
> feedback on its design.
>
> The goal of this PoC is to enable a user to integrate Python UDFs in an
> Ibis/Substrait/Arrow workflow. The basic idea is that the user would create
> an Ibis expression that includes Python UDFs implemented by the user, then
> use a single invocation to:
>
>   1.  produce a Substrait plan for the Ibis expression;
>   2.  deliver the Substrait plan to Arrow;
>   3.  consume the Substrait plan in Arrow to obtain an execution plan;
>   4.  run the execution plan;
>   5.  deliver back the resulting table outputs of the plan.
>
> I already have the above steps implemented, with support for some existing
> execution nodes and compute functions in Arrow, but without the support for
> Python UDFs, which is the focus of this discussion. To add this support I
> am thinking about a design where:
>
>   *   Ibis is extended to support an expression of type Python UDF, which
> carries (text-form) serialized UDF code (likely using cloudpickle)
>   *   Ibis-Substrait is extended to support an extension function of type
> Python UDF, which also carries (text-within-protobuf-form) serialized UDF
> code
>   *   PyArrow is extended to process a Substrait plan that includes
> serialized Python UDFs using these steps:
>      *   Deserialize the Python UDFs
>      *   Register the Python UDFs (preferably, in a local scope that is
> cleared after processing ends)
>      *   Run the execution plan corresponding to the Substrait plan
>      *   Return the resulting table outputs of the plan
>
> Note that this design requires the user to drive PyArrow. It does not
> support driving Arrow C++ to process such a Substrait plan (it does not
> start an embedded Python interpreter).
>
>
> Cheers,
> Yaron.
>

Reply via email to