1. This is probably best discussed on the Ibis mailing list (or GitHub? I'm not sure how the Ibis community communicates).
2. If we consider "encoding the embedded function in Substrait" to be part of (2), then this can be discussed in the Substrait community. That said, in the sync call last week we had a brief discussion and agreed to start with the EmbeddedFunction message, add an "any" version, and then propose a PR to standardize it. We will still need to find some spot to encode the embedded function into the plan itself, however.

3. A PR is in progress (pretty close to merging) to expose the Substrait consumer to Python. I think this will be good enough for MVP purposes. Check out [1].

> we are not sure how to proceed with (3), so wondering how to
> deserialize the UDF from Substrait and execute it
> using existing UDF code. Any thoughts?

Good questions. I haven't given this a whole lot of thought yet, so take these suggestions with a grain of salt.

Probably the minimal-code-change approach would be to implement [2] and call the unregister from the ExecPlan's destructor. However, this will eventually lead to naming conflicts if multiple UDFs are running in parallel. We could generate unique names to work around this.

We could also consider a second function registry, unique to the plan, combined with the master function registry via some kind of facade registry that prefers local UDFs and falls back to process-wide UDFs. I think that might be the best long-term solution.
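To make the facade-registry idea concrete, here is a minimal pure-Python sketch. None of these class or function names are existing Arrow APIs; they only illustrate the resolution order (plan-local registry first, then the master registry) and unique-name generation to avoid conflicts between concurrently running plans:

```python
# Hypothetical sketch of the facade-registry idea: a plan-scoped
# registry consulted first, with fallback to the process-wide (master)
# registry. None of these names are real Arrow APIs.
import uuid


class FunctionRegistry:
    """A trivial name -> callable registry."""

    def __init__(self):
        self._functions = {}

    def register(self, name, func):
        if name in self._functions:
            raise ValueError(f"function {name!r} already registered")
        self._functions[name] = func

    def lookup(self, name):
        return self._functions.get(name)


class FacadeRegistry:
    """Resolves names in the plan-local registry first, then the master one."""

    def __init__(self, local, master):
        self._local = local
        self._master = master

    def lookup(self, name):
        func = self._local.lookup(name)
        return func if func is not None else self._master.lookup(name)


def unique_udf_name(base="udf"):
    # Unique names sidestep conflicts when multiple plans register UDFs
    # against a shared process-wide registry at the same time.
    return f"{base}_{uuid.uuid4().hex}"


master = FunctionRegistry()
master.register("add_one", lambda x: x + 1)
master.register("sub_one", lambda x: x - 1)

local = FunctionRegistry()
local.register("add_one", lambda x: x + 10)  # plan-local override wins

facade = FacadeRegistry(local, master)
print(facade.lookup("add_one")(1))   # 11: local version preferred
print(facade.lookup("sub_one")(5))   # 4: falls back to the master registry
```

The same facade could later be handed to the ExecPlan so that plan-local registrations disappear with the plan, rather than polluting the process-wide registry.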
[1] https://github.com/apache/arrow/pull/12672
[2] https://issues.apache.org/jira/browse/ARROW-16211

On Mon, May 16, 2022 at 5:06 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> For context, this is a continuation of a previous email discussion, "RPC:
> Out of Process Python UDFs in Arrow Compute", where we identified the
> reasonable steps to solve as:
>
> (1) A way to serialize the user function (and function parameters) (in the
> doc I proposed a cloudpickle-based approach similar to how Spark does it)
> (2) Add a Substrait relation/expression that captures the UDF
> (3) Deserialize the Substrait relation/expression in Arrow compute and
> execute the UDF (either using the approach in the current Scalar UDF
> prototype or do something else)
> (Same as the Yaron layout above.)
>
> Now I think we have reasonable solutions for (1) and (2) (at least for PoC
> purposes), but we are not sure how to proceed with (3), so wondering how to
> deserialize the UDF from Substrait and execute it using existing UDF code.
> Any thoughts?
>
> Li
>
>
> On Sun, May 15, 2022 at 8:02 AM Yaron Gvili <rt...@hotmail.com> wrote:
> >
> > Hi,
> >
> > I'm working on a Python UDFs PoC and would like to get the community's
> > feedback on its design.
> >
> > The goal of this PoC is to enable a user to integrate Python UDFs in an
> > Ibis/Substrait/Arrow workflow. The basic idea is that the user would create
> > an Ibis expression that includes Python UDFs implemented by the user, then
> > use a single invocation to:
> >
> > 1. produce a Substrait plan for the Ibis expression;
> > 2. deliver the Substrait plan to Arrow;
> > 3. consume the Substrait plan in Arrow to obtain an execution plan;
> > 4. run the execution plan;
> > 5. deliver back the resulting table outputs of the plan.
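Step (1) in Li's list — serializing the user function to a text form that can be embedded in a Substrait/protobuf message — can be sketched as follows. The actual proposal uses cloudpickle (which additionally handles lambdas and closures); stdlib pickle is used here only to keep the sketch dependency-free, and base64 produces the text form:

```python
# Sketch of step (1): serialize a user function into text suitable for
# embedding in a protobuf string field. The real proposal uses
# cloudpickle; stdlib pickle is substituted here to keep the example
# self-contained (it works the same way for module-level functions).
import base64
import pickle


def add_one(x):
    return x + 1


def serialize_udf(func):
    """Pickle the function and encode the bytes as ASCII text."""
    return base64.b64encode(pickle.dumps(func)).decode("ascii")


def deserialize_udf(text):
    """Inverse of serialize_udf."""
    return pickle.loads(base64.b64decode(text.encode("ascii")))


payload = serialize_udf(add_one)   # plain text, safe to put in a protobuf field
restored = deserialize_udf(payload)
print(restored(41))                # 42
```

Note that stdlib pickle serializes a named function by reference, so the receiving process would need the definition available; cloudpickle serializes by value, which is why it (like in Spark) is the better fit for shipping UDFs between processes.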
> >
> > I already have the above steps implemented, with support for some existing
> > execution nodes and compute functions in Arrow, but without support for
> > Python UDFs, which is the focus of this discussion. To add this support I
> > am thinking about a design where:
> >
> >   * Ibis is extended to support an expression of type Python UDF, which
> >     carries (text-form) serialized UDF code (likely using cloudpickle)
> >   * Ibis-Substrait is extended to support an extension function of type
> >     Python UDF, which also carries (text-within-protobuf-form) serialized
> >     UDF code
> >   * PyArrow is extended to process a Substrait plan that includes
> >     serialized Python UDFs using these steps:
> >       * Deserialize the Python UDFs
> >       * Register the Python UDFs (preferably, in a local scope that is
> >         cleared after processing ends)
> >       * Run the execution plan corresponding to the Substrait plan
> >       * Return the resulting table outputs of the plan
> >
> > Note that this design requires the user to drive PyArrow. It does not
> > support driving Arrow C++ to process such a Substrait plan (it does not
> > start an embedded Python interpreter).
> >
> >
> > Cheers,
> > Yaron.
> >
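The "register in a local scope that is cleared after processing ends" step in Yaron's design suggests a scoped-registration lifecycle. Here is a minimal sketch under stated assumptions: a plain dict stands in for Arrow's real function registry, and `run_plan` stands in for consuming and executing the Substrait plan (none of these names are existing PyArrow APIs):

```python
# Sketch of plan-scoped UDF registration: register for the duration of
# plan execution, then unregister, even on failure. A plain dict stands
# in for Arrow's function registry; all names here are hypothetical.
from contextlib import contextmanager

REGISTRY = {}  # stand-in for the process-wide function registry


@contextmanager
def scoped_udfs(udfs):
    """Register UDFs for the duration of plan execution, then clear them."""
    for name, func in udfs.items():
        if name in REGISTRY:
            raise ValueError(f"name conflict: {name!r}")
        REGISTRY[name] = func
    try:
        yield REGISTRY
    finally:
        for name in udfs:
            REGISTRY.pop(name, None)   # cleared even if execution raised


def run_plan(registry, udf_name, batch):
    # Stand-in for consuming the Substrait plan and running the ExecPlan.
    return [registry[udf_name](x) for x in batch]


with scoped_udfs({"plan_udf_1": lambda x: x * 2}) as reg:
    print(run_plan(reg, "plan_udf_1", [1, 2, 3]))  # [2, 4, 6]
print("plan_udf_1" in REGISTRY)                    # False: cleared on exit
```

Tying the cleanup to a context manager (or, in C++, to the ExecPlan's destructor as suggested above) keeps the registry from accumulating stale per-plan UDFs.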