Hi Li, have you seen the Python UDF prototype that we recently merged into the execution engine at [1]? It adds support for scalar UDFs.
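For concreteness, here is a rough sketch of how a scalar UDF registers at the kernel function level under that prototype; the exact names (register_scalar_function, the context argument, the doc dict) are illustrative from memory and may not match the merged code one-for-one:

    import pyarrow as pa
    import pyarrow.compute as pc

    # A scalar UDF operating element-wise on its input column.
    # The engine passes a context object as the first argument.
    def add_one(ctx, x):
        return pc.add(x, 1)

    # Register the UDF with the compute function registry so it can be
    # invoked by name like any built-in kernel. Names/signature here are
    # a sketch and may differ slightly from the merged prototype.
    pc.register_scalar_function(
        add_one,
        "add_one",
        {"summary": "add one", "description": "Adds 1 to each element"},
        {"x": pa.int64()},
        pa.int64(),
    )

    # Call through the regular compute dispatch path.
    result = pc.call_function("add_one", [pa.array([1, 2, 3], type=pa.int64())])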
Comparing your proposal to what we've done so far, I would ask:

1. Why do you want to run these UDFs in a separate process? Is this for robustness (if the UDF crashes, process recovery is easier) or for performance (potentially working around the GIL by using multiple Python processes)? Have you given much thought to how the data would travel back and forth between the processes, for example via sockets (maybe Flight) or shared memory?

2. The current implementation doesn't address serialization of UDFs. I'm not sure a separate ExecNode would be necessary. So far we've implemented UDFs at the kernel function level, and I think we can continue to do that even if we are calling out-of-process workers.

[1] https://github.com/apache/arrow/pull/12590

On Wed, May 4, 2022 at 6:12 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> Hello,
>
> I have a somewhat controversial idea to introduce a "bridge" solution for
> Python UDFs in Arrow Compute and have written up my thoughts in this
> proposal:
>
> https://docs.google.com/document/d/1s7Gchq_LoNuiZO5bHq9PZx9RdoCWSavuS58KrTYXVMU/edit?usp=sharing
>
> I am curious to hear what the community thinks about this. (I am ready to
> take criticism :) )
>
> I wrote this in just 1-2 hours, so I'm happy to explain anything that is
> unclear.
>
> Appreciate your feedback,
> Li