Hi Li, have you seen the Python UDF prototype that we recently merged into the execution engine at [1]? It adds support for scalar UDFs.
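For concreteness, here is a rough sketch of how a scalar UDF registers at the kernel function level under that prototype; the exact names (register_scalar_function, the context argument, the doc dict) are illustrative from memory and may not match the merged code one-for-one:

    import pyarrow as pa
    import pyarrow.compute as pc

    # A scalar UDF operating element-wise on its input column.
    # The engine passes a context object as the first argument.
    def add_one(ctx, x):
        return pc.add(x, 1)

    # Register the UDF with the compute function registry so it can be
    # invoked by name like any built-in kernel. Names/signature here are
    # a sketch and may differ slightly from the merged prototype.
    pc.register_scalar_function(
        add_one,
        "add_one",
        {"summary": "add one", "description": "Adds 1 to each element"},
        {"x": pa.int64()},
        pa.int64(),
    )

    # Call through the regular compute dispatch path.
    result = pc.call_function("add_one", [pa.array([1, 2, 3], type=pa.int64())])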
Comparing your proposal to what we've done so far, I would ask:

1. Why do you want to run these UDFs in a separate process? Is this for robustness (if the UDF crashes, process recovery is easier) or for performance (potentially working around the GIL by using multiple Python processes)? Have you given much thought to how the data would travel back and forth between the processes, for example via sockets (maybe Flight) or shared memory?

2. The current implementation doesn't address serialization of UDFs. I'm not sure a separate ExecNode would be necessary. So far we've implemented UDFs at the kernel function level, and I think we can continue to do that even if we are calling out-of-process workers.

[1] https://github.com/apache/arrow/pull/12590

On Wed, May 4, 2022 at 6:12 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> Hello,
>
> I have a somewhat controversial idea to introduce a "bridge" solution for
> Python UDFs in Arrow Compute and have written up my thoughts in this
> proposal:
>
> https://docs.google.com/document/d/1s7Gchq_LoNuiZO5bHq9PZx9RdoCWSavuS58KrTYXVMU/edit?usp=sharing
>
> I am curious to hear what the community thinks about this. (I am ready to
> take criticism :) )
>
> I wrote this in just 1-2 hours, so I'm happy to explain anything that is
> unclear.
>
> Appreciate your feedback,
> Li