I was going to reply to this e-mail thread on user@ but thought I
would start a new thread on dev@.

Executing user-defined functions in memory, especially untrusted
functions, is generally unsafe. For "trusted" functions, having an
in-memory API for writing them in user languages is very useful. I
remember tinkering with adding UDFs in Impala with LLVM IR, which
would allow UDFs to have performance consistent with built-ins
(because built-in functions are all inlined into code-generated
expressions), but segfaults would bring down the server, so only
admins could be trusted to add new UDFs.

However, I wonder if we should eventually define an "external UDF"
protocol and an example UDF "harness", using Flight to do RPC across
the process boundaries. So the idea is that an external local UDF
Flight execution service is spun up, and then data is sent to the UDF
in a DoExchange call.

As Jacques pointed out in an interview [1], a compelling solution to
the UDF sandboxing problem is WASM. This allows "untrusted" WASM
functions to be run safely in-process. However, we would need to
harden and document the details of the interface between the host
language and the user WASM code.

Since there are many potential kinds of user-defined functions
besides scalar functions, the complexity and scope of the
specification work increases accordingly.

- Wes

[1]: 
https://reneeshah.medium.com/how-webassembly-gets-used-the-18-most-exciting-startups-building-with-wasm-939474e951db

On Fri, Apr 22, 2022 at 2:09 PM David Li <lidav...@apache.org> wrote:
>
> This is currently being implemented for Python: 
> https://github.com/apache/arrow/pull/12590 It may not land for 8.0.0 but 
> should be there for 9.0.0, presumably.
>
> It is already possible in C++. The same APIs that built-in functions use to 
> register themselves should be available to applications and there's a fairly 
> trivial example of this in [1]. Such a function would also be available from 
> Python/R/etc. if you could figure out how to package/distribute/load the 
> application library appropriately.
>
> [1]: 
> https://github.com/apache/arrow/blob/e1e782a4542817e8a6139d6d5e022b56abdbc81d/cpp/examples/arrow/compute_register_example.cc
>
> On Fri, Apr 22, 2022, at 15:04, Wenlei Xie wrote:
>
> Hi,
>
> I am wondering if I can define my own Arrow Compute function and use it, say 
> in PyArrow? It looks like Compute Function has a FunctionRegistry, but I 
> didn't find documentation about how to write your own Arrow Compute function 
> (but maybe just didn't find the right place)
>
> Thank you so much!
>
> --
> Best Regards,
> Wenlei Xie
>
> Email: wenlei....@gmail.com
>
>
