Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

Gavin Ray Tue, 26 Apr 2022 07:30:52 -0700

Antoine, sandboxing comes into play from two places:

1) The WASM specification itself, which puts bounds on the kinds of
behavior possible
2) The implementation of the WASM bytecode interpreter chosen, like Jorge
mentioned in the comment above


The wasmtime docs have a pretty solid section covering the sandboxing
guarantees of WASM, and then the interpreter-specific behavior/abilities of
wasmtime FWIW:
https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
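
To make point 1 a bit more concrete, here is a toy model in plain Python of
the linear-memory idea: the guest can only address its own byte array, every
access is bounds-checked, and an out-of-bounds access traps instead of
touching host memory. This is only a sketch under assumptions. LinearMemory,
Trap, guest_sum_int32 and host_call are hypothetical names invented for the
illustration, not wasmtime or Arrow APIs, and a real engine enforces the
checks in the runtime rather than in guest-visible code.

```python
import struct


class Trap(Exception):
    """Models a WebAssembly trap: the guest call aborts, the host survives."""


class LinearMemory:
    """Toy model of a WASM linear memory: a private, bounds-checked byte array."""

    def __init__(self, size):
        self._bytes = bytearray(size)

    def _check(self, offset, length):
        # Every load/store is validated against the memory's current size.
        if offset < 0 or length < 0 or offset + length > len(self._bytes):
            raise Trap(f"out of bounds memory access at {offset}+{length}")

    def store(self, offset, data):
        self._check(offset, len(data))
        self._bytes[offset:offset + len(data)] = data

    def load(self, offset, length):
        self._check(offset, length)
        return bytes(self._bytes[offset:offset + length])


def guest_sum_int32(mem, offset, count):
    """Hypothetical guest UDF: sums `count` little-endian int32 values."""
    total = 0
    for i in range(count):
        (value,) = struct.unpack("<i", mem.load(offset + 4 * i, 4))
        total += value
    return total


def host_call(values, claimed_count):
    """Hypothetical host side: copy a buffer into guest memory, then call the UDF."""
    host_buffer = struct.pack(f"<{len(values)}i", *values)
    mem = LinearMemory(64)        # the guest sees only these 64 bytes
    mem.store(0, host_buffer)     # the host decides what gets copied in
    try:
        return guest_sum_int32(mem, 0, claimed_count)
    except Trap as exc:
        return f"trap: {exc}"
```

Calling host_call([1, 2, 3], 3) returns the sum, while host_call([1, 2, 3], 100)
ends in a trap once the guest runs past the end of its own memory; at no
point can it read the host's original buffer directly.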

On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou <[email protected]> wrote:

>
> Le 26/04/2022 à 16:18, Jorge Cardoso Leitão a écrit :
> >> Would WASM be able to interact in-process with non-WASM buffers safely?
> >
> > AFAIK yes. My understanding from playing with it in JS is that a
> > WASM-backed udf execution would be something like:
> >
> > 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> > 2. provide a small WASM-compiled middleware of the C data interface that
> > consumes (binary, C data interface pointers)
> > 3. ship a WASM interpreter as part of the query engine
> > 4. pass binary and C data interface pointers from the query engine program
> > to the interpreter with WASM-compiled middleware
>
> Ok, but the key word in my question was "safely". What mechanisms are in
> place such that the WASM user function will not access Arrow buffers out
> of bounds? Nothing really stands out in
> https://webassembly.github.io/spec/core/index.html, but it's the first
> time I try to have a look at the WebAssembly spec.
>
> Regards
>
> Antoine.
>
>
> >
> > Step 2 is necessary to read the buffers from FFI and output the result back
> > from the interpreter once the UDF is done, similar to what we do in
> > datafusion to run Python from Rust. In the case of datafusion the "binary"
> > is a Python function, which has security implications since the Python
> > interpreter allows everything by default.
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou <[email protected]> wrote:
> >
> >>
> >> Le 25/04/2022 à 23:04, David Li a écrit :
> >>> The WebAssembly documentation has a rundown of the techniques used:
> >>> https://webassembly.org/docs/security/
> >>>
> >>> I think usually you would run WASM in-process, though we could indeed
> >>> also put it in a subprocess to further isolate things.
> >>
> >> Would WASM be able to interact in-process with non-WASM buffers safely?
> >> It's not obvious from reading the page above.
> >>
> >>
> >>>
> >>> It would be interesting to define the Flight "harness" protocol.
> >>> Handling heterogeneous arguments may require some evolution in Flight (e.g.
> >>> if the function is non scalar and arguments are of different length -
> >>> we'd need something like the ColumnBag proposal, so this might be a good
> >>> reason to revive that).
> >>>
> >>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> >>>> Le 25/04/2022 à 22:19, Wes McKinney a écrit :
> >>>>> I was going to reply to this e-mail thread on user@ but thought I
> >>>>> would start a new thread on dev@.
> >>>>>
> >>>>> Executing user-defined functions in memory, especially untrusted
> >>>>> functions, in general is unsafe. For "trusted" functions, having an
> >>>>> in-memory API for writing them in user languages is very useful. I
> >>>>> remember tinkering with adding UDFs in Impala with LLVM IR, which
> >>>>> would allow UDFs to have performance consistent with built-ins
> >>>>> (because built-in functions are all inlined into code-generated
> >>>>> expressions), but segfaults would bring down the server, so only
> >>>>> admins could be trusted to add new UDFs.
> >>>>>
> >>>>> However, I wonder if we should eventually define an "external UDF"
> >>>>> protocol and an example UDF "harness", using Flight to do RPC across
> >>>>> the process boundaries. So the idea is that an external local UDF
> >>>>> Flight execution service is spun up, and then data is sent to the UDF
> >>>>> in a DoExchange call.
> >>>>>
> >>>>> As Jacques pointed out in an interview [1], a compelling solution to
> >>>>> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> >>>>> functions to be run safely in-process.
> >>>>
> >>>> How does the sandboxing work in this case? Is it simply executing in a
> >>>> separate process with restricted capabilities, or are other mechanisms
> >>>> put in place?
> >>
> >
>
