Antoine, sandboxing comes from two places:

1) The WASM specification itself, which puts bounds on the kinds of behavior that are possible
2) The implementation of the chosen WASM bytecode interpreter, as Jorge mentioned in the comment above
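To make the first point concrete: WASM code can only address its own linear memory, and the spec requires an out-of-bounds load or store to trap rather than touch anything outside it. Below is a rough toy model of that rule in plain Python. It is not wasmtime's API and not how a real runtime is implemented; `LinearMemory`, `Trap`, `run_udf` and the two sample UDFs are names made up for illustration, and copying the input into guest memory is an assumed integration strategy.

```python
PAGE_SIZE = 65536  # WASM linear memory is sized in 64 KiB pages


class Trap(Exception):
    """Stands in for a WASM trap, which aborts the guest call."""


class LinearMemory:
    """Toy guest memory: the only bytes the 'guest' can read or write."""

    def __init__(self, pages):
        self._data = bytearray(pages * PAGE_SIZE)

    def _check(self, addr, length):
        # Reject any access that is not fully inside the memory.
        if addr < 0 or length < 0 or addr + length > len(self._data):
            raise Trap(f"out-of-bounds access: addr={addr}, len={length}")

    def load(self, addr, length):
        self._check(addr, length)
        return bytes(self._data[addr:addr + length])

    def store(self, addr, payload):
        self._check(addr, len(payload))
        self._data[addr:addr + len(payload)] = payload


def run_udf(host_buffer, udf):
    """Copy input into guest memory, run the UDF, copy the result out."""
    memory = LinearMemory(pages=1)
    memory.store(0, host_buffer)
    out_addr, out_len = udf(memory, 0, len(host_buffer))
    return memory.load(out_addr, out_len)  # the host's read is checked too


def increment_udf(memory, addr, length):
    """Well-behaved UDF: adds 1 to each input byte (wrapping at 256)."""
    data = memory.load(addr, length)
    out_addr = addr + length
    memory.store(out_addr, bytes((b + 1) % 256 for b in data))
    return out_addr, length


def rogue_udf(memory, addr, length):
    """Misbehaving UDF: tries to read past the end of guest memory."""
    memory.load(PAGE_SIZE - 1, 16)
    return addr, length


print(run_udf(b"\x01\x02\x03", increment_udf))
try:
    run_udf(b"\x01\x02\x03", rogue_udf)
except Trap as exc:
    print("trapped:", exc)
```

The second call raises `Trap` instead of returning data, which is the property the question was about: the guest never holds a raw pointer into host memory, only offsets into its own sandboxed buffer.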

FWIW, the wasmtime docs have a pretty solid section covering the sandboxing guarantees of WASM itself, followed by the runtime-specific behavior and capabilities of wasmtime:
https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core

On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou wrote:

>
> On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
> >> Would WASM be able to interact in-process with non-WASM buffers safely?
> >
> > AFAIK yes. My understanding from playing with it in JS is that a
> > WASM-backed udf execution would be something like:
> >
> > 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> > 2. provide a small WASM-compiled middleware of the c data interface that
> > consumes (binary, c data interface pointers)
> > 3. ship a WASM interpreter as part of the query engine
> > 4. pass binary and c data interface pointers from the query engine program
> > to the interpreter with WASM-compiled middleware
>
> Ok, but the key word in my question was "safely". What mechanisms are in
> place such that the WASM user function will not access Arrow buffers out
> of bounds? Nothing really stands out in
> https://webassembly.github.io/spec/core/index.html, but it's the first
> time I try to have a look at the WebAssembly spec.
>
> Regards
>
> Antoine.
>
>
> >
> > Step 2 is necessary to read the buffers from FFI and output the result back
> > from the interpreter once the UDF is done, similar to what we do in
> > datafusion to run Python from Rust. In the case of datafusion the "binary"
> > is a Python function, which has security implications since the Python
> > interpreter allows everything by default.
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou wrote:
> >
> >>
> >> On 25/04/2022 at 23:04, David Li wrote:
> >>> The WebAssembly documentation has a rundown of the techniques used:
> >>> https://webassembly.org/docs/security/
> >>>
> >>> I think usually you would run WASM in-process, though we could indeed
> >>> also put it in a subprocess to further isolate things.
> >>
> >> Would WASM be able to interact in-process with non-WASM buffers safely?
> >> It's not obvious from reading the page above.
> >>
> >>
> >>>
> >>> It would be interesting to define the Flight "harness" protocol.
> >>> Handling heterogeneous arguments may require some evolution in Flight (e.g.
> >>> if the function is non scalar and arguments are of different length - we'd
> >>> need something like the ColumnBag proposal, so this might be a good reason
> >>> to revive that).
> >>>
> >>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> >>>> On 25/04/2022 at 22:19, Wes McKinney wrote:
> >>>>> I was going to reply to this e-mail thread on user@ but thought I
> >>>>> would start a new thread on dev@.
> >>>>>
> >>>>> Executing user-defined functions in memory, especially untrusted
> >>>>> functions, in general is unsafe. For "trusted" functions, having an
> >>>>> in-memory API for writing them in user languages is very useful. I
> >>>>> remember tinkering with adding UDFs in Impala with LLVM IR, which
> >>>>> would allow UDFs to have performance consistent with built-ins
> >>>>> (because built-in functions are all inlined into code-generated
> >>>>> expressions), but segfaults would bring down the server, so only
> >>>>> admins could be trusted to add new UDFs.
> >>>>>
> >>>>> However, I wonder if we should eventually define an "external UDF"
> >>>>> protocol and an example UDF "harness", using Flight to do RPC across
> >>>>> the process boundaries. So the idea is that an external local UDF
> >>>>> Flight execution service is spun up, and then data is sent to the UDF
> >>>>> in a DoExchange call.
> >>>>>
> >>>>> As Jacques pointed out in an interview [1], a compelling solution to
> >>>>> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> >>>>> functions to be run safely in-process.
> >>>>
> >>>> How does the sandboxing work in this case? Is it simply executing in a
> >>>> separate process with restricted capabilities, or are other mechanisms
> >>>> put in place?
> >>
> >
>
