Ah, fair point Antoine. Yes, I believe you are expected to copy data in/out right now: https://github.com/WebAssembly/design/issues/1162
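
To make the copy-in/copy-out model concrete, here is a rough Python sketch. It is a toy and not any real runtime's API: `LinearMemory`, `Trap`, `guest_add_one` and `run_udf` are names I made up, and a plain `bytearray` stands in for the module's linear memory. The only point is that the guest addresses its own memory by offset, every access is bounds-checked, and the host has to copy the data in and the result back out.

```python
import struct


class Trap(Exception):
    """Raised when the guest touches memory outside its own linear memory."""


class LinearMemory:
    """Toy stand-in (assumption) for a WASM linear memory: private, bounds-checked bytes."""

    def __init__(self, size):
        self._bytes = bytearray(size)

    def _check(self, offset, length):
        # Any access that does not fit entirely inside the memory traps.
        if offset < 0 or length < 0 or offset + length > len(self._bytes):
            raise Trap(f"out-of-bounds access: offset={offset}, length={length}")

    def write(self, offset, data):
        self._check(offset, len(data))
        self._bytes[offset:offset + len(data)] = data

    def read(self, offset, length):
        self._check(offset, length)
        return bytes(self._bytes[offset:offset + length])


def guest_add_one(mem, offset, count):
    """Hypothetical UDF: adds 1 to `count` little-endian int32 values in guest memory."""
    for i in range(count):
        pos = offset + 4 * i
        (value,) = struct.unpack("<i", mem.read(pos, 4))
        mem.write(pos, struct.pack("<i", value + 1))


def run_udf(host_values):
    """Host side: copy values in, run the guest, copy the result out."""
    mem = LinearMemory(64 * 1024)  # one 64 KiB page
    payload = struct.pack(f"<{len(host_values)}i", *host_values)
    mem.write(0, payload)                    # copy in
    guest_add_one(mem, 0, len(host_values))  # guest only ever sees `mem`
    out = mem.read(0, len(payload))          # copy out
    return list(struct.unpack(f"<{len(host_values)}i", out))
```

In this sketch the guest never holds a pointer to a host buffer, so it cannot read an Arrow buffer out of bounds. The price is the two copies in `run_udf`.
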
On Tue, Apr 26, 2022, at 10:43, Antoine Pitrou wrote:
> Le 26/04/2022 à 16:30, Gavin Ray a écrit :
>> Antoine, sandboxing comes into play from two places:
>>
>> 1) The WASM specification itself, which puts a bounds on the types of
>> behaviors possible
>> 2) The implementation of the WASM bytecode interpreter chosen, like Jorge
>> mentioned in the comment above
>>
>> The wasmtime docs have a pretty solid section covering the sandboxing
>> guarantees of WASM, and then the interpreter-specific behavior/abilities of
>> wasmtime FWIW:
>> https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
>
> This doesn't really answer my question, does it?
>
>
>>
>> On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou <[email protected]> wrote:
>>
>>>
>>> Le 26/04/2022 à 16:18, Jorge Cardoso Leitão a écrit :
>>>>> Would WASM be able to interact in-process with non-WASM buffers safely?
>>>>
>>>> AFAIK yes. My understanding from playing with it in JS is that a
>>>> WASM-backed udf execution would be something like:
>>>>
>>>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
>>>> 2. provide a small WASM-compiled middleware of the c data interface that
>>>> consumes (binary, c data interface pointers)
>>>> 3. ship a WASM interpreter as part of the query engine
>>>> 4. pass binary and c data interface pointers from the query engine program
>>>> to the interpreter with WASM-compiled middleware
>>>
>>> Ok, but the key word in my question was "safely". What mechanisms are in
>>> place such that the WASM user function will not access Arrow buffers out
>>> of bounds? Nothing really stands out in
>>> https://webassembly.github.io/spec/core/index.html, but it's the first
>>> time I try to have a look at the WebAssembly spec.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>>>
>>>> Step 2 is necessary to read the buffers from FFI and output the result back
>>>> from the interpreter once the UDF is done, similar to what we do in
>>>> datafusion to run Python from Rust. In the case of datafusion the "binary"
>>>> is a Python function, which has security implications since the Python
>>>> interpreter allows everything by default.
>>>>
>>>> Best,
>>>> Jorge
>>>>
>>>>
>>>>
>>>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou <[email protected]> wrote:
>>>>
>>>>>
>>>>> Le 25/04/2022 à 23:04, David Li a écrit :
>>>>>> The WebAssembly documentation has a rundown of the techniques used:
>>>>>> https://webassembly.org/docs/security/
>>>>>>
>>>>>> I think usually you would run WASM in-process, though we could indeed
>>>>>> also put it in a subprocess to further isolate things.
>>>>>
>>>>> Would WASM be able to interact in-process with non-WASM buffers safely?
>>>>> It's not obvious from reading the page above.
>>>>>
>>>>>
>>>>>>
>>>>>> It would be interesting to define the Flight "harness" protocol.
>>>>>> Handling heterogeneous arguments may require some evolution in Flight (e.g.
>>>>>> if the function is non scalar and arguments are of different length - we'd
>>>>>> need something like the ColumnBag proposal, so this might be a good reason
>>>>>> to revive that).
>>>>>>
>>>>>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
>>>>>>> Le 25/04/2022 à 22:19, Wes McKinney a écrit :
>>>>>>>> I was going to reply to this e-mail thread on user@ but thought I
>>>>>>>> would start a new thread on dev@.
>>>>>>>>
>>>>>>>> Executing user-defined functions in memory, especially untrusted
>>>>>>>> functions, in general is unsafe. For "trusted" functions, having an
>>>>>>>> in-memory API for writing them in user languages is very useful. I
>>>>>>>> remember tinkering with adding UDFs in Impala with LLVM IR, which
>>>>>>>> would allow UDFs to have performance consistent with built-ins
>>>>>>>> (because built-in functions are all inlined into code-generated
>>>>>>>> expressions), but segfaults would bring down the server, so only
>>>>>>>> admins could be trusted to add new UDFs.
>>>>>>>>
>>>>>>>> However, I wonder if we should eventually define an "external UDF"
>>>>>>>> protocol and an example UDF "harness", using Flight to do RPC across
>>>>>>>> the process boundaries. So the idea is that an external local UDF
>>>>>>>> Flight execution service is spun up, and then data is sent to the UDF
>>>>>>>> in a DoExchange call.
>>>>>>>>
>>>>>>>> As Jacques pointed out in an interview [1], a compelling solution to
>>>>>>>> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
>>>>>>>> functions to be run safely in-process.
>>>>>>>
>>>>>>> How does the sandboxing work in this case? Is it simply executing in a
>>>>>>> separate process with restricted capabilities, or are other mechanisms
>>>>>>> put in place?
>>>>>
>>>>
>>>
>>
