Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

David Li Tue, 26 Apr 2022 07:58:54 -0700

Ah, fair point Antoine. Yes, I believe you are expected to copy data in/out 
right now: https://github.com/WebAssembly/design/issues/1162
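
To make the copy-in/copy-out point concrete, here is a minimal sketch in plain Python. It is a toy model, not a real WASM runtime: `LinearMemory`, `add_one_udf`, and `call_udf` are hypothetical names invented for illustration, and the 64 KiB size just mirrors one WASM page. The assumption is that the guest can only address its own bounds-checked linear memory, so the host has to copy Arrow buffer bytes into that memory before the call and copy the result back out afterwards.

```python
import struct


class LinearMemory:
    """Toy stand-in for a WASM linear memory: one contiguous byte array
    in which every access is bounds-checked."""

    def __init__(self, size):
        self._buf = bytearray(size)
        self._next = 0  # bump-allocator cursor (hypothetical, for the sketch only)

    def alloc(self, nbytes):
        # Hand out the next free region, or fail if the memory is full.
        if self._next + nbytes > len(self._buf):
            raise MemoryError("sandbox memory exhausted")
        offset = self._next
        self._next += nbytes
        return offset

    def _check(self, offset, nbytes):
        # A real runtime would trap here; the toy model raises IndexError.
        if offset < 0 or nbytes < 0 or offset + nbytes > len(self._buf):
            raise IndexError("out-of-bounds access")

    def write(self, offset, data):
        self._check(offset, len(data))
        self._buf[offset:offset + len(data)] = data

    def read(self, offset, nbytes):
        self._check(offset, nbytes)
        return bytes(self._buf[offset:offset + nbytes])


def add_one_udf(mem, offset, count):
    """Hypothetical 'guest' UDF: it only ever touches `mem`, never host objects."""
    values = struct.unpack(f"<{count}i", mem.read(offset, 4 * count))
    out = mem.alloc(4 * count)
    mem.write(out, struct.pack(f"<{count}i", *(v + 1 for v in values)))
    return out


def call_udf(host_values, udf, mem_size=65536):
    """Host side: copy int32 values in, run the UDF, copy the result out."""
    mem = LinearMemory(mem_size)
    count = len(host_values)
    in_off = mem.alloc(4 * count)
    mem.write(in_off, struct.pack(f"<{count}i", *host_values))  # copy in
    out_off = udf(mem, in_off, count)
    raw = mem.read(out_off, 4 * count)  # copy out
    return list(struct.unpack(f"<{count}i", raw))
```

For example, `call_udf([1, 2, 3], add_one_udf)` round-trips the values through the toy memory, and any read past the end of that memory fails instead of reaching host data. The two copies are the cost being discussed here.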


On Tue, Apr 26, 2022, at 10:43, Antoine Pitrou wrote:
> On 26/04/2022 at 16:30, Gavin Ray wrote:
>> Antoine, sandboxing comes into play from two places:
>> 
>> 1) The WASM specification itself, which puts bounds on the types of
>> behaviors possible
>> 2) The implementation of the WASM bytecode interpreter chosen, like Jorge
>> mentioned in the comment above
>> 
>> The wasmtime docs have a pretty solid section covering the sandboxing
>> guarantees of WASM, and then the interpreter-specific behavior/abilities of
>> wasmtime FWIW:
>> https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
>
> This doesn't really answer my question, does it?
>
>
>> 
>> On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou <[email protected]> wrote:
>> 
>>>
>>> On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
>>>>> Would WASM be able to interact in-process with non-WASM buffers safely?
>>>>
>>>> AFAIK yes. My understanding from playing with it in JS is that a
>>>> WASM-backed udf execution would be something like:
>>>>
>>>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
>>>> 2. provide a small WASM-compiled middleware of the c data interface that
>>>> consumes (binary, c data interface pointers)
>>>> 3. ship a WASM interpreter as part of the query engine
>>>> 4. pass binary and c data interface pointers from the query engine program
>>>> to the interpreter with WASM-compiled middleware
>>>
>>> Ok, but the key word in my question was "safely". What mechanisms are in
>>> place such that the WASM user function will not access Arrow buffers out
>>> of bounds? Nothing really stands out in
>>> https://webassembly.github.io/spec/core/index.html, but this is the first
>>> time I have looked at the WebAssembly spec.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>>>
>>>> Step 2 is necessary to read the buffers from FFI and output the result back
>>>> from the interpreter once the UDF is done, similar to what we do in
>>>> datafusion to run Python from Rust. In the case of datafusion the "binary"
>>>> is a Python function, which has security implications since the Python
>>>> interpreter allows everything by default.
>>>>
>>>> Best,
>>>> Jorge
>>>>
>>>>
>>>>
>>>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou <[email protected]> wrote:
>>>>
>>>>>
>>>>> On 25/04/2022 at 23:04, David Li wrote:
>>>>>> The WebAssembly documentation has a rundown of the techniques used:
>>>>>> https://webassembly.org/docs/security/
>>>>>>
>>>>>> I think usually you would run WASM in-process, though we could indeed
>>>>>> also put it in a subprocess to further isolate things.
>>>>>
>>>>> Would WASM be able to interact in-process with non-WASM buffers safely?
>>>>> It's not obvious from reading the page above.
>>>>>
>>>>>
>>>>>>
>>>>>> It would be interesting to define the Flight "harness" protocol.
>>>>>> Handling heterogeneous arguments may require some evolution in Flight (e.g.
>>>>>> if the function is non scalar and arguments are of different length - we'd
>>>>>> need something like the ColumnBag proposal, so this might be a good reason
>>>>>> to revive that).
>>>>>>
>>>>>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
>>>>>>> On 25/04/2022 at 22:19, Wes McKinney wrote:
>>>>>>>> I was going to reply to this e-mail thread on user@ but thought I
>>>>>>>> would start a new thread on dev@.
>>>>>>>>
>>>>>>>> Executing user-defined functions in memory, especially untrusted
>>>>>>>> functions, in general is unsafe. For "trusted" functions, having an
>>>>>>>> in-memory API for writing them in user languages is very useful. I
>>>>>>>> remember tinkering with adding UDFs in Impala with LLVM IR, which
>>>>>>>> would allow UDFs to have performance consistent with built-ins
>>>>>>>> (because built-in functions are all inlined into code-generated
>>>>>>>> expressions), but segfaults would bring down the server, so only
>>>>>>>> admins could be trusted to add new UDFs.
>>>>>>>>
>>>>>>>> However, I wonder if we should eventually define an "external UDF"
>>>>>>>> protocol and an example UDF "harness", using Flight to do RPC across
>>>>>>>> the process boundaries. So the idea is that an external local UDF
>>>>>>>> Flight execution service is spun up, and then data is sent to the UDF
>>>>>>>> in a DoExchange call.
>>>>>>>>
>>>>>>>> As Jacques pointed out in an interview [1], a compelling solution to
>>>>>>>> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
>>>>>>>> functions to be run safely in-process.
>>>>>>>
>>>>>>> How does the sandboxing work in this case? Is it simply executing in a
>>>>>>> separate process with restricted capabilities, or are other mechanisms
>>>>>>> put in place?
>>>>>
>>>>
>>>
>>
