Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

Li Jin Tue, 26 Apr 2022 10:39:06 -0700

This is a very interesting topic and one that we care a lot about when
using/thinking about Arrow compute.


I come from Python data analytics, where most of our users use Pandas/NumPy.
This is also my first time learning about WASM; my previous understanding of
a "Python UDF in Arrow C++ compute engine" was more along the lines of:

UDF written in NumPy API -> Using Numba to compile UDF into LLVM IR ->
Execute LLVM IR within Arrow C++ engine on Arrow Arrays

Which in my understanding is similar to UDFs in Impala with LLVM IR that
Wes mentioned.

I wonder how WASM would change things. A couple of questions:
(1) What is the advantage of using WASM instead of something like LLVM IR?
(2) Do we envision using something like the NumPy API as the language for
writing these UDFs, or something completely different? (Another DSL?)

Li

On Tue, Apr 26, 2022 at 11:04 AM Weston Pace <[email protected]> wrote:

> In addition to the memory copy it looks like WASM is going to bounds
> check all loads/stores.  It does, at least, have some vectorized
> load/store operations so that can help amortize the cost.  It appears
> you aren't going to get the same performance as native today using
> WASM but I'm guessing that is an active area of research and
> investment.
>
> On Tue, Apr 26, 2022 at 5:00 AM Jorge Cardoso Leitão
> <[email protected]> wrote:
> >
> > I need to correct myself here - it is currently not possible to pass memory
> > at zero cost between the engine and WASM interpreter. This is related to
> > your point about safety - WASM provides memory safety guarantees because it
> > controls the memory region that it can read from and write to. Therefore,
> > currently passing data from and into WASM requires a memcopy.
> >
> > There is a proposal [1] to improve the situation, but currently would incur
> > a cost in the query engine, since we would need to memcopy the regions
> > around.
> >
> > I forgot that on my poc I passed the parquet file from js to WASM and
> > de-serialized it to arrow directly in wasm - so memory was already being
> > allocated from within WASM sandbox, not JS. Sorry for the confusion.
> >
> > [1] https://github.com/WebAssembly/design/issues/1439
> >
> > Best,
> > Jorge
> >
> >
> >
> > On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou <[email protected]> wrote:
> >
> > >
> > > Le 26/04/2022 à 16:30, Gavin Ray a écrit :
> > > > Antoine, sandboxing comes into play from two places:
> > > >
> > > > 1) The WASM specification itself, which puts bounds on the types of
> > > > behaviors possible
> > > > 2) The implementation of the WASM bytecode interpreter chosen, like Jorge
> > > > mentioned in the comment above
> > > >
> > > > The wasmtime docs have a pretty solid section covering the sandboxing
> > > > guarantees of WASM, and then the interpreter-specific behavior/abilities of
> > > > wasmtime FWIW:
> > > > https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
> > >
> > > This doesn't really answer my question, does it?
> > >
> > >
> > > >
> > > > On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou <[email protected]> wrote:
> > > >
> > > >>
> > > >> Le 26/04/2022 à 16:18, Jorge Cardoso Leitão a écrit :
> > > >>>> Would WASM be able to interact in-process with non-WASM buffers safely?
> > > >>>
> > > >>> AFAIK yes. My understanding from playing with it in JS is that a
> > > >>> WASM-backed udf execution would be something like:
> > > >>>
> > > >>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> > > >>> 2. provide a small WASM-compiled middleware of the c data interface that
> > > >>> consumes (binary, c data interface pointers)
> > > >>> 3. ship a WASM interpreter as part of the query engine
> > > >>> 4. pass binary and c data interface pointers from the query engine program
> > > >>> to the interpreter with WASM-compiled middleware
> > > >>
> > > >> Ok, but the key word in my question was "safely". What mechanisms are in
> > > >> place such that the WASM user function will not access Arrow buffers out
> > > >> of bounds? Nothing really stands out in
> > > >> https://webassembly.github.io/spec/core/index.html, but it's the first
> > > >> time I try to have a look at the WebAssembly spec.
> > > >>
> > > >> Regards
> > > >>
> > > >> Antoine.
> > > >>
> > > >>
> > > >>>
> > > >>> Step 2 is necessary to read the buffers from FFI and output the result back
> > > >>> from the interpreter once the UDF is done, similar to what we do in
> > > >>> datafusion to run Python from Rust. In the case of datafusion the "binary"
> > > >>> is a Python function, which has security implications since the Python
> > > >>> interpreter allows everything by default.
> > > >>>
> > > >>> Best,
> > > >>> Jorge
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou <[email protected]> wrote:
> > > >>>
> > > >>>>
> > > >>>> Le 25/04/2022 à 23:04, David Li a écrit :
> > > >>>>> The WebAssembly documentation has a rundown of the techniques used:
> > > >>>>> https://webassembly.org/docs/security/
> > > >>>>>
> > > >>>>> I think usually you would run WASM in-process, though we could indeed
> > > >>>>> also put it in a subprocess to further isolate things.
> > > >>>>
> > > >>>> Would WASM be able to interact in-process with non-WASM buffers safely?
> > > >>>> It's not obvious from reading the page above.
> > > >>>>
> > > >>>>
> > > >>>>>
> > > >>>>> It would be interesting to define the Flight "harness" protocol.
> > > >>>>> Handling heterogeneous arguments may require some evolution in Flight (e.g.
> > > >>>>> if the function is non scalar and arguments are of different length - we'd
> > > >>>>> need something like the ColumnBag proposal, so this might be a good reason
> > > >>>>> to revive that).
> > > >>>>>
> > > >>>>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> > > >>>>>> Le 25/04/2022 à 22:19, Wes McKinney a écrit :
> > > >>>>>>> I was going to reply to this e-mail thread on user@ but thought I
> > > >>>>>>> would start a new thread on dev@.
> > > >>>>>>>
> > > >>>>>>> Executing user-defined functions in memory, especially untrusted
> > > >>>>>>> functions, in general is unsafe. For "trusted" functions, having an
> > > >>>>>>> in-memory API for writing them in user languages is very useful. I
> > > >>>>>>> remember tinkering with adding UDFs in Impala with LLVM IR, which
> > > >>>>>>> would allow UDFs to have performance consistent with built-ins
> > > >>>>>>> (because built-in functions are all inlined into code-generated
> > > >>>>>>> expressions), but segfaults would bring down the server, so only
> > > >>>>>>> admins could be trusted to add new UDFs.
> > > >>>>>>>
> > > >>>>>>> However, I wonder if we should eventually define an "external UDF"
> > > >>>>>>> protocol and an example UDF "harness", using Flight to do RPC across
> > > >>>>>>> the process boundaries. So the idea is that an external local UDF
> > > >>>>>>> Flight execution service is spun up, and then data is sent to the UDF
> > > >>>>>>> in a DoExchange call.
> > > >>>>>>>
> > > >>>>>>> As Jacques pointed out in an interview [1], a compelling solution to
> > > >>>>>>> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> > > >>>>>>> functions to be run safely in-process.
> > > >>>>>>
> > > >>>>>> How does the sandboxing work in this case? Is it simply executing in a
> > > >>>>>> separate process with restricted capabilities, or are other mechanisms
> > > >>>>>> put in place?
> > > >>>>
> > > >>>
> > > >>
> > > >
> > >
>
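
The copy-in/copy-out cost and the per-access bounds checks discussed in the
quoted messages can be illustrated with a toy model. The sketch below is not
WASM and not an Arrow API: `LinearMemory`, `add_one_udf`, and `run_udf` are
hypothetical names, and a Python bytearray stands in for the sandbox's linear
memory. It only shows the shape of the trade-off, assuming the guest can
address nothing but its own memory region.

```python
import struct


class LinearMemory:
    """Toy stand-in for a sandbox's linear memory: a private, fixed-size byte region."""

    def __init__(self, size):
        self._buf = bytearray(size)

    def _check(self, offset, length):
        # Every access is validated; an out-of-range access fails ("traps")
        # instead of touching memory outside the region.
        if offset < 0 or length < 0 or offset + length > len(self._buf):
            raise IndexError("out-of-bounds access traps")

    def write(self, offset, data):
        self._check(offset, len(data))
        self._buf[offset:offset + len(data)] = data  # host -> guest copy

    def read(self, offset, length):
        self._check(offset, length)
        return bytes(self._buf[offset:offset + length])  # guest -> host copy

    def load_i32(self, offset):
        self._check(offset, 4)
        return struct.unpack_from("<i", self._buf, offset)[0]

    def store_i32(self, offset, value):
        self._check(offset, 4)
        struct.pack_into("<i", self._buf, offset, value)


def add_one_udf(mem, offset, n):
    """Hypothetical guest UDF: it only sees `mem`, never the host's buffers."""
    for i in range(n):
        addr = offset + 4 * i
        mem.store_i32(addr, mem.load_i32(addr) + 1)


def run_udf(host_values):
    """Host side: copy the int32 column in, run the guest, copy the result out."""
    n = len(host_values)
    mem = LinearMemory(4 * n)
    mem.write(0, struct.pack(f"<{n}i", *host_values))
    add_one_udf(mem, 0, n)
    return list(struct.unpack(f"<{n}i", mem.read(0, 4 * n)))
```

In this model the host pays for two copies per call (`write` and `read`), and
the guest pays for a bounds check on every load and store; those are the two
costs raised in the thread.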
