Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

Yue Ni Tue, 26 Apr 2022 18:24:00 -0700

This is a very interesting topic. I wonder: if we had a UDF mechanism in
Arrow compute, is there any chance Gandiva's UDFs could be integrated with
Arrow compute's UDF function registry? [1]
From an external user's perspective, Gandiva is part of the Arrow project,
so having two UDF registries that are not interoperable seems a bit of a
waste. If Arrow compute had the option to make Gandiva UDFs accessible, it
would be great for users. As far as I know, Gandiva's precompiled UDFs use
LLVM IR.


[1] https://www.dremio.com/blog/adding-a-user-define-function-to-gandiva/

On Wed, Apr 27, 2022 at 3:37 AM Antoine Pitrou <[email protected]> wrote:

>
> Also, this may sound counter-intuitive, but LLVM IR is actually
> architecture-specific because it is tied to various parameters of the
> architecture such as type widths and alignments.
>
>
> On 26/04/2022 at 19:51, Sasha Krassovsky wrote:
> > I think I can help answer these:
> > 1) LLVM IR is an intermediate representation for compilers, WASM is an
> open standard for sandboxed computation. They fulfill different but
> complementary roles. If the query engine were handed LLVM IR, it would
> still have to JIT the IR to wasm in order to maintain the sandboxing
> guarantees. It would also tie the query engine to LLVM, whereas there may
> be other wasm generators out there.
> >
> > 2) The idea would be for the user to use some external tool or compiler
> that generates wasm, and pass the wasm to the query engine. This would mean
> that you could write a UDF in any language of your choosing. It seems like
> it wouldn’t be much work to use your existing numpy + numba pipeline as
> well, you would just have to add a step to generate wasm from your LLVM IR
> before passing it to the engine.
> >
> > Sasha
> >
> >> On 26 Apr 2022, at 10:39, Li Jin <[email protected]> wrote:
> >>
> >> This is a very interesting topic and one that we care a lot about when
> >> using/thinking about Arrow compute.
> >>
> >> I come from Python data analytics where most of our users use
> Pandas/Numpy.
> >> This is also my first time learning about WASM and my previous
> >> understanding of "Python UDF in Arrow C++ compute" engine is more of:
> >>
> >> UDF written in NumPy API -> Using Numba to compile UDF into LLVM IR ->
> >> Execute LLVM IR within Arrow C++ engine on Arrow Arrays
> >>
> >> Which in my understanding is similar to UDFs in Impala with LLVM IR that
> >> Wes mentioned.
> >>
> >> I wonder how WASM would change things. A couple of questions:
> >> (1) What is the advantage of using WASM instead of something like LLVM IR?
> >> (2) Do we envision using something like a NumPy API as the language for
> >> writing these UDFs, or something completely different? (Another DSL?)
> >>
> >> Li
> >>
> >>> On Tue, Apr 26, 2022 at 11:04 AM Weston Pace <[email protected]>
> wrote:
> >>>
> >>> In addition to the memory copy it looks like WASM is going to bounds
> >>> check all loads/stores.  It does, at least, have some vectorized
> >>> load/store operations so that can help amortize the cost.  It appears
> >>> you aren't going to get the same performance as native today using
> >>> WASM but I'm guessing that is an active area of research and
> >>> investment.
> >>>
> >>>> On Tue, Apr 26, 2022 at 5:00 AM Jorge Cardoso Leitão
> >>>> <[email protected]> wrote:
> >>>>
> >>>> I need to correct myself here - it is currently not possible to pass
> >>> memory
> >>>> at zero cost between the engine and WASM interpreter. This is related
> to
> >>>> your point about safety - WASM provides memory safety guarantees
> because
> >>> it
> >>>> controls the memory region that it can read from and write to.
> Therefore,
> >>>> currently passing data from and into WASM requires a memcopy.
> >>>>
> >>>> There is a proposal [1] to improve the situation, but currently would
> >>> incur
> >>>> a cost in the query engine, since we would need to memcopy the regions
> >>>> around.
> >>>>
> >>>> I forgot that in my PoC I passed the parquet file from JS to WASM and
> >>>> de-serialized it to arrow directly in wasm - so memory was already
> being
> >>>> allocated from within WASM sandbox, not JS. Sorry for the confusion.
> >>>>
> >>>> [1] https://github.com/WebAssembly/design/issues/1439
> >>>>
> >>>> Best,
> >>>> Jorge
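
The memcopy requirement described above can be illustrated with a toy model. This is only a sketch in plain Python, not a real WASM runtime: `LinearMemory` and `run_sandboxed_double` are hypothetical names, and the "guest" loop stands in for compiled UDF code that can only see its own buffer.

```python
import struct


class LinearMemory:
    """Toy stand-in for a WASM linear memory: one private, bounds-checked buffer."""

    def __init__(self, size):
        self._buf = bytearray(size)

    def _check(self, offset, length):
        # Every access is validated; an invalid one raises (a "trap")
        # instead of touching memory outside the sandbox.
        if offset < 0 or length < 0 or offset + length > len(self._buf):
            raise IndexError("out-of-bounds access")

    def write(self, offset, data):
        # Host -> sandbox: bytes are copied into the private buffer.
        self._check(offset, len(data))
        self._buf[offset:offset + len(data)] = data

    def read(self, offset, length):
        # Sandbox -> host: a copy of the bytes is returned.
        self._check(offset, length)
        return bytes(self._buf[offset:offset + length])


def run_sandboxed_double(host_values, memory_size=1024):
    """Copy int32 values in, double them inside the sandbox, copy the result out."""
    n = len(host_values)
    mem = LinearMemory(memory_size)
    mem.write(0, struct.pack(f"<{n}i", *host_values))  # copy in

    # "Guest" code: it only ever sees `mem`, never the host's list.
    for i in range(n):
        (x,) = struct.unpack("<i", mem.read(4 * i, 4))
        mem.write(4 * i, struct.pack("<i", x * 2))

    return list(struct.unpack(f"<{n}i", mem.read(0, 4 * n)))  # copy out
```

The two copies at the boundary are the cost being discussed: the guest cannot be handed a pointer into the host's Arrow buffers without giving up the isolation that the bounds check provides.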
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou <[email protected]>
> >>> wrote:
> >>>>
> >>>>>
> >>>>> On 26/04/2022 at 16:30, Gavin Ray wrote:
> >>>>>> Antoine, sandboxing comes into play from two places:
> >>>>>>
> >>>>>> 1) The WASM specification itself, which puts bounds on the types
> of
> >>>>>> behaviors possible
> >>>>>> 2) The implementation of the WASM bytecode interpreter chosen, like
> >>> Jorge
> >>>>>> mentioned in the comment above
> >>>>>>
> >>>>>> The wasmtime docs have a pretty solid section covering the
> sandboxing
> >>>>>> guarantees of WASM, and then the interpreter-specific
> >>> behavior/abilities
> >>>>> of
> >>>>>> wasmtime FWIW:
> >>>>>> https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
> >>>>>
> >>>>> This doesn't really answer my question, does it?
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou <[email protected]
> >
> >>>>> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
> >>>>>>>>> Would WASM be able to interact in-process with non-WASM buffers
> >>>>> safely?
> >>>>>>>>
> >>>>>>>> AFAIK yes. My understanding from playing with it in JS is that a
> >>>>>>>> WASM-backed udf execution would be something like:
> >>>>>>>>
> >>>>>>>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> >>>>>>>> 2. provide a small WASM-compiled middleware of the c data
> interface
> >>>>> that
> >>>>>>>> consumes (binary, c data interface pointers)
> >>>>>>>> 3. ship a WASM interpreter as part of the query engine
> >>>>>>>> 4. pass binary and c data interface pointers from the query engine
> >>>>>>> program
> >>>>>>>> to the interpreter with WASM-compiled middleware
> >>>>>>>
> >>>>>>> Ok, but the key word in my question was "safely". What mechanisms
> >>> are in
> >>>>>>> place such that the WASM user function will not access Arrow
> >>> buffers out
> >>>>>>> of bounds? Nothing really stands out in
> >>>>>>> https://webassembly.github.io/spec/core/index.html, but it's the
> >>> first
> >>>>>>> time I try to have a look at the WebAssembly spec.
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Step 2 is necessary to read the buffers from FFI and output the
> >>> result
> >>>>>>> back
> >>>>>>>> from the interpreter once the UDF is done, similar to what we do
> in
> >>>>>>>> datafusion to run Python from Rust. In the case of datafusion the
> >>>>>>> "binary"
> >>>>>>>> is a Python function, which has security implications since the
> >>> Python
> >>>>>>>> interpreter allows everything by default.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jorge
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou <
> [email protected]
> >>>>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 25/04/2022 at 23:04, David Li wrote:
> >>>>>>>>>> The WebAssembly documentation has a rundown of the techniques
> >>> used:
> >>>>>>>>> https://webassembly.org/docs/security/
> >>>>>>>>>>
> >>>>>>>>>> I think usually you would run WASM in-process, though we could
> >>> indeed
> >>>>>>>>> also put it in a subprocess to further isolate things.
> >>>>>>>>>
> >>>>>>>>> Would WASM be able to interact in-process with non-WASM buffers
> >>>>> safely?
> >>>>>>>>> It's not obvious from reading the page above.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> It would be interesting to define the Flight "harness" protocol.
> >>>>>>>>> Handling heterogeneous arguments may require some evolution in
> >>> Flight
> >>>>>>> (e.g.
> >>>>>>>>> if the function is non scalar and arguments are of different
> >>> length -
> >>>>>>> we'd
> >>>>>>>>> need something like the ColumnBag proposal, so this might be a
> >>> good
> >>>>>>> reason
> >>>>>>>>> to revive that).
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> >>>>>>>>>>> On 25/04/2022 at 22:19, Wes McKinney wrote:
> >>>>>>>>>>>> I was going to reply to this e-mail thread on user@ but
> >>> thought I
> >>>>>>>>>>>> would start a new thread on dev@.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Executing user-defined functions in memory, especially
> >>> untrusted
> >>>>>>>>>>>> functions, in general is unsafe. For "trusted" functions,
> >>> having an
> >>>>>>>>>>>> in-memory API for writing them in user languages is very
> >>> useful. I
> >>>>>>>>>>>> remember tinkering with adding UDFs in Impala with LLVM IR,
> >>> which
> >>>>>>>>>>>> would allow UDFs to have performance consistent with built-ins
> >>>>>>>>>>>> (because built-in functions are all inlined into
> code-generated
> >>>>>>>>>>>> expressions), but segfaults would bring down the server, so
> >>> only
> >>>>>>>>>>>> admins could be trusted to add new UDFs.
> >>>>>>>>>>>>
> >>>>>>>>>>>> However, I wonder if we should eventually define an "external
> >>> UDF"
> >>>>>>>>>>>> protocol and an example UDF "harness", using Flight to do RPC
> >>>>> across
> >>>>>>>>>>>> the process boundaries. So the idea is that an external local
> >>> UDF
> >>>>>>>>>>>> Flight execution service is spun up, and then data is sent to
> >>> the
> >>>>> UDF
> >>>>>>>>>>>> in a DoExchange call.
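
The out-of-process idea above can be sketched without Flight. The example below is an assumption-laden stand-in, not the proposed protocol: it swaps Flight's DoExchange for JSON over a child process's stdin/stdout, and `UDF_WORKER` and `call_external_udf` are hypothetical names. It only shows the isolation property, namely that a failing UDF surfaces as an error in the caller rather than crashing it.

```python
import json
import subprocess
import sys

# Hypothetical UDF, run in a separate interpreter. It reads one JSON batch
# from stdin, applies the function, and writes the result to stdout.
UDF_WORKER = """
import json, sys
batch = json.load(sys.stdin)
json.dump([x * 2 for x in batch["values"]], sys.stdout)
"""


def call_external_udf(values, worker_source=UDF_WORKER, timeout=10.0):
    """Send one batch to a child process and return its result.

    Raises RuntimeError if the child exits abnormally, so a crashing UDF
    is reported to the caller instead of taking the caller down with it.
    """
    proc = subprocess.run(
        [sys.executable, "-c", worker_source],
        input=json.dumps({"values": values}),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"UDF process failed with code {proc.returncode}")
    return json.loads(proc.stdout)
```

A real harness would keep the service running across calls and exchange Arrow record batches rather than JSON; starting a process per batch is used here only to keep the sketch short.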
> >>>>>>>>>>>>
> >>>>>>>>>>>> As Jacques pointed out in an interview [1], a compelling
> >>> solution to
> >>>>>>>>>>>> the UDF sandboxing problem is WASM. This allows "untrusted"
> >>> WASM
> >>>>>>>>>>>> functions to be run safely in-process.
> >>>>>>>>>>>
> >>>>>>>>>>> How does the sandboxing work in this case? Is it simply
> >>> executing
> >>>>> in a
> >>>>>>>>>>> separate process with restricted capabilities, or are other
> >>>>> mechanisms
> >>>>>>>>>>> put in place?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
>
