This is a very interesting topic. I wonder: if we have a UDF mechanism in Arrow compute, is there any chance Gandiva's UDFs could be integrated with Arrow compute's UDF function registry? [1]

From an external user's perspective, Gandiva is part of the Arrow project, so having two UDF registries that are not interoperable seems a bit of a waste. If Arrow compute had the option to make Gandiva UDFs accessible, it would be great for users. As far as I know, Gandiva's precompiled UDFs use LLVM IR.
[1] https://www.dremio.com/blog/adding-a-user-define-function-to-gandiva/

On Wed, Apr 27, 2022 at 3:37 AM Antoine Pitrou <[email protected]> wrote:
>
> Also, this may sound counter-intuitive, but LLVM IR is actually
> architecture-specific because it is tied to various parameters of the
> architecture such as type widths and alignments.
>
>
> On 26/04/2022 at 19:51, Sasha Krassovsky wrote:
> > I think I can help answer these:
> > 1) LLVM IR is an intermediate representation for compilers, WASM is an
> > open standard for sandboxed computation. They fulfill different but
> > complementary roles. If the query engine were handed LLVM IR, it would
> > still have to JIT the IR to wasm in order to maintain the sandboxing
> > guarantees. It would also tie the query engine to LLVM, whereas there may
> > be other wasm generators out there.
> >
> > 2) The idea would be for the user to use some external tool or compiler
> > that generates wasm, and pass the wasm to the query engine. This would mean
> > that you could write a UDF in any language of your choosing. It seems like
> > it wouldn’t be much work to use your existing numpy + numba pipeline as
> > well, you would just have to add a step to generate wasm from your LLVM IR
> > before passing it to the engine.
> >
> > Sasha
> >
> >> On Apr 26, 2022, at 10:39, Li Jin <[email protected]> wrote:
> >>
> >> This is a very interesting topic and one that we care a lot about when
> >> using/thinking about Arrow compute.
> >>
> >> I come from Python data analytics where most of our users use
> >> Pandas/Numpy.
> >> This is also my first time learning about WASM and my previous
> >> understanding of "Python UDF in Arrow C++ compute" engine is more of:
> >>
> >> UDF written in NumPy API -> Using Numba to compile UDF into LLVM IR ->
> >> Execute LLVM IR within Arrow C++ engine on Arrow Arrays
> >>
> >> Which in my understanding is similar to UDFs in Impala with LLVM IR that
> >> Wes mentioned.
> >>
> >> I wonder how WASM potentially changes things.
> >> A couple of questions:
> >> (1) What is the advantage of using WASM instead of something like LLVM IR?
> >> (2) Do we envision using something like a NumPy API as the language that
> >> writes these UDFs or something completely different? (Another DSL?)
> >>
> >> Li
> >>
> >>> On Tue, Apr 26, 2022 at 11:04 AM Weston Pace <[email protected]> wrote:
> >>>
> >>> In addition to the memory copy it looks like WASM is going to bounds
> >>> check all loads/stores. It does, at least, have some vectorized
> >>> load/store operations so that can help amortize the cost. It appears
> >>> you aren't going to get the same performance as native today using
> >>> WASM but I'm guessing that is an active area of research and
> >>> investment.
> >>>
> >>>> On Tue, Apr 26, 2022 at 5:00 AM Jorge Cardoso Leitão
> >>>> <[email protected]> wrote:
> >>>>
> >>>> I need to correct myself here - it is currently not possible to pass
> >>>> memory at zero cost between the engine and WASM interpreter. This is
> >>>> related to your point about safety - WASM provides memory safety
> >>>> guarantees because it controls the memory region that it can read from
> >>>> and write to. Therefore, currently passing data from and into WASM
> >>>> requires a memcopy.
> >>>>
> >>>> There is a proposal [1] to improve the situation, but currently would
> >>>> incur a cost in the query engine, since we would need to memcopy the
> >>>> regions around.
> >>>>
> >>>> I forgot that on my poc I passed the parquet file from js to WASM and
> >>>> de-serialized it to arrow directly in wasm - so memory was already
> >>>> being allocated from within WASM sandbox, not JS. Sorry for the
> >>>> confusion.
> >>>>
> >>>> [1] https://github.com/WebAssembly/design/issues/1439
> >>>>
> >>>> Best,
> >>>> Jorge
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou <[email protected]> wrote:
> >>>>
> >>>>>
> >>>>> On 26/04/2022 at 16:30, Gavin Ray wrote:
> >>>>>> Antoine, sandboxing comes into play from two places:
> >>>>>>
> >>>>>> 1) The WASM specification itself, which puts bounds on the types
> >>>>>> of behaviors possible
> >>>>>> 2) The implementation of the WASM bytecode interpreter chosen, like
> >>>>>> Jorge mentioned in the comment above
> >>>>>>
> >>>>>> The wasmtime docs have a pretty solid section covering the
> >>>>>> sandboxing guarantees of WASM, and then the interpreter-specific
> >>>>>> behavior/abilities of wasmtime FWIW:
> >>>>>> https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
> >>>>>
> >>>>> This doesn't really answer my question, does it?
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou <[email protected]> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
> >>>>>>>>> Would WASM be able to interact in-process with non-WASM buffers
> >>>>>>>>> safely?
> >>>>>>>>
> >>>>>>>> AFAIK yes. My understanding from playing with it in JS is that a
> >>>>>>>> WASM-backed udf execution would be something like:
> >>>>>>>>
> >>>>>>>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> >>>>>>>> 2. provide a small WASM-compiled middleware of the c data
> >>>>>>>> interface that consumes (binary, c data interface pointers)
> >>>>>>>> 3. ship a WASM interpreter as part of the query engine
> >>>>>>>> 4. pass binary and c data interface pointers from the query engine
> >>>>>>>> program to the interpreter with WASM-compiled middleware
> >>>>>>>
> >>>>>>> Ok, but the key word in my question was "safely".
> >>>>>>> What mechanisms are in place such that the WASM user function will
> >>>>>>> not access Arrow buffers out of bounds? Nothing really stands out in
> >>>>>>> https://webassembly.github.io/spec/core/index.html, but it's the
> >>>>>>> first time I try to have a look at the WebAssembly spec.
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Step 2 is necessary to read the buffers from FFI and output the
> >>>>>>>> result back from the interpreter once the UDF is done, similar to
> >>>>>>>> what we do in datafusion to run Python from Rust. In the case of
> >>>>>>>> datafusion the "binary" is a Python function, which has security
> >>>>>>>> implications since the Python interpreter allows everything by
> >>>>>>>> default.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jorge
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 25/04/2022 at 23:04, David Li wrote:
> >>>>>>>>>> The WebAssembly documentation has a rundown of the techniques
> >>>>>>>>>> used: https://webassembly.org/docs/security/
> >>>>>>>>>>
> >>>>>>>>>> I think usually you would run WASM in-process, though we could
> >>>>>>>>>> indeed also put it in a subprocess to further isolate things.
> >>>>>>>>>
> >>>>>>>>> Would WASM be able to interact in-process with non-WASM buffers
> >>>>>>>>> safely? It's not obvious from reading the page above.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> It would be interesting to define the Flight "harness" protocol.
> >>>>>>>>>> Handling heterogeneous arguments may require some evolution in
> >>>>>>>>>> Flight (e.g.
> >>>>>>>>>> if the function is non scalar and arguments are of different
> >>>>>>>>>> length - we'd need something like the ColumnBag proposal, so this
> >>>>>>>>>> might be a good reason to revive that).
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> >>>>>>>>>>> On 25/04/2022 at 22:19, Wes McKinney wrote:
> >>>>>>>>>>>> I was going to reply to this e-mail thread on user@ but
> >>>>>>>>>>>> thought I would start a new thread on dev@.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Executing user-defined functions in memory, especially
> >>>>>>>>>>>> untrusted functions, in general is unsafe. For "trusted"
> >>>>>>>>>>>> functions, having an in-memory API for writing them in user
> >>>>>>>>>>>> languages is very useful. I remember tinkering with adding UDFs
> >>>>>>>>>>>> in Impala with LLVM IR, which would allow UDFs to have
> >>>>>>>>>>>> performance consistent with built-ins (because built-in
> >>>>>>>>>>>> functions are all inlined into code-generated expressions), but
> >>>>>>>>>>>> segfaults would bring down the server, so only admins could be
> >>>>>>>>>>>> trusted to add new UDFs.
> >>>>>>>>>>>>
> >>>>>>>>>>>> However, I wonder if we should eventually define an "external
> >>>>>>>>>>>> UDF" protocol and an example UDF "harness", using Flight to do
> >>>>>>>>>>>> RPC across the process boundaries. So the idea is that an
> >>>>>>>>>>>> external local UDF Flight execution service is spun up, and
> >>>>>>>>>>>> then data is sent to the UDF in a DoExchange call.
> >>>>>>>>>>>>
> >>>>>>>>>>>> As Jacques pointed out in an interview [1], a compelling
> >>>>>>>>>>>> solution to the UDF sandboxing problem is WASM. This allows
> >>>>>>>>>>>> "untrusted" WASM functions to be run safely in-process.
> >>>>>>>>>>>
> >>>>>>>>>>> How does the sandboxing work in this case?
> >>>>>>>>>>> Is it simply executing in a separate process with restricted
> >>>>>>>>>>> capabilities, or are other mechanisms put in place?
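
To make the sandboxing and memcopy points from the quoted messages concrete (the guest can only touch its own memory region, loads and stores are bounds-checked, and data has to be copied in and out), here is a toy model in plain Python. It is only a sketch, not a real WASM runtime and not how Arrow or any query engine implements this; `LinearMemory`, `Trap`, `run_udf`, `add_one` and `bad_udf` are hypothetical names made up for illustration.

```python
import struct


class Trap(Exception):
    """Raised when the guest touches memory outside its sandbox."""


class LinearMemory:
    """Toy stand-in for a WASM linear memory: one private, bounds-checked byte array."""

    def __init__(self, size):
        self._buf = bytearray(size)

    def _check(self, offset, length):
        # Every access is validated, mirroring the bounds checks on loads/stores.
        if offset < 0 or length < 0 or offset + length > len(self._buf):
            raise Trap(f"out-of-bounds access: offset={offset}, length={length}")

    def write(self, offset, data):
        self._check(offset, len(data))
        self._buf[offset:offset + len(data)] = data

    def read(self, offset, length):
        self._check(offset, length)
        return bytes(self._buf[offset:offset + length])


def run_udf(udf, values, memory_size=1024):
    """Host side: copy int32 values into the sandbox, run the guest, copy the result out."""
    mem = LinearMemory(memory_size)
    payload = struct.pack(f"<{len(values)}i", *values)
    mem.write(0, payload)            # memcopy into the sandbox
    udf(mem, 0, len(values))         # the guest only ever sees `mem`
    out = mem.read(0, len(payload))  # memcopy back out of the sandbox
    return list(struct.unpack(f"<{len(values)}i", out))


def add_one(mem, offset, n):
    """Well-behaved guest UDF: increments n little-endian int32 values in place."""
    for i in range(n):
        (v,) = struct.unpack("<i", mem.read(offset + 4 * i, 4))
        mem.write(offset + 4 * i, struct.pack("<i", v + 1))


def bad_udf(mem, offset, n):
    """Misbehaving guest UDF: tries to read far past the end of its memory."""
    mem.read(offset, 10_000)


print(run_udf(add_one, [1, 2, 3]))  # [2, 3, 4]
try:
    run_udf(bad_udf, [1])
except Trap as exc:
    print("guest trapped:", exc)
```

The two copies in `run_udf` correspond to the memcopy cost mentioned above, and the `Trap` corresponds to the guest being stopped instead of reading host buffers out of bounds.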
