This is a very interesting topic and one that we care a lot about when using/thinking about Arrow compute.
I come from Python data analytics, where most of our users use Pandas/NumPy. This is also my first time learning about WASM, and my previous understanding of a "Python UDF in the Arrow C++ compute engine" is more like:

UDF written in NumPy API
-> Using Numba to compile UDF into LLVM IR
-> Execute LLVM IR within Arrow C++ engine on Arrow Arrays

Which in my understanding is similar to the UDFs in Impala with LLVM IR that Wes mentioned.

I wonder how WASM might change things. A couple of questions:
(1) What is the advantage of using WASM instead of something like LLVM IR?
(2) Do we envision using something like a NumPy API as the language for writing these UDFs, or something completely different? (Another DSL?)

Li

On Tue, Apr 26, 2022 at 11:04 AM Weston Pace <[email protected]> wrote:

> In addition to the memory copy it looks like WASM is going to bounds
> check all loads/stores. It does, at least, have some vectorized
> load/store operations so that can help amortize the cost. It appears
> you aren't going to get the same performance as native today using
> WASM but I'm guessing that is an active area of research and
> investment.
>
> On Tue, Apr 26, 2022 at 5:00 AM Jorge Cardoso Leitão
> <[email protected]> wrote:
> >
> > I need to correct myself here - it is currently not possible to pass
> > memory at zero cost between the engine and WASM interpreter. This is
> > related to your point about safety - WASM provides memory safety
> > guarantees because it controls the memory region that it can read
> > from and write to. Therefore, currently passing data from and into
> > WASM requires a memcopy.
> >
> > There is a proposal [1] to improve the situation, but currently would
> > incur a cost in the query engine, since we would need to memcopy the
> > regions around.
> >
> > I forgot that on my poc I passed the parquet file from js to WASM and
> > de-serialized it to arrow directly in wasm - so memory was already
> > being allocated from within WASM sandbox, not JS. Sorry for the
> > confusion.
> >
> > [1] https://github.com/WebAssembly/design/issues/1439
> >
> > Best,
> > Jorge
> >
> > On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou <[email protected]> wrote:
> >
> > > On 26/04/2022 at 16:30, Gavin Ray wrote:
> > > > Antoine, sandboxing comes into play from two places:
> > > >
> > > > 1) The WASM specification itself, which puts a bounds on the
> > > > types of behaviors possible
> > > > 2) The implementation of the WASM bytecode interpreter chosen,
> > > > like Jorge mentioned in the comment above
> > > >
> > > > The wasmtime docs have a pretty solid section covering the
> > > > sandboxing guarantees of WASM, and then the interpreter-specific
> > > > behavior/abilities of wasmtime FWIW:
> > > > https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
> > >
> > > This doesn't really answer my question, does it?
> > >
> > > > On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou <[email protected]> wrote:
> > > >
> > > >> On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
> > > >>>> Would WASM be able to interact in-process with non-WASM
> > > >>>> buffers safely?
> > > >>>
> > > >>> AFAIK yes. My understanding from playing with it in JS is that a
> > > >>> WASM-backed udf execution would be something like:
> > > >>>
> > > >>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> > > >>> 2. provide a small WASM-compiled middleware of the c data
> > > >>> interface that consumes (binary, c data interface pointers)
> > > >>> 3. ship a WASM interpreter as part of the query engine
> > > >>> 4. pass binary and c data interface pointers from the query
> > > >>> engine program to the interpreter with WASM-compiled middleware
> > > >>
> > > >> Ok, but the key word in my question was "safely". What mechanisms
> > > >> are in place such that the WASM user function will not access
> > > >> Arrow buffers out of bounds?
> > > >> Nothing really stands out in
> > > >> https://webassembly.github.io/spec/core/index.html, but it's the
> > > >> first time I try to have a look at the WebAssembly spec.
> > > >>
> > > >> Regards
> > > >>
> > > >> Antoine.
> > > >>
> > > >>>
> > > >>> Step 2 is necessary to read the buffers from FFI and output the
> > > >>> result back from the interpreter once the UDF is done, similar
> > > >>> to what we do in datafusion to run Python from Rust. In the case
> > > >>> of datafusion the "binary" is a Python function, which has
> > > >>> security implications since the Python interpreter allows
> > > >>> everything by default.
> > > >>>
> > > >>> Best,
> > > >>> Jorge
> > > >>>
> > > >>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou <[email protected]> wrote:
> > > >>>
> > > >>>> On 25/04/2022 at 23:04, David Li wrote:
> > > >>>>> The WebAssembly documentation has a rundown of the techniques
> > > >>>>> used: https://webassembly.org/docs/security/
> > > >>>>>
> > > >>>>> I think usually you would run WASM in-process, though we could
> > > >>>>> indeed also put it in a subprocess to further isolate things.
> > > >>>>
> > > >>>> Would WASM be able to interact in-process with non-WASM buffers
> > > >>>> safely? It's not obvious from reading the page above.
> > > >>>>
> > > >>>>> It would be interesting to define the Flight "harness"
> > > >>>>> protocol. Handling heterogeneous arguments may require some
> > > >>>>> evolution in Flight (e.g. if the function is non scalar and
> > > >>>>> arguments are of different length - we'd need something like
> > > >>>>> the ColumnBag proposal, so this might be a good reason to
> > > >>>>> revive that).
> > > >>>>>
> > > >>>>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> > > >>>>>> On 25/04/2022 at 22:19, Wes McKinney wrote:
> > > >>>>>>> I was going to reply to this e-mail thread on user@ but
> > > >>>>>>> thought I would start a new thread on dev@.
> > > >>>>>>>
> > > >>>>>>> Executing user-defined functions in memory, especially
> > > >>>>>>> untrusted functions, in general is unsafe. For "trusted"
> > > >>>>>>> functions, having an in-memory API for writing them in user
> > > >>>>>>> languages is very useful. I remember tinkering with adding
> > > >>>>>>> UDFs in Impala with LLVM IR, which would allow UDFs to have
> > > >>>>>>> performance consistent with built-ins (because built-in
> > > >>>>>>> functions are all inlined into code-generated expressions),
> > > >>>>>>> but segfaults would bring down the server, so only admins
> > > >>>>>>> could be trusted to add new UDFs.
> > > >>>>>>>
> > > >>>>>>> However, I wonder if we should eventually define an
> > > >>>>>>> "external UDF" protocol and an example UDF "harness", using
> > > >>>>>>> Flight to do RPC across the process boundaries. So the idea
> > > >>>>>>> is that an external local UDF Flight execution service is
> > > >>>>>>> spun up, and then data is sent to the UDF in a DoExchange
> > > >>>>>>> call.
> > > >>>>>>>
> > > >>>>>>> As Jacques pointed out in an interview [1], a compelling
> > > >>>>>>> solution to the UDF sandboxing problem is WASM. This allows
> > > >>>>>>> "untrusted" WASM functions to be run safely in-process.
> > > >>>>>>
> > > >>>>>> How does the sandboxing work in this case? Is it simply
> > > >>>>>> executing in a separate process with restricted capabilities,
> > > >>>>>> or are other mechanisms put in place?
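
A toy model may make the copy-in / bounds-check / copy-out cost discussed in the quoted messages above more concrete. This is plain Python, not WASM: `LinearMemory`, `load_i32`, `store_i32`, `add_one`, and `run_udf` are hypothetical names invented for this sketch, and a real runtime such as wasmtime enforces its bounds in its own way. It only illustrates the idea that the guest sees one private memory region, that the host has to copy data into and out of it, and that every access is checked against that region.

```python
import struct
from typing import List


class LinearMemory:
    """Toy stand-in for a WASM linear memory: one private, fixed-size region."""

    def __init__(self, size: int) -> None:
        self._bytes = bytearray(size)

    def _check(self, offset: int, length: int) -> None:
        # Every access is validated against the region before it happens.
        if offset < 0 or offset + length > len(self._bytes):
            raise IndexError(f"out-of-bounds access at {offset} (+{length})")

    def write(self, offset: int, data: bytes) -> None:
        # Copy-in: host bytes are duplicated into the sandbox, not shared.
        self._check(offset, len(data))
        self._bytes[offset:offset + len(data)] = data

    def read(self, offset: int, length: int) -> bytes:
        # Copy-out: the host receives a fresh bytes object.
        self._check(offset, length)
        return bytes(self._bytes[offset:offset + length])

    def load_i32(self, offset: int) -> int:
        self._check(offset, 4)
        return struct.unpack_from("<i", self._bytes, offset)[0]

    def store_i32(self, offset: int, value: int) -> None:
        self._check(offset, 4)
        struct.pack_into("<i", self._bytes, offset, value)


def add_one(mem: LinearMemory, offset: int, count: int) -> None:
    """Hypothetical 'guest' UDF: it can only touch memory through `mem`."""
    for i in range(count):
        addr = offset + 4 * i
        mem.store_i32(addr, mem.load_i32(addr) + 1)


def run_udf(values: List[int]) -> List[int]:
    """Host side: copy the column in, run the guest, copy the result out."""
    host_buffer = struct.pack(f"<{len(values)}i", *values)
    mem = LinearMemory(len(host_buffer))
    mem.write(0, host_buffer)                # memcopy into the sandbox
    add_one(mem, 0, len(values))
    result = mem.read(0, len(host_buffer))   # memcopy back out
    return list(struct.unpack(f"<{len(values)}i", result))
```

Calling `run_udf([1, 2, 3])` returns `[2, 3, 4]`, while asking a `LinearMemory(8)` for `load_i32(6)` raises `IndexError` rather than reading past the region. The two copies in `run_udf` are the overhead Jorge describes, and the `_check` call on each load and store is the per-access cost Weston points out.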
