I agree with Luke. Targeting the little helper UDFs that go along with IOs, such as timestamp extractors that have to parse particular data formats, addresses what is actually a major feature gap for xlang. This could be a very useful place to try out the design options. I think we can simplify the problem by insisting that they are pure functions that do not access state or side inputs.
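To make the "pure function" restriction concrete, here is a minimal sketch of the kind of helper I have in mind. The function name and the record layout (an ISO-8601 timestamp as the first comma-separated field) are assumptions for illustration only; this is not an existing Beam API.

```python
from datetime import datetime, timezone

# Hypothetical helper UDF: the name and record layout are illustrative only.
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def extract_timestamp_micros(record: bytes) -> int:
    """Return the event time of a record as microseconds since the Unix epoch.

    Pure function: the result depends only on the input bytes. It reads no
    state, no side inputs, no clock, and performs no I/O.
    """
    # Assumed layout: "<ISO-8601 timestamp>,<payload...>"
    first_field = record.decode("utf-8").split(",", 1)[0]
    parsed = datetime.fromisoformat(first_field)
    if parsed.tzinfo is None:
        # Treat naive timestamps as UTC so the result is deterministic.
        parsed = parsed.replace(tzinfo=timezone.utc)
    delta = parsed - _EPOCH
    # Integer arithmetic avoids float rounding on large timestamps.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
```

Because a function like this only maps bytes to an integer, the host needs nothing beyond a byte buffer and a 64-bit integer to call it, which keeps the type-system question small for a first experiment.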

On Wed, Jul 13, 2022 at 7:52 PM Luke Cwik via dev <dev@beam.apache.org> wrote:

> I think an easier target would be to support things like DynamicDestinations for Java IO connectors that are exposed as XLang for Go/Python.
>
> This is because Go/Python have good transpiling support to WebAssembly and we have already exposed several Java IO XLang connectors, so it's about plumbing one more thing through for these IO connectors.
>
> What interface should we expect for UDFs / UDAFs, and should they be purpose-oriented or should we do something like we did for portability, where we have a graph of transforms that we feed arbitrary data in/out from? The latter would have the benefit of allowing the runner to embed the language execution directly within the runner and would pay the Wasm communication tax instead of the gRPC communication tax. If we do the former, we still have the same issues: we have to have a type system to pass information between the host system and the transpiled WebAssembly code that wraps the user's UDF/UDAF, and what if the UDF wants access to side inputs or user state ...
>
> On Wed, Jul 13, 2022 at 4:09 PM Chamikara Jayalath <chamik...@google.com> wrote:
>
>> On Wed, Jul 13, 2022 at 9:31 AM Luke Cwik <lc...@google.com> wrote:
>>
>>> First we'll want to choose whether we want to target Wasm, WASI or Wagi.
>>
>> These terms are defined here <https://www.fermyon.com/blog/wasm-wasi-wagi> if anybody is as confused as I am :)
>>
>>> WASI adds a lot of simple things like access to a clock, random number generator, ... that would expand the scope of what transpiled code can do. It is debatable whether we'll want the power to run the transpiled code as a microservice.
>>> Using UDFs for XLang and UDFs and UDAFs for SQL as our expected use cases seems to make WASI the best choice. The issue is in the details, as there is a hodgepodge of what language runtimes support and what the limits of transpiling from a language to WebAssembly are.
>>
>> Agree that WASI seems like a good target since it gives access to additional system resources/tooling.
>>
>>> Assuming WASI then it breaks down to these two aspects:
>>>
>>> 1) Does the host language have a runtime?
>>> Java: https://github.com/wasmerio/wasmer-java
>>> Python: https://github.com/wasmerio/wasmer-python
>>> Go: https://github.com/wasmerio/wasmer-go
>>>
>>> 2) How good is compilation from source language to WebAssembly <https://github.com/appcypher/awesome-wasm-langs>?
>>>
>>> Java (very limited):
>>> Issues with garbage collection and the need to transpile/replace much of the VM's capabilities, plus the large standard library that everyone uses, cause a lot of challenges.
>>> JWebAssembly can do simple things like basic classes, strings, method calls. Should be able to compile trivial lambdas to Wasm. There are other choices but to my knowledge all are very limited.
>>
>> That's unfortunate. But hopefully Java support will be implemented soon?
>>
>>> Python <https://pythondev.readthedocs.io/wasm.html> (quite good):
>>>
>>> Features                 | CPython Emscripten browser | CPython Emscripten node | Pyodide
>>> subprocess (fork, exec)  | no                         | no                      | no
>>> threads                  | no                         | YES                     | WIP
>>> file system              | no (only MEMFS)            | YES (Node raw FS)       | YES (IDB, Node, …)
>>> shared extension modules | WIP                        | WIP                     | YES
>>> PyPI packages            | no                         | no                      | YES
>>> sockets                  | ?                          | ?                       | ?
>>> urllib, asyncio          | no                         | no                      | WebAPI fetch / WebSocket
>>> signals                  | no                         | WIP                     | YES
>>>
>>> Go (excellent): Native support in go compiler
>>
>> Great. Could executing Go UDFs in Python x-lang transforms (for example, Dataframe, RunInference, Python Map) be a good first target?
>>
>> Thanks,
>> Cham
>>
>>> On Tue, Jul 12, 2022 at 5:51 PM Chamikara Jayalath via dev <dev@beam.apache.org> wrote:
>>>
>>>> On Wed, Jun 29, 2022 at 9:31 AM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>>> I have had interest in integrating Wasm within Beam, and I have had a lot of interest in improving language portability.
>>>>>
>>>>> Wasm has a lot of benefits over using docker containers to provide a place for code to execute. From experience working on Beam's portability layer and internal Flume knowledge:
>>>>> * encoding and decoding data is expensive; anything which ensures that in-memory representations of data are transferred from the host to the guest and back without transcoding/re-interpreting will be a big win.
>>>>> * reducing the number of times we need to pass data between guest and host and back is important
>>>>>   * fusing transforms reduces the number of data passing points
>>>>>   * batching (row or columnar) data reduces the number of times we need to pass data at each data passing point
>>>>> * there are enough complicated use cases (state & timers, large iterables, side inputs) where handling the trivial map/flatmap use case will provide little value since it will prevent fusion
>>>>>
>>>>> I have been meaning to work on a prototype where we replace the current gRPC + docker path with one in which we use Wasm to execute a fused graph, re-using large parts of the existing code base written to support portability.
>>>>
>>>> This sounds very interesting. Probably using Wasm to implement proper UDF support for x-lang (for example, executing Python timestamp/watermark functions provided through the Kafka Python x-lang wrapper on the Java Kafka transform) will be a good first target?
>>>> My main question for this at this point is whether Wasm has adequate support for existing SDKs that use x-lang to implement this in a useful way.
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>> On Fri, Jun 17, 2022 at 2:19 PM Brian Hulette <bhule...@google.com> wrote:
>>>>>
>>>>>> Re: Arrow - it's long been my dream to use Arrow for interchange in Beam [1]. I'm trying to move us in that direction with https://s.apache.org/batched-dofns (arrow is discussed briefly in the Future Work section). This gives the Python SDK a concept of batches of logical elements. My goal is Beam schemas + batches of logical elements -> Arrow RecordBatches.
>>>>>>
>>>>>> The Batched DoFn infrastructure is stable as of the 2.40.0 release cut and I'm currently working on adding what I'm calling a "BatchConverter" [2] for Beam Rows -> Arrow RecordBatch. Once that's done it could be interesting to experiment with a "WasmDoFn" that uses Arrow for interchange.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> [1] https://docs.google.com/presentation/d/1D9vigwYTCuAuz_CO8nex3GK3h873acmQJE5Ui8TFsDY/edit#slide=id.g608e662464_0_160
>>>>>> [2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/batch.py
>>>>>>
>>>>>> On Thu, Jun 16, 2022 at 10:55 AM Sean Jensen-Grey <jenseng...@google.com> wrote:
>>>>>>
>>>>>>> Interesting.
>>>>>>>
>>>>>>> Robert, I was just served an ad for Redpanda when I searched for "golang wasm" :)
>>>>>>>
>>>>>>> The storage and execution grid systems are all embracing wasm in some way.
>>>>>>>
>>>>>>> https://redpanda.com/
>>>>>>> https://www.fluvio.io/
>>>>>>> https://temporal.io/ (Cadence fork by the Cadence folks, I met Maxim the lead at Temporal at the 2020 Wasm Summit)
>>>>>>> https://github.com/pachyderm/pachyderm no mention of wasm, yet.
>>>>>>>
>>>>>>> Keep the Wasm+Beam demos coming.
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>> On Thu, Jun 16, 2022 at 4:23 AM Steven van Rossum <sjvanros...@google.com> wrote:
>>>>>>>
>>>>>>>> I caught up with all the replies through the web interface, but I didn't have my list subscription set up correctly so my reply (TL;DR sample code available at https://github.com/sjvanrossum/beam-wasm) didn't come through until a bit later yesterday I think.
>>>>>>>>
>>>>>>>> Sean, I agree with your suggestion of Arrow as the interchange format for Wasm transforms and it's something I thought about exploring when I was adding serialization/deserialization of complex (meaning anything that's not an integer or float in the context of Wasm) data types in the demo. It's an unfortunate bit of overhead which could very well be solved with Arrow and shared memory between Wasm modules.
>>>>>>>> I've seen Wasm transforms pop up in a few other places, notably in streaming data platforms like Fluvio and Redpanda and they seem to incur the same overhead when moving data into and out of the guest context so maybe it's negligible, but I haven't done any serious benchmark yet to validate that.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Steve
>>>>>>>>
>>>>>>>> On Thu, Jun 16, 2022 at 3:04 AM Robert Burke <rob...@frantil.com> wrote:
>>>>>>>>
>>>>>>>>> Obligatory mention that WASM is basically an architecture that any well meaning compiler can target, eg the Go compiler
>>>>>>>>>
>>>>>>>>> https://www.bradcypert.com/an-introduction-to-targeting-web-assembly-with-golang/
>>>>>>>>>
>>>>>>>>> (Among many articles for the last few years)
>>>>>>>>>
>>>>>>>>> Robert Burke
>>>>>>>>> Beam Go Busybody
>>>>>>>>>
>>>>>>>>> On Wed, Jun 15, 2022, 2:04 PM Sean Jensen-Grey <jenseng...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Heh, my stage fright was so strong, I didn't realize that the talk was recorded. :)
>>>>>>>>>>
>>>>>>>>>> Steven, I'd love to chat about Wasm in Beam. This email is a bit rough.
>>>>>>>>>>
>>>>>>>>>> I haven't explored Wasm in Beam much since that talk. I think the most compelling use is in the portability of logic between data processing systems. Esp in the use of probabilistic data structures like Bloom Filters, Count-Min-Sketch, HyperLogLog, where it is nice to persist the data structure and use it on a different system. Like generating a bloom filter in Beam and using it inside of a BQ query w/o having to reimplement and test across many platforms.
>>>>>>>>>>
>>>>>>>>>> I have used Wasm in BQ, as BQ UDFs are driven by V8. Anywhere V8 exists, Wasm support exists for free unless the embedder goes out of their way to disable it. So it is supported in Deno/Node as well. In Python, Wasm support via Wasmtime <https://github.com/bytecodealliance/wasmtime> is really good.
>>>>>>>>>> There are *many* options for execution environments; one of the downsides of passing through a JS one is the string and number support (float/int64) issues, afaik. I could be wrong, maybe JS has fixed all this by now.
>>>>>>>>>>
>>>>>>>>>> The qualities in order of importance (for me) are
>>>>>>>>>>
>>>>>>>>>> 1. Portability, run the same code everywhere
>>>>>>>>>> 2. Security, memory safety for the caller. Running Wasm inside of Python should never crash your Python interpreter. The capability model ensures that the Wasm module can only do what you allow it to
>>>>>>>>>> 3. Performance (portable), compile once and run everywhere within some margin of native. Python makes this look good :)
>>>>>>>>>>
>>>>>>>>>> I think something worth exploring is moving opaque-ish Arrow objects around via Beam, so that Beam is now mostly in the control plane and computation happens in Wasm; this should reduce the serialization overhead and also get Python out of the datapath.
>>>>>>>>>>
>>>>>>>>>> I see someone exploring Wasm+Arrow here, https://github.com/domoritz/arrow-wasm
>>>>>>>>>>
>>>>>>>>>> Another possibly interesting avenue to explore is compiling command line programs to WASI (WebAssembly System Interface), the POSIX-like shim, so that they can be run in-process without the fork/exec/pipe overhead of running a subprocess. A neat demo might be running something like Jq <https://stedolan.github.io/jq/> inside of a Beam job.
>>>>>>>>>>
>>>>>>>>>> Not to make Wasm sound like a Python only technology, it can be used via Java/JVM via
>>>>>>>>>>
>>>>>>>>>> - https://www.graalvm.org/22.1/reference-manual/wasm/
>>>>>>>>>> - https://github.com/kawamuray/wasmtime-java
>>>>>>>>>>
>>>>>>>>>> Sean
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 15, 2022 at 9:35 AM Pablo Estrada <pabl...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> adding Steven in case he didn't get the replies : )
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 15, 2022 at 9:29 AM Daniel Collins <dpcoll...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If we ever do anything with the JS runtime, this would seem to be the best place to run WASM.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jun 14, 2022 at 8:13 PM Brian Hulette <bhule...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> FYI: @Sean Jensen-Grey <jenseng...@google.com> gave a talk back in 2020 where he had integrated Rust with the Python SDK. I thought he used WebAssembly for that, but it looks like he used some other approaches, and his talk mentioned WebAssembly as future work. Not sure if that was ever explored.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://www.youtube.com/watch?v=fZK_Tiu7q1o
>>>>>>>>>>>>> https://github.com/seanjensengrey/beam-rust-python-java
>>>>>>>>>>>>>
>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 5:05 PM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Adding @Lukasz Cwik <lc...@google.com> - he was interested in the WebAssembly topic.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 3:09 PM Pablo Estrada <pabl...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Would you open a pull request for it? Or at least share a branch? : )
>>>>>>>>>>>>>>> Even if we don't want to merge it, it would be great to have a PR as a way to showcase the work, its usefulness, and receive comments on this thread once we can see something more specific.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 3:05 PM Steven van Rossum <sjvanros...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I had some spare time yesterday and thought it'd be fun to implement a transform which runs WebAssembly modules as a lightweight way to implement cross language transforms for languages which don't (yet) have a SDK implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've got a small proof of concept running in the Python SDK as a DoFn with Wasmer as the WebAssembly runtime and simple support for marshalling between the host and guest environment with the RowCoder. The module I've constructed is mostly useless, but demonstrates the host copying the encoded element into the guest's memory, the guest copying those bytes elsewhere in its linear memory buffer, the guest calling back to the host with the offset and size and the host copying and decoding from the guest's memory.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any thoughts/interest? I'm not sure where I was going with this, since it was mostly just a "wouldn't it be cool if..." on a Monday afternoon, but I can see a few use cases for this.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Steve
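
The host/guest exchange described in the original proof-of-concept message (host copies the encoded element into guest memory, guest copies the bytes elsewhere in its linear memory and calls back with an offset and size, host copies and decodes) can be sketched without a real Wasm runtime. Everything below is a simulation under stated assumptions: `GuestMemory` is a plain `bytearray` standing in for a module's linear memory, the length-prefixed encoding is a toy stand-in for RowCoder, and `guest_process` is ordinary Python playing the role of the compiled guest. None of these names come from Wasmer, Beam, or the linked demo repository.

```python
import struct
from typing import Callable, List, Tuple

Element = Tuple[str, int]


class GuestMemory:
    """Hypothetical stand-in for a Wasm module's linear memory (bump allocator)."""

    def __init__(self, size: int = 65536) -> None:
        self.buf = bytearray(size)
        self.next_free = 0

    def alloc(self, size: int) -> int:
        offset = self.next_free
        if offset + size > len(self.buf):
            raise MemoryError("guest memory exhausted")
        self.next_free += size
        return offset


def encode_element(element: Element) -> bytes:
    # Toy encoding (assumption, not RowCoder): u32 length, UTF-8 name, i64 count.
    name, count = element
    raw = name.encode("utf-8")
    return struct.pack("<I", len(raw)) + raw + struct.pack("<q", count)


def decode_element(data: bytes) -> Element:
    (length,) = struct.unpack_from("<I", data, 0)
    name = data[4:4 + length].decode("utf-8")
    (count,) = struct.unpack_from("<q", data, 4 + length)
    return name, count


def guest_process(mem: GuestMemory, offset: int, size: int,
                  emit: Callable[[int, int], None]) -> None:
    # Plays the guest: copy the input bytes elsewhere in linear memory,
    # then call back to the host with the new offset and size.
    out = mem.alloc(size)
    mem.buf[out:out + size] = mem.buf[offset:offset + size]
    emit(out, size)


def host_call(mem: GuestMemory, element: Element) -> List[Element]:
    # Host side: encode, copy into guest memory, invoke, decode what comes back.
    encoded = encode_element(element)
    in_offset = mem.alloc(len(encoded))
    mem.buf[in_offset:in_offset + len(encoded)] = encoded

    results: List[Element] = []

    def emit(offset: int, size: int) -> None:
        # Host callback: copy out of guest memory and decode.
        results.append(decode_element(bytes(mem.buf[offset:offset + size])))

    guest_process(mem, in_offset, len(encoded), emit)
    return results
```

Even in this toy form, each element is copied on the way in, again inside the guest, and once more on the way out, which is the overhead the thread hopes Arrow and batching could reduce.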