+1, and this is exactly what I suggested as well. Python Dataframe, RunInference, and Python Map are already available to Java via x-lang [1], and all three need simple UDFs to customize their operation. Some logic still needs to be added to use Python transforms from the Go SDK. As you suggested, many Java x-lang transforms could use simple UDF support as well. Either language combination should work for a first proof of concept of WASI support while also addressing an existing limitation.
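
To make "simple UDF" concrete, here is a rough sketch of the kind of helper discussed in this thread: a timestamp extractor written as a pure function (bytes in, epoch milliseconds out, no state or side inputs). The function name, the JSON layout, and the `event_time` field are all made up for illustration; nothing here is an existing Beam API.

```python
import json
from datetime import datetime


def extract_timestamp_millis(record: bytes) -> int:
    """Hypothetical pure UDF: parse one record and return epoch milliseconds.

    It reads only its argument and touches no state, side inputs, or I/O,
    which is what would make it a candidate for compiling to Wasm/WASI.
    """
    event = json.loads(record)
    # Assumed field name and timestamp format; a real connector would define these.
    parsed = datetime.strptime(event["event_time"], "%Y-%m-%dT%H:%M:%S%z")
    return int(parsed.timestamp() * 1000)
```

Because the function depends only on its input, a host in any SDK could call it once per element (or per batch) without any state or side-input plumbing.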

Thanks,
Cham

[1] https://github.com/apache/beam/tree/master/sdks/java/extensions/python/src/main/java/org/apache/beam/sdk/extensions/python/transforms

On Wed, Jul 13, 2022 at 8:26 PM Kenneth Knowles <k...@apache.org> wrote:

> I agree with Luke. Little helper UDFs that go along with IOs, like timestamp extractors that have to parse particular data formats, are actually a major feature gap for xlang. This could be a very useful place to try out the design options. I think we can simplify the problem by insisting that they are pure functions that do not access state or side inputs.
>
> On Wed, Jul 13, 2022 at 7:52 PM Luke Cwik via dev <dev@beam.apache.org> wrote:
>
>> I think an easier target would be to support things like DynamicDestinations for Java IO connectors that are exposed as XLang for Go/Python.
>>
>> This is because Go/Python have good transpiling support to WebAssembly and we have already exposed several Java IO XLang connectors, so it's about plumbing one more thing through for these IO connectors.
>>
>> What interface should we expect for UDFs / UDAFs, and should they be purpose oriented or should we do something like we did for portability, where we have a graph of transforms that we feed arbitrary data in/out from? The latter would have the benefit of allowing the runner to embed the language execution directly within the runner and would pay the Wasm communication tax instead of the gRPC communication tax. If we do the former we still have the same issues: we need a type system to pass information between the host system and the transpiled WebAssembly code that wraps the user's UDF/UDAF, and what if the UDF wants access to side inputs or user state ...
>>
>> On Wed, Jul 13, 2022 at 4:09 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>
>>> On Wed, Jul 13, 2022 at 9:31 AM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> First we'll want to choose whether we want to target Wasm, WASI or Wagi.
>>>
>>> These terms are defined here <https://www.fermyon.com/blog/wasm-wasi-wagi?gclid=CjwKCAjw2rmWBhB4EiwAiJ0mtVhiTuMZmy4bJSlk4nJj1deNX3KueomLgkG8JMyGeiHJ3FJRPpVn7BoCs58QAvD_BwE> if anybody is confused as I am :)
>>>
>>>> WASI adds a lot of simple things like access to a clock, random number generator, ... that would expand the scope of what transpiled code can do. It is debatable whether we'll want the power to run the transpiled code as a microservice. Using UDFs for XLang and UDFs and UDAFs for SQL as our expected use cases seem to make WASI the best choice. The issue is in the details as there is a hodgepodge of what language runtimes support and what are the limits of transpiling from a language to WebAssembly.
>>>
>>> Agree that WASI seems like a good target since it gives access to additional system resources/tooling.
>>>
>>>> Assuming WASI then it breaks down to these two aspects:
>>>>
>>>> 1) Does the host language have a runtime?
>>>> Java: https://github.com/wasmerio/wasmer-java
>>>> Python: https://github.com/wasmerio/wasmer-python
>>>> Go: https://github.com/wasmerio/wasmer-go
>>>>
>>>> 2) How good is compilation from source language to WebAssembly <https://github.com/appcypher/awesome-wasm-langs>?
>>>>
>>>> Java (very limited):
>>>> Issues with garbage collection and the need to transpile/replace much of the VM's capabilities plus the large standard library that everyone uses causes a lot of challenges.
>>>> JWebAssembly can do simple things like basic classes, strings, method calls. Should be able to compile trivial lambdas to Wasm. There are other choices but to my knowledge all are very limited.
>>>
>>> That's unfortunate. But hopefully Java support will be implemented soon?
>>>
>>>> Python <https://pythondev.readthedocs.io/wasm.html> (quite good):
>>>>
>>>> Features                 | CPython Emscripten browser | CPython Emscripten node | Pyodide
>>>> subprocess (fork, exec)  | no                         | no                      | no
>>>> threads                  | no                         | YES                     | WIP
>>>> file system              | no (only MEMFS)            | YES (Node raw FS)       | YES (IDB, Node, …)
>>>> shared extension modules | WIP                        | WIP                     | YES
>>>> PyPI packages            | no                         | no                      | YES
>>>> sockets                  | ?                          | ?                       | ?
>>>> urllib, asyncio          | no                         | no                      | WebAPI fetch / WebSocket
>>>> signals                  | no                         | WIP                     | YES
>>>>
>>>> Go (excellent): Native support in the Go compiler
>>>
>>> Great. Could executing Go UDFs in Python x-lang transforms (for example, Dataframe, RunInference, Python Map) be a good first target?
>>>
>>> Thanks,
>>> Cham
>>>
>>>> On Tue, Jul 12, 2022 at 5:51 PM Chamikara Jayalath via dev <dev@beam.apache.org> wrote:
>>>>
>>>>> On Wed, Jun 29, 2022 at 9:31 AM Luke Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> I have had interest in integrating Wasm within Beam as well as I have had a lot of interest in improving language portability.
>>>>>>
>>>>>> Wasm has a lot of benefits over using docker containers to provide a place for code to execute. From experience working on Beam's portability layer and internal Flume knowledge:
>>>>>> * encoding and decoding data is expensive, anything which ensures that in-memory representations of data can be transferred from the host to the guest and back without transcoding/re-interpreting will be a big win.
>>>>>> * reducing the number of times we need to pass data between guest and host and back is important
>>>>>> * fusing transforms reduces the number of data passing points
>>>>>> * batching (row or columnar) data reduces the number of times we need to pass data at each data passing point
>>>>>> * there are enough complicated use cases (state & timers, large iterables, side inputs) where handling the trivial map/flatmap use case will provide little value since it will prevent fusion
>>>>>>
>>>>>> I have been meaning to work on a prototype where we replace the current gRPC + docker path with one in which we use Wasm to execute a fused graph re-using large parts of the existing code base written to support portability.
>>>>>
>>>>> This sounds very interesting. Probably using Wasm to implement proper UDF support for x-lang (for example, executing Python timestamp/watermark functions provided through the Kafka Python x-lang wrapper on the Java Kafka transform) will be a good first target? My main question for this at this point is whether Wasm has adequate support for existing SDKs that use x-lang to implement this in a useful way.
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>>> On Fri, Jun 17, 2022 at 2:19 PM Brian Hulette <bhule...@google.com> wrote:
>>>>>>
>>>>>>> Re: Arrow - it's long been my dream to use Arrow for interchange in Beam [1]. I'm trying to move us in that direction with https://s.apache.org/batched-dofns (Arrow is discussed briefly in the Future Work section). This gives the Python SDK a concept of batches of logical elements. My goal is Beam schemas + batches of logical elements -> Arrow RecordBatches.
>>>>>>>
>>>>>>> The Batched DoFn infrastructure is stable as of the 2.40.0 release cut and I'm currently working on adding what I'm calling a "BatchConverter" [2] for Beam Rows -> Arrow RecordBatch. Once that's done it could be interesting to experiment with a "WasmDoFn" that uses Arrow for interchange.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> [1] https://docs.google.com/presentation/d/1D9vigwYTCuAuz_CO8nex3GK3h873acmQJE5Ui8TFsDY/edit#slide=id.g608e662464_0_160
>>>>>>> [2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/batch.py
>>>>>>>
>>>>>>> On Thu, Jun 16, 2022 at 10:55 AM Sean Jensen-Grey <jenseng...@google.com> wrote:
>>>>>>>
>>>>>>>> Interesting.
>>>>>>>>
>>>>>>>> Robert, I was just served an ad for Redpanda when I searched for "golang wasm" :)
>>>>>>>>
>>>>>>>> The storage and execution grid systems are all embracing wasm in some way.
>>>>>>>>
>>>>>>>> https://redpanda.com/
>>>>>>>> https://www.fluvio.io/
>>>>>>>> https://temporal.io/ (Cadence fork by the Cadence folks, I met Maxim the lead at Temporal at the 2020 Wasm Summit)
>>>>>>>> https://github.com/pachyderm/pachyderm no mention of wasm, yet.
>>>>>>>>
>>>>>>>> Keep the Wasm+Beam demos coming.
>>>>>>>>
>>>>>>>> Sean
>>>>>>>>
>>>>>>>> On Thu, Jun 16, 2022 at 4:23 AM Steven van Rossum <sjvanros...@google.com> wrote:
>>>>>>>>
>>>>>>>>> I caught up with all the replies through the web interface, but I didn't have my list subscription set up correctly so my reply (TL;DR sample code available at https://github.com/sjvanrossum/beam-wasm) didn't come through until a bit later yesterday I think.
>>>>>>>>>
>>>>>>>>> Sean, I agree with your suggestion of Arrow as the interchange format for Wasm transforms and it's something I thought about exploring when I was adding serialization/deserialization of complex (meaning anything that's not an integer or float in the context of Wasm) data types in the demo. It's an unfortunate bit of overhead which could very well be solved with Arrow and shared memory between Wasm modules.
>>>>>>>>> I've seen Wasm transforms pop up in a few other places, notably in streaming data platforms like Fluvio and Redpanda, and they seem to incur the same overhead when moving data into and out of the guest context, so maybe it's negligible, but I haven't done any serious benchmark yet to validate that.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Steve
>>>>>>>>>
>>>>>>>>> On Thu, Jun 16, 2022 at 3:04 AM Robert Burke <rob...@frantil.com> wrote:
>>>>>>>>>
>>>>>>>>>> Obligatory mention that WASM is basically an architecture that any well-meaning compiler can target, e.g. the Go compiler
>>>>>>>>>>
>>>>>>>>>> https://www.bradcypert.com/an-introduction-to-targeting-web-assembly-with-golang/
>>>>>>>>>>
>>>>>>>>>> (Among many articles for the last few years)
>>>>>>>>>>
>>>>>>>>>> Robert Burke
>>>>>>>>>> Beam Go Busybody
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 15, 2022, 2:04 PM Sean Jensen-Grey <jenseng...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Heh, my stage fright was so strong, I didn't realize that the talk was recorded. :)
>>>>>>>>>>>
>>>>>>>>>>> Steven, I'd love to chat about Wasm in Beam. This email is a bit rough.
>>>>>>>>>>>
>>>>>>>>>>> I haven't explored Wasm in Beam much since that talk. I think the most compelling use is in the portability of logic between data processing systems.
>>>>>>>>>>> Especially in the use of probabilistic data structures like Bloom Filters, Count-Min-Sketch, HyperLogLog, where it is nice to persist the data structure and use it on a different system. Like generating a Bloom filter in Beam and using it inside of a BQ query w/o having to reimplement and test across many platforms.
>>>>>>>>>>>
>>>>>>>>>>> I have used Wasm in BQ, as BQ UDFs are driven by V8. Anywhere V8 exists, Wasm support exists for free unless the embedder goes out of their way to disable it. So it is supported in Deno/Node as well. In Python, Wasm support via Wasmtime <https://github.com/bytecodealliance/wasmtime> is really good. There are *many* options for execution environments; one of the downsides of passing through a JS one is string and number support (float/int64) issues, afaik. I could be wrong, maybe JS has fixed all this by now.
>>>>>>>>>>>
>>>>>>>>>>> The qualities in order of importance (for me) are
>>>>>>>>>>>
>>>>>>>>>>> 1. Portability, run the same code everywhere
>>>>>>>>>>> 2. Security, memory safety for the caller. Running Wasm inside of Python should never crash your Python interpreter. The capability model ensures that the Wasm module can only do what you allow it to
>>>>>>>>>>> 3. Performance (portable), compile once and run everywhere within some margin of native. Python makes this look good :)
>>>>>>>>>>>
>>>>>>>>>>> I think something worth exploring is moving opaque-ish Arrow objects around via Beam, so that Beam is now mostly in the control plane and computation happens in Wasm. This should reduce the serialization overhead and also get Python out of the datapath.
>>>>>>>>>>>
>>>>>>>>>>> I see someone exploring Wasm+Arrow here, https://github.com/domoritz/arrow-wasm
>>>>>>>>>>>
>>>>>>>>>>> Another possibly interesting avenue to explore is compiling command line programs to WASI (WebAssembly System Interface), the POSIX-like shim, so that they can be run in-process without the fork/exec/pipe overhead of running a subprocess. A neat demo might be running something like Jq <https://stedolan.github.io/jq/> inside of a Beam job.
>>>>>>>>>>>
>>>>>>>>>>> Not to make Wasm sound like a Python-only technology, it can be used from Java/JVM via
>>>>>>>>>>>
>>>>>>>>>>> - https://www.graalvm.org/22.1/reference-manual/wasm/
>>>>>>>>>>> - https://github.com/kawamuray/wasmtime-java
>>>>>>>>>>>
>>>>>>>>>>> Sean
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 15, 2022 at 9:35 AM Pablo Estrada <pabl...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> adding Steven in case he didn't get the replies : )
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jun 15, 2022 at 9:29 AM Daniel Collins <dpcoll...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If we ever do anything with the JS runtime, this would seem to be the best place to run WASM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 8:13 PM Brian Hulette <bhule...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> FYI: @Sean Jensen-Grey <jenseng...@google.com> gave a talk back in 2020 where he had integrated Rust with the Python SDK. I thought he used WebAssembly for that, but it looks like he used some other approaches, and his talk mentioned WebAssembly as future work. Not sure if that was ever explored.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://www.youtube.com/watch?v=fZK_Tiu7q1o
>>>>>>>>>>>>>> https://github.com/seanjensengrey/beam-rust-python-java
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 5:05 PM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Adding @Lukasz Cwik <lc...@google.com> - he was interested in the WebAssembly topic.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 3:09 PM Pablo Estrada <pabl...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Would you open a pull request for it? Or at least share a branch? : )
>>>>>>>>>>>>>>>> Even if we don't want to merge it, it would be great to have a PR as a way to showcase the work, its usefulness, and receive comments on this thread once we can see something more specific.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 3:05 PM Steven van Rossum <sjvanros...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I had some spare time yesterday and thought it'd be fun to implement a transform which runs WebAssembly modules as a lightweight way to implement cross language transforms for languages which don't (yet) have an SDK implementation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've got a small proof of concept running in the Python SDK as a DoFn with Wasmer as the WebAssembly runtime and simple support for marshalling between the host and guest environment with the RowCoder.
>>>>>>>>>>>>>>>>> The module I've constructed is mostly useless, but demonstrates the host copying the encoded element into the guest's memory, the guest copying those bytes elsewhere in its linear memory buffer, the guest calling back to the host with the offset and size and the host copying and decoding from the guest's memory.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Any thoughts/interest? I'm not sure where I was going with this, since it was mostly just a "wouldn't it be cool if..." on a Monday afternoon, but I can see a few use cases for this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Steve
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Steven van Rossum | Strategic Cloud Engineer | sjvanros...@google.com | (+31) (0)6 21174069
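
For anyone skimming the thread, the host/guest round trip described in the original message can be sketched roughly as below. This is only a simulation under stated assumptions: a `bytearray` stands in for the guest's linear memory, `FakeGuest` is a made-up stand-in for a Wasm instance, and `struct` packing stands in for the RowCoder. No real Wasm runtime is involved and this is not the code from the demo repository.

```python
import struct


class FakeGuest:
    """Hypothetical stand-in for a Wasm instance (no real runtime involved)."""

    def __init__(self, memory_size, host_callback):
        # A bytearray plays the role of the guest's linear memory.
        self.memory = bytearray(memory_size)
        # Plays the role of a host function imported by the guest.
        self._host_callback = host_callback

    def process(self, offset, size):
        # The guest copies the encoded bytes elsewhere in its linear memory...
        out_offset = offset + size
        if out_offset + size > len(self.memory):
            raise MemoryError("guest linear memory too small")
        self.memory[out_offset:out_offset + size] = self.memory[offset:offset + size]
        # ...and calls back to the host with the new offset and size.
        self._host_callback(out_offset, size)


def encode_row(user_id, score):
    # Stand-in for RowCoder: little-endian int64 followed by float64.
    return struct.pack("<qd", user_id, score)


def decode_row(data):
    return struct.unpack("<qd", data)


def round_trip(row):
    received = []

    def on_result(offset, size):
        # The host copies the result bytes out of the guest's memory.
        received.append(bytes(guest.memory[offset:offset + size]))

    guest = FakeGuest(64, on_result)
    encoded = encode_row(*row)
    guest.memory[0:len(encoded)] = encoded  # host copies the element into the guest
    guest.process(0, len(encoded))          # host calls the guest's exported function
    return decode_row(received[0])
```

Calling `round_trip((42, 1.5))` hands back the same row after two copies and one encode/decode pair, which is the per-element overhead that the Arrow and shared-memory ideas in this thread aim to reduce.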