I agree with Luke. Little helper UDFs that go along with IOs, like
timestamp extractors that have to parse particular data formats, are
actually a major feature gap for xlang. This could be a very useful place to
try out the design options. I think we can simplify the problem by
insisting that they are pure functions that do not access state or side
inputs.
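
A minimal sketch of the kind of pure helper meant here, assuming a made-up
record format where the first comma-separated field is an ISO-8601
timestamp. The function name and the format are hypothetical and only for
illustration; the point is that the function depends on nothing but its
argument, so it needs no state, side inputs, clock, or I/O from the host.

```python
from datetime import datetime, timedelta, timezone

# Fixed reference point; a constant, not hidden mutable state.
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def extract_timestamp_micros(record: bytes) -> int:
    """Return the event time of `record` as microseconds since the epoch.

    Assumed (hypothetical) format: b"<ISO-8601 timestamp>,<rest of payload>".
    Timestamps without a UTC offset are treated as UTC.
    """
    first_field = record.split(b",", 1)[0].decode("utf-8").strip()
    parsed = datetime.fromisoformat(first_field)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating point rounding.
    return (parsed - _EPOCH) // timedelta(microseconds=1)


if __name__ == "__main__":
    print(extract_timestamp_micros(b"2022-07-13T19:52:00+00:00,some payload"))
```

Because the signature is just bytes in, integer out, a helper like this
would only need the simplest possible type mapping between host and guest.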

On Wed, Jul 13, 2022 at 7:52 PM Luke Cwik via dev <dev@beam.apache.org>
wrote:

> I think an easier target would be to support things like
> DynamicDestinations for Java IO connectors that are exposed as XLang for
> Go/Python.
>
> This is because Go/Python have good transpiling support to WebAssembly
> and we have already exposed several Java IO XLang connectors, so it's
> about plumbing one more thing through for these IO connectors.
>
> What interface should we expect for UDFs / UDAFs? Should they be
> purpose oriented, or should we do something like we did for portability,
> where we have a graph of transforms that we feed arbitrary data in/out
> from? The latter would have the benefit of allowing the runner to embed the
> language execution directly within the runner and would pay the Wasm
> communication tax instead of the gRPC communication tax. If we do the
> former we still have the same issues: we need a type system to pass
> information between the host system and the transpiled WebAssembly code
> that wraps the user's UDF/UDAF, and what if the UDF wants access to side
> inputs or user state ...
>
> On Wed, Jul 13, 2022 at 4:09 PM Chamikara Jayalath <chamik...@google.com>
> wrote:
>
>>
>>
>> On Wed, Jul 13, 2022 at 9:31 AM Luke Cwik <lc...@google.com> wrote:
>>
>>> First we'll want to choose whether we want to target Wasm, WASI or Wagi.
>>>
>>
>> These terms are defined here
>> <https://www.fermyon.com/blog/wasm-wasi-wagi>
>> if anybody is as confused as I am :)
>>
>>
>>> WASI adds a lot of simple things like access to a clock, random number
>>> generator, ... that would expand the scope of what transpiled code can do.
>>> It is debatable whether we'll want the power to run the transpiled code as
>>> a microservice. Using UDFs for XLang and UDFs and UDAFs for SQL as our
>>> expected use cases seems to make WASI the best choice. The issue is in the
>>> details, as there is a hodgepodge of what language runtimes support and what
>>> the limits are of transpiling from a language to WebAssembly.
>>>
>>
>> Agree that WASI seems like a good target since it gives access to
>> additional system resources/tooling.
>>
>>
>>>
>>> Assuming WASI then it breaks down to these two aspects:
>>> 1) Does the host language have a runtime?
>>> Java: https://github.com/wasmerio/wasmer-java
>>> Python: https://github.com/wasmerio/wasmer-python
>>> Go: https://github.com/wasmerio/wasmer-go
>>>
>>> 2) How good is compilation from source language to WebAssembly
>>> <https://github.com/appcypher/awesome-wasm-langs>?
>>> Java (very limited):
>>> Issues with garbage collection and the need to transpile/replace much of
>>> the VM's capabilities plus the large standard library that everyone uses
>>> causes a lot of challenges.
>>> JWebAssembly can do simple things like basic classes, strings, method
>>> calls. Should be able to compile trivial lambdas to Wasm. There are other
>>> choices but to my knowledge all are very limited.
>>>
>>
>> That's unfortunate. But hopefully Java support will be implemented soon ?
>>
>>
>>>
>>> Python <https://pythondev.readthedocs.io/wasm.html> (quite good):
>>> Features                 | CPython Emscripten browser | CPython Emscripten node | Pyodide
>>> -------------------------+----------------------------+-------------------------+--------------------------
>>> subprocess (fork, exec)  | no                         | no                      | no
>>> threads                  | no                         | YES                     | WIP
>>> file system              | no (only MEMFS)            | YES (Node raw FS)       | YES (IDB, Node, …)
>>> shared extension modules | WIP                        | WIP                     | YES
>>> PyPI packages            | no                         | no                      | YES
>>> sockets                  | ?                          | ?                       | ?
>>> urllib, asyncio          | no                         | no                      | WebAPI fetch / WebSocket
>>> signals                  | no                         | WIP                     | YES
>>>
>>> Go (excellent): Native support in go compiler
>>>
>>
>> Great. Could executing Go UDFs in Python x-lang transforms (for example,
>> Dataframe, RunInference, Python Map) be a good first target ?
>>
>> Thanks,
>> Cham
>>
>>
>>>
>>> On Tue, Jul 12, 2022 at 5:51 PM Chamikara Jayalath via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Jun 29, 2022 at 9:31 AM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>>> I have had interest in integrating Wasm within Beam as well, as I have
>>>>> had a lot of interest in improving language portability.
>>>>>
>>>>> Wasm has a lot of benefits over using docker containers to provide a
>>>>> place for code to execute. From experience working on Beam's
>>>>> portability layer and internal Flume knowledge:
>>>>> * encoding and decoding data is expensive; anything which ensures that
>>>>> in-memory representations of data can be transferred from the host to
>>>>> the guest and back without transcoding/re-interpreting will be a big win.
>>>>> * reducing the number of times we need to pass data between guest and
>>>>> host and back is important
>>>>>   * fusing transforms reduces the number of data passing points
>>>>>   * batching (row or columnar) data reduces the number of times we
>>>>> need to pass data at each data passing point
>>>>> * there are enough complicated use cases (state & timers, large
>>>>> iterables, side inputs) that handling only the trivial map/flatmap use
>>>>> case will provide little value since it will prevent fusion
>>>>>
>>>>> I have been meaning to work on a prototype where we replace the
>>>>> current gRPC + docker path with one in which we use Wasm to execute a
>>>>> fused graph re-using large parts of the existing code base written to
>>>>> support portability.
>>>>>
>>>>
>>>> This sounds very interesting. Probably using Wasm to implement proper
>>>> UDF support for x-lang (for example, executing Python timestamp/watermark
>>>> functions provided through the Kafka Python x-lang wrapper on the Java
>>>> Kafka transform) will be a good first target ? My main question for this at
>>>> this point is whether Wasm has adequate support for existing SDKs that use
>>>> x-lang to implement this in a useful way.
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>
>>>>>
>>>>>
>>>>> On Fri, Jun 17, 2022 at 2:19 PM Brian Hulette <bhule...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Re: Arrow - it's long been my dream to use Arrow for interchange in
>>>>>> Beam [1]. I'm trying to move us in that direction with
>>>>>> https://s.apache.org/batched-dofns (arrow is discussed briefly in
>>>>>> the Future Work section). This gives the Python SDK a concept of
>>>>>> batches of logical elements. My goal is Beam schemas + batches of
>>>>>> logical elements -> Arrow RecordBatches.
>>>>>>
>>>>>> The Batched DoFn infrastructure is stable as of the 2.40.0 release
>>>>>> cut and I'm currently working on adding what I'm calling a
>>>>>> "BatchConverter" [2] for Beam Rows -> Arrow RecordBatch. Once that's
>>>>>> done it could be interesting to experiment with a "WasmDoFn" that uses
>>>>>> Arrow for interchange.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> [1]
>>>>>> https://docs.google.com/presentation/d/1D9vigwYTCuAuz_CO8nex3GK3h873acmQJE5Ui8TFsDY/edit#slide=id.g608e662464_0_160
>>>>>> [2]
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/batch.py
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 16, 2022 at 10:55 AM Sean Jensen-Grey <
>>>>>> jenseng...@google.com> wrote:
>>>>>>
>>>>>>> Interesting.
>>>>>>>
>>>>>>> Robert, I was just served an ad for Redpanda when I searched for
>>>>>>> "golang wasm" :)
>>>>>>>
>>>>>>> The storage and execution grid systems are all embracing wasm in
>>>>>>> some way.
>>>>>>>
>>>>>>> https://redpanda.com/
>>>>>>> https://www.fluvio.io/
>>>>>>> https://temporal.io/ (Cadence fork by the Cadence folks, I met
>>>>>>> Maxim the lead at Temporal at the 2020 Wasm Summit)
>>>>>>> https://github.com/pachyderm/pachyderm no mention of wasm, yet.
>>>>>>>
>>>>>>> Keep the Wasm+Beam demos coming.
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jun 16, 2022 at 4:23 AM Steven van Rossum <
>>>>>>> sjvanros...@google.com> wrote:
>>>>>>>
>>>>>>>> I caught up with all the replies through the web interface, but I
>>>>>>>> didn't have my list subscription set up correctly so my reply (TL;DR
>>>>>>>> sample code available at https://github.com/sjvanrossum/beam-wasm)
>>>>>>>> didn't come through until a bit later yesterday I think.
>>>>>>>>
>>>>>>>> Sean, I agree with your suggestion of Arrow as the interchange
>>>>>>>> format for Wasm transforms and it's something I thought about exploring
>>>>>>>> when I was adding serialization/deserialization of complex (meaning
>>>>>>>> anything that's not an integer or float in the context of Wasm) data
>>>>>>>> types in the demo. It's an unfortunate bit of overhead which could very
>>>>>>>> well be solved with Arrow and shared memory between Wasm modules.
>>>>>>>> I've seen Wasm transforms pop up in a few other places, notably in
>>>>>>>> streaming data platforms like Fluvio and Redpanda and they seem to
>>>>>>>> incur the same overhead when moving data into and out of the guest
>>>>>>>> context so maybe it's negligible, but I haven't done any serious
>>>>>>>> benchmark yet to validate that.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Steve
>>>>>>>>
>>>>>>>> On Thu, Jun 16, 2022 at 3:04 AM Robert Burke <rob...@frantil.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Obligatory mention that WASM is basically an architecture that any
>>>>>>>>> well meaning compiler can target, eg the Go compiler
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://www.bradcypert.com/an-introduction-to-targeting-web-assembly-with-golang/
>>>>>>>>>
>>>>>>>>> (Among many articles for the last few years)
>>>>>>>>>
>>>>>>>>> Robert Burke
>>>>>>>>> Beam Go Busybody
>>>>>>>>>
>>>>>>>>> On Wed, Jun 15, 2022, 2:04 PM Sean Jensen-Grey <
>>>>>>>>> jenseng...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Heh, my stage fright was so strong, I didn't realize that the
>>>>>>>>>> talk was recorded. :)
>>>>>>>>>>
>>>>>>>>>> Steven, I'd love to chat about Wasm in Beam. This email is a bit
>>>>>>>>>> rough.
>>>>>>>>>>
>>>>>>>>>> I haven't explored Wasm in Beam much since that talk. I think the
>>>>>>>>>> most compelling use is in the portability of logic between data
>>>>>>>>>> processing systems, especially in the use of probabilistic data
>>>>>>>>>> structures like Bloom Filters, Count-Min-Sketch, HyperLogLog, where it
>>>>>>>>>> is nice to persist the data structure and use it on a different system.
>>>>>>>>>> Like generating a bloom filter in Beam and using it inside of a BQ
>>>>>>>>>> query w/o having to reimplement and test across many platforms.
>>>>>>>>>>
>>>>>>>>>> I have used Wasm in BQ, as BQ UDFs are driven by V8. Anywhere V8
>>>>>>>>>> exists, Wasm support exists for free unless the embedder goes out of
>>>>>>>>>> their way to disable it. So it is supported in Deno/Node as well. In
>>>>>>>>>> Python, Wasm support via Wasmtime
>>>>>>>>>> <https://github.com/bytecodealliance/wasmtime> is really good.
>>>>>>>>>> There are *many* options for execution environments. One of the
>>>>>>>>>> downsides of passing through a JS one is string and number
>>>>>>>>>> (float/int64) support issues, afaik. I could be wrong, maybe JS has
>>>>>>>>>> fixed all this by now.
>>>>>>>>>>
>>>>>>>>>> The qualities in order of importance (for me) are
>>>>>>>>>>
>>>>>>>>>>    1. Portability, run the same code everywhere
>>>>>>>>>>    2. Security, memory safety for the caller. Running Wasm
>>>>>>>>>>    inside of Python should never crash your Python interpreter. The
>>>>>>>>>>    capability model ensures that the Wasm module can only do what you
>>>>>>>>>>    allow it to
>>>>>>>>>>    3. Performance (portable), compile once and run everywhere
>>>>>>>>>>    within some margin of native.  Python makes this look good :)
>>>>>>>>>>
>>>>>>>>>> I think something worth exploring is moving opaque-ish Arrow
>>>>>>>>>> objects around via Beam, so that Beam is now mostly in the control
>>>>>>>>>> plane and computation happens in Wasm. This should reduce the
>>>>>>>>>> serialization overhead and also get Python out of the datapath.
>>>>>>>>>>
>>>>>>>>>> I see someone exploring Wasm+Arrow here,
>>>>>>>>>> https://github.com/domoritz/arrow-wasm
>>>>>>>>>>
>>>>>>>>>> Another possibly interesting avenue to explore is compiling
>>>>>>>>>> command line programs to Wasi (WebAssembly System Interface), the
>>>>>>>>>> POSIX like shim, so that they can be run inprocess without the
>>>>>>>>>> fork/exec/pipe overhead of running a subprocess. A neat demo might be
>>>>>>>>>> running something like Jq <https://stedolan.github.io/jq/> inside of a
>>>>>>>>>> Beam job.
>>>>>>>>>>
>>>>>>>>>> Not to make Wasm sound like a Python only technology, it can be
>>>>>>>>>> used via Java/JVM via
>>>>>>>>>>
>>>>>>>>>>    - https://www.graalvm.org/22.1/reference-manual/wasm/
>>>>>>>>>>    - https://github.com/kawamuray/wasmtime-java
>>>>>>>>>>
>>>>>>>>>> Sean
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 15, 2022 at 9:35 AM Pablo Estrada <pabl...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> adding Steven in case he didn't get the replies : )
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 15, 2022 at 9:29 AM Daniel Collins <
>>>>>>>>>>> dpcoll...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If we ever do anything with the JS runtime, this would seem to
>>>>>>>>>>>> be the best place to run WASM.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jun 14, 2022 at 8:13 PM Brian Hulette <
>>>>>>>>>>>> bhule...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> FYI: @Sean Jensen-Grey <jenseng...@google.com> gave a talk
>>>>>>>>>>>>> back in 2020 where he had integrated Rust with the Python SDK. I
>>>>>>>>>>>>> thought he used WebAssembly for that, but it looks like he used some
>>>>>>>>>>>>> other approaches, and his talk mentioned WebAssembly as future work.
>>>>>>>>>>>>> Not sure if that was ever explored.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://www.youtube.com/watch?v=fZK_Tiu7q1o
>>>>>>>>>>>>> https://github.com/seanjensengrey/beam-rust-python-java
>>>>>>>>>>>>>
>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 5:05 PM Ahmet Altay <al...@google.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Adding @Lukasz Cwik <lc...@google.com> - he was interested
>>>>>>>>>>>>>> in the WebAssembly topic.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 3:09 PM Pablo Estrada <
>>>>>>>>>>>>>> pabl...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Would you open a pull request for it? Or at least share a
>>>>>>>>>>>>>>> branch? : )
>>>>>>>>>>>>>>> Even if we don't want to merge it, it would be great to have
>>>>>>>>>>>>>>> a PR as a way to showcase the work, its usefulness, and receive
>>>>>>>>>>>>>>> comments on this thread once we can see something more specific.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 3:05 PM Steven van Rossum <
>>>>>>>>>>>>>>> sjvanros...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I had some spare time yesterday and thought it'd be fun to
>>>>>>>>>>>>>>>> implement a transform which runs WebAssembly modules as a
>>>>>>>>>>>>>>>> lightweight way to implement cross language transforms for
>>>>>>>>>>>>>>>> languages which don't (yet) have an SDK implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've got a small proof of concept running in the Python SDK
>>>>>>>>>>>>>>>> as a DoFn with Wasmer as the WebAssembly runtime and simple
>>>>>>>>>>>>>>>> support for marshalling between the host and guest environment
>>>>>>>>>>>>>>>> with the RowCoder. The module I've constructed is mostly useless,
>>>>>>>>>>>>>>>> but demonstrates the host copying the encoded element into the
>>>>>>>>>>>>>>>> guest's memory, the guest copying those bytes elsewhere in its
>>>>>>>>>>>>>>>> linear memory buffer, the guest calling back to the host with the
>>>>>>>>>>>>>>>> offset and size and the host copying and decoding from the guest's
>>>>>>>>>>>>>>>> memory.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any thoughts/interest? I'm not sure where I was going with
>>>>>>>>>>>>>>>> this, since it was mostly just a "wouldn't it be cool if..." on a
>>>>>>>>>>>>>>>> Monday afternoon, but I can see a few use cases for this.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Steve
>>>>>>>>>>>>>>>

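The host/guest handoff described in the original message of this thread
(the host copies the encoded element into guest memory, the guest copies
those bytes elsewhere in its linear memory and calls back with an offset
and size, and the host reads the result out) can be sketched without a
Wasm runtime. This is only an illustration under stated assumptions: a
bytearray stands in for the guest's linear memory, UTF-8 stands in for the
RowCoder, and every name below (GuestMemory, guest_process,
host_round_trip) is hypothetical. It is not the code from the proof of
concept and makes no Wasmer calls.

```python
class GuestMemory:
    """Stand-in for a Wasm module's linear memory: one flat byte buffer."""

    def __init__(self, size: int = 64 * 1024) -> None:
        self.buf = bytearray(size)
        self.next_free = 0

    def alloc(self, n: int) -> int:
        """Bump allocator: reserve n bytes and return their offset."""
        if self.next_free + n > len(self.buf):
            raise MemoryError("guest memory exhausted")
        offset = self.next_free
        self.next_free += n
        return offset


def guest_process(mem, in_offset, in_size, host_callback):
    """Pretend guest code: copy the input elsewhere in its own memory,
    then tell the host where the result lives (offset and size only)."""
    out_offset = mem.alloc(in_size)
    mem.buf[out_offset:out_offset + in_size] = mem.buf[in_offset:in_offset + in_size]
    host_callback(out_offset, in_size)


def host_round_trip(element: str) -> str:
    """Host side: encode, copy in, invoke the guest, copy out, decode."""
    mem = GuestMemory()
    encoded = element.encode("utf-8")  # stand-in for RowCoder encoding

    # 1. Host copies the encoded element into guest memory.
    in_offset = mem.alloc(len(encoded))
    mem.buf[in_offset:in_offset + len(encoded)] = encoded

    # 2. Guest does its work and calls back with (offset, size).
    results = []

    def on_result(offset: int, size: int) -> None:
        # 3. Host copies the bytes out of guest memory.
        results.append(bytes(mem.buf[offset:offset + size]))

    guest_process(mem, in_offset, len(encoded), on_result)

    # 4. Host decodes the copied bytes.
    return results[0].decode("utf-8")


if __name__ == "__main__":
    print(host_round_trip("hello beam"))
```

Only integers (offsets and sizes) cross the call boundary; everything else
is copied through the shared buffer, which is where the encode/decode
overhead discussed earlier in the thread comes from.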