+1, and this is exactly what I suggested as well. Python Dataframe,
RunInference, and Python Map are already available via x-lang for Java [1],
and all three need/use simple UDFs to customize their operation. Some logic
still needs to be added to use Python transforms from the Go SDK. As you
suggested, there are many Java x-lang transforms that could use simple UDF
support as well. Either language combination should work for implementing a
first proof of concept for WASI support while also addressing an existing
limitation.

Thanks,
Cham

[1]
https://github.com/apache/beam/tree/master/sdks/java/extensions/python/src/main/java/org/apache/beam/sdk/extensions/python/transforms
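
To make the "simple UDF" idea concrete, here is a minimal sketch of the kind of
pure helper function discussed in this thread: a timestamp extractor that parses
one particular record format and touches no state or side inputs. The function
name and the "timestamp,payload" record layout are assumptions made up for
illustration; they are not part of Beam or of any x-lang transform.

```python
from datetime import datetime, timezone


def extract_timestamp_millis(record: bytes) -> int:
    """Hypothetical pure UDF: return the event time of a record in epoch millis.

    Assumes (for illustration only) that a record is UTF-8 text of the form
    "<ISO-8601 UTC timestamp>,<payload>", e.g. b"2022-07-13T20:26:00Z,hello".
    """
    text = record.decode("utf-8")
    # Everything before the first comma is the timestamp field.
    ts_field, _, _payload = text.partition(",")
    parsed = datetime.strptime(ts_field, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing "Z" means UTC, so attach the UTC zone explicitly.
    parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)
```

A function of this shape only maps input bytes to a number, which is what would
make it a plausible candidate for compiling to Wasm and handing to a transform
implemented in another SDK.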

On Wed, Jul 13, 2022 at 8:26 PM Kenneth Knowles <k...@apache.org> wrote:

> I agree with Luke. Little helper UDFs that go along with IOs are actually a
> major feature gap for xlang, like timestamp extractors that have to parse
> particular data formats. This could be a very useful place to try out the
> design options. I think we can simplify the problem by insisting that they
> are pure functions that do not access state or side inputs.
>
> On Wed, Jul 13, 2022 at 7:52 PM Luke Cwik via dev <dev@beam.apache.org>
> wrote:
>
>> I think an easier target would be to support things like
>> DynamicDestinations for Java IO connectors that are exposed as XLang for
>> Go/Python.
>>
>> This is because Go/Python have good transpiling support to WebAssembly and
>> we have already exposed several Java IO XLang connectors, so it's about
>> plumbing one more thing through for these IO connectors.
>>
>> What interface should we expect for UDFs / UDAFs? Should they be purpose
>> oriented, or should we do something like we did for portability, where we
>> have a graph of transforms that we feed arbitrary data in/out from? The
>> latter would have the benefit of allowing the runner to embed the language
>> execution directly within the runner, and would pay the Wasm communication
>> tax instead of the gRPC communication tax. If we do the former, we still
>> have the same issues: we need a type system to pass information between the
>> host system and the transpiled WebAssembly code that wraps the user's
>> UDF/UDAF, and what if the UDF wants access to side inputs or user state ...
>>
>> On Wed, Jul 13, 2022 at 4:09 PM Chamikara Jayalath <chamik...@google.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Jul 13, 2022 at 9:31 AM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> First we'll want to choose whether we want to target Wasm, WASI or Wagi.
>>>>
>>>
>>> These terms are defined here
>>> <https://www.fermyon.com/blog/wasm-wasi-wagi?gclid=CjwKCAjw2rmWBhB4EiwAiJ0mtVhiTuMZmy4bJSlk4nJj1deNX3KueomLgkG8JMyGeiHJ3FJRPpVn7BoCs58QAvD_BwE>
>>> if anybody is as confused as I am :)
>>>
>>>
>>>> WASI adds a lot of simple things like access to a clock, random number
>>>> generator, ... that would expand the scope of what transpiled code can do.
>>>> It is debatable whether we'll want the power to run the transpiled code as
>>>> a microservice. Using UDFs for XLang and UDFs and UDAFs for SQL as our
>>>> expected use cases seems to make WASI the best choice. The issue is in the
>>>> details, as there is a hodgepodge of what language runtimes support and
>>>> what the limits of transpiling from a language to WebAssembly are.
>>>>
>>>
>>> Agree that WASI seems like a good target since it gives access to
>>> additional system resources/tooling.
>>>
>>>
>>>>
>>>> Assuming WASI then it breaks down to these two aspects:
>>>> 1) Does the host language have a runtime?
>>>> Java: https://github.com/wasmerio/wasmer-java
>>>> Python: https://github.com/wasmerio/wasmer-python
>>>> Go: https://github.com/wasmerio/wasmer-go
>>>>
>>>> 2) How good is compilation from source language to WebAssembly
>>>> <https://github.com/appcypher/awesome-wasm-langs>?
>>>> Java (very limited):
>>>> Issues with garbage collection and the need to transpile/replace much
>>>> of the VM's capabilities, plus the large standard library that everyone
>>>> uses, cause a lot of challenges.
>>>> JWebAssembly can do simple things like basic classes, strings, method
>>>> calls. Should be able to compile trivial lambdas to Wasm. There are other
>>>> choices but to my knowledge all are very limited.
>>>>
>>>
>>> That's unfortunate. But hopefully Java support will be implemented soon ?
>>>
>>>
>>>>
>>>> Python <https://pythondev.readthedocs.io/wasm.html> (quite good):
>>>> Features                 | CPython Emscripten browser | CPython Emscripten node | Pyodide
>>>> ------------------------ | -------------------------- | ----------------------- | -------
>>>> subprocess (fork, exec)  | no                         | no                      | no
>>>> threads                  | no                         | YES                     | WIP
>>>> file system              | no (only MEMFS)            | YES (Node raw FS)       | YES (IDB, Node, …)
>>>> shared extension modules | WIP                        | WIP                     | YES
>>>> PyPI packages            | no                         | no                      | YES
>>>> sockets                  | ?                          | ?                       | ?
>>>> urllib, asyncio          | no                         | no                      | WebAPI fetch / WebSocket
>>>> signals                  | no                         | WIP                     | YES
>>>>
>>>> Go (excellent): Native support in go compiler
>>>>
>>>
>>> Great. Could executing Go UDFs in Python x-lang transforms (for example,
>>> Dataframe, RunInference, Python Map) be a good first target ?
>>>
>>> Thanks,
>>> Cham
>>>
>>>
>>>>
>>>> On Tue, Jul 12, 2022 at 5:51 PM Chamikara Jayalath via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Jun 29, 2022 at 9:31 AM Luke Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> I have had interest in integrating Wasm within Beam as well as I have
>>>>>> had a lot of interest in improving language portability.
>>>>>>
>>>>>> Wasm has a lot of benefits over using docker containers to provide a
>>>>>> place for code to execute. From experience working on Beam's
>>>>>> portability layer and internal Flume knowledge:
>>>>>> * encoding and decoding data is expensive; anything which ensures
>>>>>> that in-memory representations of data can be transferred from the host
>>>>>> to the guest and back without transcoding/re-interpreting will be a big
>>>>>> win.
>>>>>> * reducing the number of times we need to pass data between guest and
>>>>>> host and back is important
>>>>>>   * fusing transforms reduces the number of data passing points
>>>>>>   * batching (row or columnar) data reduces the number of times we
>>>>>> need to pass data at each data passing point
>>>>>> * there are enough complicated use cases (state & timers, large
>>>>>> iterables, side inputs) where handling the trivial map/flatmap use case
>>>>>> will provide little value, since it will prevent fusion
>>>>>>
>>>>>> I have been meaning to work on a prototype where we replace the
>>>>>> current gRPC + docker path with one in which we use Wasm to execute a
>>>>>> fused graph, re-using large parts of the existing code base written to
>>>>>> support portability.
>>>>>>
>>>>>
>>>>> This sounds very interesting. Probably using Wasm to implement proper
>>>>> UDF support for x-lang (for example, executing Python timestamp/watermark
>>>>> functions provided through the Kafka Python x-lang wrapper on the Java
>>>>> Kafka transform) will be a good first target? My main question at this
>>>>> point is whether Wasm has adequate support for the existing SDKs that use
>>>>> x-lang to implement this in a useful way.
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 17, 2022 at 2:19 PM Brian Hulette <bhule...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Re: Arrow - it's long been my dream to use Arrow for interchange in
>>>>>>> Beam [1]. I'm trying to move us in that direction with
>>>>>>> https://s.apache.org/batched-dofns (Arrow is discussed briefly in
>>>>>>> the Future Work section). This gives the Python SDK a concept of
>>>>>>> batches of logical elements. My goal is Beam schemas + batches of
>>>>>>> logical elements -> Arrow RecordBatches.
>>>>>>>
>>>>>>> The Batched DoFn infrastructure is stable as of the 2.40.0 release
>>>>>>> cut and I'm currently working on adding what I'm calling a
>>>>>>> "BatchConverter" [2] for Beam Rows -> Arrow RecordBatch. Once that's
>>>>>>> done it could be interesting to experiment with a "WasmDoFn" that uses
>>>>>>> Arrow for interchange.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> [1]
>>>>>>> https://docs.google.com/presentation/d/1D9vigwYTCuAuz_CO8nex3GK3h873acmQJE5Ui8TFsDY/edit#slide=id.g608e662464_0_160
>>>>>>> [2]
>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/batch.py
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jun 16, 2022 at 10:55 AM Sean Jensen-Grey <
>>>>>>> jenseng...@google.com> wrote:
>>>>>>>
>>>>>>>> Interesting.
>>>>>>>>
>>>>>>>> Robert, I was just served an ad for Redpanda when I searched for
>>>>>>>> "golang wasm" :)
>>>>>>>>
>>>>>>>> The storage and execution grid systems are all embracing wasm in
>>>>>>>> some way.
>>>>>>>>
>>>>>>>> https://redpanda.com/
>>>>>>>> https://www.fluvio.io/
>>>>>>>> https://temporal.io/ (Cadence fork by the Cadence folks, I met
>>>>>>>> Maxim the lead at Temporal at the 2020 Wasm Summit)
>>>>>>>> https://github.com/pachyderm/pachyderm no mention of wasm, yet.
>>>>>>>>
>>>>>>>> Keep the Wasm+Beam demos coming.
>>>>>>>>
>>>>>>>> Sean
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jun 16, 2022 at 4:23 AM Steven van Rossum <
>>>>>>>> sjvanros...@google.com> wrote:
>>>>>>>>
>>>>>>>>> I caught up with all the replies through the web interface, but I
>>>>>>>>> didn't have my list subscription set up correctly so my reply (TL;DR:
>>>>>>>>> sample code available at https://github.com/sjvanrossum/beam-wasm)
>>>>>>>>> didn't come through until a bit later yesterday, I think.
>>>>>>>>>
>>>>>>>>> Sean, I agree with your suggestion of Arrow as the interchange
>>>>>>>>> format for Wasm transforms and it's something I thought about
>>>>>>>>> exploring when I was adding serialization/deserialization of complex
>>>>>>>>> (meaning anything that's not an integer or float in the context of
>>>>>>>>> Wasm) data types in the demo. It's an unfortunate bit of overhead
>>>>>>>>> which could very well be solved with Arrow and shared memory between
>>>>>>>>> Wasm modules.
>>>>>>>>> I've seen Wasm transforms pop up in a few other places, notably in
>>>>>>>>> streaming data platforms like Fluvio and Redpanda, and they seem to
>>>>>>>>> incur the same overhead when moving data into and out of the guest
>>>>>>>>> context, so maybe it's negligible, but I haven't done any serious
>>>>>>>>> benchmarking yet to validate that.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Steve
>>>>>>>>>
>>>>>>>>> On Thu, Jun 16, 2022 at 3:04 AM Robert Burke <rob...@frantil.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Obligatory mention that WASM is basically an architecture that
>>>>>>>>>> any well meaning compiler can target, eg the Go compiler
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://www.bradcypert.com/an-introduction-to-targeting-web-assembly-with-golang/
>>>>>>>>>>
>>>>>>>>>> (Among many articles for the last few years)
>>>>>>>>>>
>>>>>>>>>> Robert Burke
>>>>>>>>>> Beam Go Busybody
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 15, 2022, 2:04 PM Sean Jensen-Grey <
>>>>>>>>>> jenseng...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Heh, my stage fright was so strong, I didn't realize that the
>>>>>>>>>>> talk was recorded. :)
>>>>>>>>>>>
>>>>>>>>>>> Steven, I'd love to chat about Wasm in Beam. This email is a bit
>>>>>>>>>>> rough.
>>>>>>>>>>>
>>>>>>>>>>> I haven't explored Wasm in Beam much since that talk. I think
>>>>>>>>>>> the most compelling use is in the portability of logic between data
>>>>>>>>>>> processing systems, especially in the use of probabilistic data
>>>>>>>>>>> structures like Bloom Filters, Count-Min-Sketch, and HyperLogLog,
>>>>>>>>>>> where it is nice to persist the data structure and use it on a
>>>>>>>>>>> different system. For example, generating a Bloom filter in Beam and
>>>>>>>>>>> using it inside of a BQ query w/o having to reimplement and test
>>>>>>>>>>> across many platforms.
>>>>>>>>>>>
>>>>>>>>>>> I have used Wasm in BQ, as BQ UDFs are driven by V8. Anywhere V8
>>>>>>>>>>> exists, Wasm support exists for free unless the embedder goes out of
>>>>>>>>>>> their way to disable it. So it is supported in Deno/Node as well. In
>>>>>>>>>>> Python, Wasm support via Wasmtime
>>>>>>>>>>> <https://github.com/bytecodealliance/wasmtime> is really good.
>>>>>>>>>>> There are *many* options for execution environments. One of the
>>>>>>>>>>> downsides of passing through a JS one is issues with string and
>>>>>>>>>>> number (float/int64) support, afaik. I could be wrong, maybe JS has
>>>>>>>>>>> fixed all this by now.
>>>>>>>>>>>
>>>>>>>>>>> The qualities in order of importance (for me) are
>>>>>>>>>>>
>>>>>>>>>>>    1. Portability, run the same code everywhere
>>>>>>>>>>>    2. Security, memory safety for the caller. Running Wasm
>>>>>>>>>>>    inside of Python should never crash your Python interpreter. The
>>>>>>>>>>>    capability model ensures that the Wasm module can only do what you
>>>>>>>>>>>    allow it to.
>>>>>>>>>>>    3. Performance (portable), compile once and run everywhere
>>>>>>>>>>>    within some margin of native.  Python makes this look good :)
>>>>>>>>>>>
>>>>>>>>>>> I think something worth exploring is moving opaque-ish Arrow
>>>>>>>>>>> objects around via Beam, so that Beam is now mostly in the control
>>>>>>>>>>> plane and computation happens in Wasm. This should reduce the
>>>>>>>>>>> serialization overhead and also get Python out of the datapath.
>>>>>>>>>>>
>>>>>>>>>>> I see someone exploring Wasm+Arrow here,
>>>>>>>>>>> https://github.com/domoritz/arrow-wasm
>>>>>>>>>>>
>>>>>>>>>>> Another possibly interesting avenue to explore is compiling
>>>>>>>>>>> command line programs to WASI (WebAssembly System Interface), the
>>>>>>>>>>> POSIX-like shim, so that they can be run in-process without the
>>>>>>>>>>> fork/exec/pipe overhead of running a subprocess. A neat demo might be
>>>>>>>>>>> running something like Jq <https://stedolan.github.io/jq/> inside of
>>>>>>>>>>> a Beam job.
>>>>>>>>>>>
>>>>>>>>>>> Not to make Wasm sound like a Python-only technology, it can be
>>>>>>>>>>> used from Java/JVM via
>>>>>>>>>>>
>>>>>>>>>>>    - https://www.graalvm.org/22.1/reference-manual/wasm/
>>>>>>>>>>>    - https://github.com/kawamuray/wasmtime-java
>>>>>>>>>>>
>>>>>>>>>>> Sean
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 15, 2022 at 9:35 AM Pablo Estrada <
>>>>>>>>>>> pabl...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> adding Steven in case he didn't get the replies : )
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jun 15, 2022 at 9:29 AM Daniel Collins <
>>>>>>>>>>>> dpcoll...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If we ever do anything with the JS runtime, this would seem to
>>>>>>>>>>>>> be the best place to run WASM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 8:13 PM Brian Hulette <
>>>>>>>>>>>>> bhule...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> FYI: @Sean Jensen-Grey <jenseng...@google.com> gave a talk
>>>>>>>>>>>>>> back in 2020 where he had integrated Rust with the Python SDK. I
>>>>>>>>>>>>>> thought he used WebAssembly for that, but it looks like he used
>>>>>>>>>>>>>> some other approaches, and his talk mentioned WebAssembly as
>>>>>>>>>>>>>> future work. Not sure if that was ever explored.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://www.youtube.com/watch?v=fZK_Tiu7q1o
>>>>>>>>>>>>>> https://github.com/seanjensengrey/beam-rust-python-java
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 5:05 PM Ahmet Altay <al...@google.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Adding @Lukasz Cwik <lc...@google.com> - he was interested
>>>>>>>>>>>>>>> in the WebAssembly topic.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 3:09 PM Pablo Estrada <
>>>>>>>>>>>>>>> pabl...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Would you open a pull request for it? Or at least share a
>>>>>>>>>>>>>>>> branch? : )
>>>>>>>>>>>>>>>> Even if we don't want to merge it, it would be great to
>>>>>>>>>>>>>>>> have a PR as a way to showcase the work, its usefulness, and
>>>>>>>>>>>>>>>> receive comments on this thread once we can see something more
>>>>>>>>>>>>>>>> specific.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 3:05 PM Steven van Rossum <
>>>>>>>>>>>>>>>> sjvanros...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I had some spare time yesterday and thought it'd be fun to
>>>>>>>>>>>>>>>>> implement a transform which runs WebAssembly modules as a
>>>>>>>>>>>>>>>>> lightweight way to implement cross language transforms for
>>>>>>>>>>>>>>>>> languages which don't (yet) have an SDK implementation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've got a small proof of concept running in the Python
>>>>>>>>>>>>>>>>> SDK as a DoFn with Wasmer as the WebAssembly runtime and simple
>>>>>>>>>>>>>>>>> support for marshalling between the host and guest environment
>>>>>>>>>>>>>>>>> with the RowCoder. The module I've constructed is mostly useless,
>>>>>>>>>>>>>>>>> but demonstrates the host copying the encoded element into the
>>>>>>>>>>>>>>>>> guest's memory, the guest copying those bytes elsewhere in its
>>>>>>>>>>>>>>>>> linear memory buffer, the guest calling back to the host with the
>>>>>>>>>>>>>>>>> offset and size and the host copying and decoding from the
>>>>>>>>>>>>>>>>> guest's memory.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Any thoughts/interest? I'm not sure where I was going with
>>>>>>>>>>>>>>>>> this, since it was mostly just a "wouldn't it be cool if..." on a
>>>>>>>>>>>>>>>>> Monday afternoon, but I can see a few use cases for this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Steve
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Steven van Rossum |  Strategic Cloud Engineer |
>>>>>>>>>>>>>>>>> sjvanros...@google.com |  (+31) (0)6 21174069
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
