+1 (non-PMC)

On Wed, May 4, 2022, 3:37 PM Ahmet Altay <[email protected]> wrote:
> Thank you!
>
> On Wed, May 4, 2022 at 4:22 PM Sachin Agarwal <[email protected]> wrote:
>
>> Wow - great work y'all!
>>
>> On Wed, May 4, 2022 at 3:21 PM Robert Bradshaw <[email protected]> wrote:
>>
>>> The entire SDK has now been reviewed and all outstanding issues
>>> addressed. https://github.com/apache/beam/pull/17341 (Big shout out to
>>> Danny McCormick for his tireless work here!) This does not mean the
>>> SDK is done, but it's marked as experimental (and isolated) and IMHO
>>> to a point where we can continue to iterate on the main branch similar
>>> to how we do our other development.
>>>
>>> Any objections or other thoughts on merging?
>>>
>>
> +1 to merging to the main branch.
>
>
>>
>>> On Mon, Feb 7, 2022 at 9:21 AM Robert Bradshaw <[email protected]> wrote:
>>> >
>>> > +1 to separating things out if bundling them together becomes too
>>> > burdensome, though I agree we're not at that point yet (and there is a
>>> > non-trivial amount of overhead in just doing a release--speaking of
>>> > which I encourage everyone to look at and vote on the pending RC).
>>> >
>>> > That being said, the portability API, and the ability to evolve it in
>>> > a backwards compatible way with capabilities and requirements, makes
>>> > it easy to evolve each SDK and Runner independently and not have to
>>> > worry about which subset of the cross product is actually supported.
>>> >
>>> > On Mon, Feb 7, 2022 at 1:44 AM Jan Lukavský <[email protected]> wrote:
>>> > >
>>> > > I'll add one note from a different perspective. I think that long-term we should consider having separate release cycles for core, SDKs, DSLs and runners. It feels releasing all parts as a single "monolith" will gradually cause the core parts (e.g. model, runners-core, ...) to be more and more expensive to modify, because each modification to these core parts, might affect more and more other components.
>>> > > Enabling all SDKs and runners to "choose" the supported SDK-core or runner-core (while encouraging them to support the most recent!) is more maintainable for the future.
>>> > >
>>> > > I'm not saying we need to do something right now before merging the JS SDK, but on the other hand adding like 10 more SDKs would start to be an issue. We probably could talk about if (and how) we could make some sort of separation.
>>> > >
>>> > > Jan
>>> > >
>>> > > On 2/4/22 18:42, Robert Burke wrote:
>>> > >
>>> > > I imagine by the nature of the Apache 2.0 license, the quality of the code in a given release is not a given without some other statement by the maintainers. We should have clear and present warning signs. Erm. Experimental labeling.
>>> > >
>>> > > On Fri, Feb 4, 2022 at 8:27 AM Kenneth Knowles <[email protected]> wrote:
>>> > >>
>>> > >>
>>> > >>
>>> > >> On Thu, Feb 3, 2022 at 3:43 PM Robert Burke <[email protected]> wrote:
>>> > >>>
>>> > >>> Personally, if it gets added to the repo at all I'd rather we rip off the band-aid and at least have all the tests regularly run, and various GitHub actions. Even if we aren't doing the container release activities, because it's experimental, that's much better than bit rot and being part of the main repo has a simpler contribution convention.
>>> > >>
>>> > >>
>>> > >> 100% agree about bit rot and it also makes it more accessible for contribution and experiment. This is a strong motivation for me to get it right onto master. Some contributions to branches are probably just unknown to a lot of contributors (or adventurous users), for example https://github.com/apache/beam/tree/tez-runner https://github.com/apache/beam/tree/jstorm-runner https://github.com/apache/beam/tree/mr-runner
>>> > >>
>>> > >> I'm guessing since node has a distribution via npm if we do nothing it is essentially "not released".
>>> > >> I don't see it as a big problem having it in the archived ASF source releases, as long as licenses and whatnot are good, though I may be overlooking something.
>>> > >>
>>> > >> Kenn
>>> > >>
>>> > >>>
>>> > >>>
>>> > >>> Those are my 2 cents.
>>> > >>> Robert B
>>> > >>> Beam Go Busybody
>>> > >>>
>>> > >>> On Thu, Feb 3, 2022, 3:29 PM Kenneth Knowles <[email protected]> wrote:
>>> > >>>>
>>> > >>>> We did the same for the Go SDK for some time. I imagine just "not doing the work to release it" suffices? Maybe +Robert Burke has some other memories of how to not release.
>>> > >>>>
>>> > >>>> Kenn
>>> > >>>>
>>> > >>>> On Mon, Jan 31, 2022 at 1:05 PM Kerry Donny-Clark <[email protected]> wrote:
>>> > >>>>>
>>> > >>>>> This project was a great way to kickstart a new SDK. I'd like to bring this into Beam and start cleanup. Are there any steps to take before making a PR? Is there a way to mark this as experimental/not for release?
>>> > >>>>> Kerry
>>> > >>>>>
>>> > >>>>> On Mon, Jan 17, 2022 at 1:22 AM Pablo Estrada <[email protected]> wrote:
>>> > >>>>>>
>>> > >>>>>> This project was fun, and I learned a lot putting some time into it. I'd love for it to be brought into the main repository and worked over some time to be fully supported.
>>> > >>>>>> Best
>>> > >>>>>> -P.
>>> > >>>>>>
>>> > >>>>>> On Fri, Jan 14, 2022 at 4:46 PM Ahmet Altay <[email protected]> wrote:
>>> > >>>>>>>
>>> > >>>>>>> Really nice! Congratulations to all who worked on this project.
>>> > >>>>>>>
>>> > >>>>>>> On Fri, Jan 14, 2022 at 4:41 PM Kenneth Knowles <[email protected]> wrote:
>>> > >>>>>>>>
>>> > >>>>>>>> This was super fun, and I really hope it can be an inspiration to others that you can build a working Beam SDK in a week!
>>> > >>>>>>>>
>>> > >>>>>>>> (hint hint https://issues.apache.org/jira/browse/BEAM-4010 and https://issues.apache.org/jira/browse/BEAM-12658 :-)
>>> > >>>>>>>>
>>> > >>>>>>>> On Fri, Jan 14, 2022 at 11:38 AM Robert Bradshaw <[email protected]> wrote:
>>> > >>>>>>>>>
>>> > >>>>>>>>> And, of course, an example:
>>> > >>>>>>>>>
>>> > >>>>>>>>> https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/src/apache_beam/examples/wordcount.ts
>>> > >>>>>>>>>
>>> > >>>>>>>>> On Fri, Jan 14, 2022 at 11:35 AM Robert Bradshaw <[email protected]> wrote:
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > Last week at Google we had a hackathon to kick off the new year, and
>>> > >>>>>>>>> > one of the projects we came up with was seeing how far we could get in
>>> > >>>>>>>>> > putting together a typescript SDK. Starting from nothing we were able
>>> > >>>>>>>>> > to make a lot of progress and I wanted to share the results here.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/README.md
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > I think this is an exciting project and look forward to officially
>>> > >>>>>>>>> > supporting a new language. Clearly there is still a fair amount to do,
>>> > >>>>>>>>> > and we also need to figure out the best way to get this reviewed (we'd
>>> > >>>>>>>>> > especially welcome feedback (and contributions) from those, if any, in
>>> > >>>>>>>>> > the know about javascript/typescript/node even if they're not beam or
>>> > >>>>>>>>> > distributed computing experts) and into the main repository (assuming
>>> > >>>>>>>>> > the community is as interested in this as I am).
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > The above link is a decent overview, but copying below for posterity
>>> > >>>>>>>>> > as that will likely evolve over time (e.g. as decisions get made and
>>> > >>>>>>>>> > TODOs get resolved).
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > - Robert
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > --------------------
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > # Node Beam SDK
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > This is the start of a fully functioning Javascript (actually,
>>> > >>>>>>>>> > Typescript) SDK. There are two distinct aims with this SDK:
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > 1. Tap into the large (and relatively underserved, by existing data
>>> > >>>>>>>>> > processing frameworks) community of javascript developers with a
>>> > >>>>>>>>> > native SDK targeting this language.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > 2. Develop a new SDK which can serve both as a proof of concept and
>>> > >>>>>>>>> > reference that highlights the (relative) ease of porting Beam to new
>>> > >>>>>>>>> > languages, a differentiating feature of Beam and Dataflow.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > To accomplish this, we lean heavily on the portability framework. For
>>> > >>>>>>>>> > example, we make heavy use of cross-language transforms, in particular
>>> > >>>>>>>>> > for IOs (as a full SDF implementation may not fit into the week). In
>>> > >>>>>>>>> > addition, the direct runner is simply an extension of the worker
>>> > >>>>>>>>> > suitable for running on portable runners such as the ULR, which will
>>> > >>>>>>>>> > directly transfer to running on production runners such as Dataflow
>>> > >>>>>>>>> > and Flink. The target audience should hopefully not be put off by
>>> > >>>>>>>>> > running other language code encapsulated in docker images.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > ## API
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > We generally try to apply the concepts from the Beam API in a
>>> > >>>>>>>>> > Typescript idiomatic way, but it should be noted that few of the
>>> > >>>>>>>>> > initial developers have extensive (if any) Javascript/Typescript
>>> > >>>>>>>>> > development experience, so feedback is greatly appreciated.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > In addition, some notable departures are taken from the traditional SDKs:
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > * We take a "relational foundations" approach, where [schema'd
>>> > >>>>>>>>> > data](https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#heading=h.puuotbien1gf)
>>> > >>>>>>>>> > is the primary way to interact with data, and we generally eschew the
>>> > >>>>>>>>> > key-value requiring transforms in favor of a more flexible approach
>>> > >>>>>>>>> > naming fields or expressions. Javascript's native Object is used as
>>> > >>>>>>>>> > the row type.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > * As part of being schema-first we also de-emphasize Coders as a
>>> > >>>>>>>>> > first-class concept in the SDK, relegating it to an advanced feature
>>> > >>>>>>>>> > used for interop. Though we can infer schemas from individual
>>> > >>>>>>>>> > elements, it is still TBD to
>>> > >>>>>>>>> > figure out if/how we can leverage the type system and/or function
>>> > >>>>>>>>> > introspection to regularly infer schemas at construction time. A
>>> > >>>>>>>>> > fallback coder using BSON encoding is used when we don't have
>>> > >>>>>>>>> > sufficient type information.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > * We have added additional methods to the PCollection object, notably
>>> > >>>>>>>>> > `map` and `flatMap`, [rather than only allowing
>>> > >>>>>>>>> > apply](https://www.mail-archive.com/[email protected]/msg06035.html).
>>> > >>>>>>>>> > In addition, `apply` can accept a function argument `(PCollection) =>
>>> > >>>>>>>>> > ...` as well as a PTransform subclass, which treats this callable as
>>> > >>>>>>>>> > if it were a PTransform's expand.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > * In the other direction, we have eliminated the [problematic Pipeline
>>> > >>>>>>>>> > object](https://s.apache.org/no-beam-pipeline) from the API, instead
>>> > >>>>>>>>> > providing a `Root` PValue on which pipelines are built, and invoking
>>> > >>>>>>>>> > run() on a Runner. We offer a less error-prone `Runner.run` which
>>> > >>>>>>>>> > finishes only when the pipeline is completely finished as well as
>>> > >>>>>>>>> > `Runner.runAsync` which returns a handle to the running pipeline.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > * Rather than introduce PCollectionTuple, PCollectionList, etc. we let
>>> > >>>>>>>>> > PValue literally be an [array or object with PValue
>>> > >>>>>>>>> > values](https://github.com/robertwb/beam-javascript/blob/de4390dd767f046903ac23fead5db333290462db/sdks/node-ts/src/apache_beam/pvalue.ts#L116)
>>> > >>>>>>>>> > which transforms can consume or produce. These are applied by wrapping
>>> > >>>>>>>>> > them with the `P` operator, e.g. `P([pc1, pc2, pc3]).apply(new
>>> > >>>>>>>>> > Flatten())`.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > * Like Python, `flatMap` and `ParDo.process` return multiple elements
>>> > >>>>>>>>> > by yielding them from a generator, rather than invoking a passed-in
>>> > >>>>>>>>> > callback. TBD how to output to multiple distinct PCollections. There
>>> > >>>>>>>>> > is currently an operation to split a PCollection into multiple
>>> > >>>>>>>>> > PCollections based on the properties of the elements, and we may
>>> > >>>>>>>>> > consider using a callback for side outputs.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > * The `map`, `flatMap`, and `ParDo.process` methods take an
>>> > >>>>>>>>> > additional (optional) context argument, which is similar to the
>>> > >>>>>>>>> > keyword arguments used in Python. These can be "ordinary" javascript
>>> > >>>>>>>>> > objects (which are passed as is) or special DoFnParam objects which
>>> > >>>>>>>>> > provide getters to element-specific information (such as the current
>>> > >>>>>>>>> > timestamp, window, or side input) at runtime.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > * Javascript supports (and encourages) an asynchronous programming
>>> > >>>>>>>>> > model, with many libraries requiring use of the async/await paradigm.
>>> > >>>>>>>>> > As there is no way (by design) to go from the asynchronous style back
>>> > >>>>>>>>> > to the synchronous style, this needs to be taken into account when
>>> > >>>>>>>>> > designing the API. We currently offer asynchronous variants of
>>> > >>>>>>>>> > `PValue.apply(...)` (in addition to the synchronous ones, as they are
>>> > >>>>>>>>> > easier to chain) as well as making `Runner.run` asynchronous. TBD to
>>> > >>>>>>>>> > do this for all user callbacks as well.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > ## TODO
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > This SDK is a work in progress. In January 2022 we developed the
>>> > >>>>>>>>> > ability to construct and run basic pipelines (including external
>>> > >>>>>>>>> > transforms and running on a portable runner) but the following
>>> > >>>>>>>>> > big-ticket items remain.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > * Containerization
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >   * Function and object serialization: we currently only support
>>> > >>>>>>>>> >     "loopback" mode; to be able to run in a remote, distributed manner we
>>> > >>>>>>>>> >     need to finish up the work in pickling closures and DoFn objects.
>>> > >>>>>>>>> >     Some investigation has been started here, but all existing libraries have
>>> > >>>>>>>>> >     non-trivial drawbacks.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >   * Finish the work in building a full SDK container image that starts
>>> > >>>>>>>>> >     the worker.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > * External transforms
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >   * Using external transforms requires that the external expansion
>>> > >>>>>>>>> >     service already be started and its address provided. We would like to
>>> > >>>>>>>>> >     automatically start it as we do in Python.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >   * Artifacts are not currently supported, which will be essential for
>>> > >>>>>>>>> >     using Java transforms. (All tests use Python.)
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > * API
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >   * Side inputs are not yet supported.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >   * There are several TODOs of minor features or design decisions to finalize.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >   * Advanced features like metrics, state, timers, and SDF. Possibly
>>> > >>>>>>>>> >     some of these can wait.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > * Infrastructure
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >   * Gradle and Jenkins integration for tests and style enforcement.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > * Other
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >   * Standardize on a way for users to pass PTransform names, and
>>> > >>>>>>>>> >     enforce unique names for pipeline update.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >   * Use a Javascript Object rather than proto Struct for pipeline options.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >   * Though Dataflow Runner v2 supports portability, submission is
>>> > >>>>>>>>> >     still done via v1beta3 and interaction with GCS rather than the job
>>> > >>>>>>>>> >     submission API.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >   * Properly wait for bundle completion.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > There is probably more; there are many TODOs littered throughout the code.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > This code has also not yet been fully peer reviewed (it was the result
>>> > >>>>>>>>> > of a hackathon) which needs to be done before putting it into the main
>>> > >>>>>>>>> > repository.
>>> > >>>>>>>>> >
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > ## Development
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > ### Getting started
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > Install node.js, and then from within `sdks/node-ts` run:
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > ```
>>> > >>>>>>>>> > npm install
>>> > >>>>>>>>> > ```
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > ### Running tests
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > ```
>>> > >>>>>>>>> > $ npm test
>>> > >>>>>>>>> > ```
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > ### Style
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > We have adopted prettier which can be run with
>>> > >>>>>>>>> >
>>> > >>>>>>>>> > ```
>>> > >>>>>>>>> > $ npx prettier --write .
>>> > >>>>>>>>> > ```
>>>
>>
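
For anyone skimming the quoted README, here is a rough sketch of two of the API ideas it describes: `flatMap` taking a generator that yields its outputs, and `apply` accepting a plain function that is treated like a transform's expand. This is not the SDK's code. `ToyCollection`, `splitWords`, and `countWords` are hypothetical names made up for illustration, and the "collection" is just an in-memory array with no runner, coders, or schemas involved.

```typescript
// Hypothetical stand-in for a PCollection: a plain in-memory array.
// Nothing here comes from the Beam SDK itself.
class ToyCollection<T> {
  constructor(readonly elements: T[]) {}

  // Exactly one output element per input element.
  map<O>(fn: (element: T) => O): ToyCollection<O> {
    return new ToyCollection(this.elements.map(fn));
  }

  // fn yields zero or more outputs per input (typically a generator),
  // instead of pushing results through a passed-in callback.
  flatMap<O>(fn: (element: T) => Iterable<O>): ToyCollection<O> {
    const out: O[] = [];
    for (const element of this.elements) {
      for (const result of fn(element)) {
        out.push(result);
      }
    }
    return new ToyCollection(out);
  }

  // Accepts a plain function and calls it with this collection,
  // the way a transform's expand() would be called.
  apply<R>(transform: (input: ToyCollection<T>) => R): R {
    return transform(this);
  }
}

// Generator: each input line may yield any number of words.
function* splitWords(line: string): Generator<string> {
  for (const word of line.split(/\s+/)) {
    if (word.length > 0) {
      yield word.toLowerCase();
    }
  }
}

// A "composite transform" written as an ordinary function.
function countWords(lines: ToyCollection<string>): Map<string, number> {
  const counts = new Map<string, number>();
  for (const word of lines.flatMap(splitWords).elements) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

const wordCounts = new ToyCollection(["To be or", "not to be"]).apply(countWords);
```

The generator form keeps the user function free of any SDK-provided emitter object, which is the trade-off the README points at when it mentions possibly needing a callback for side outputs.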
