I also agree we need a better way to develop SDKs that doesn't necessarily lean on a production runner, or require someone who's authored an SDK before to lead the project :p.
That's one of the outcomes I'm iterating on for a replacement for the Go direct runner, since I think we can make SDKs much easier to develop. But that work is still in progress as I sort out the details of my proposal and prototype. Shouldn't be too much longer though.

On Wed, May 4, 2022, 5:42 PM Austin Bennett <[email protected]> wrote:

> +1 -- had started playing with this a couple weeks ago, is really shaping
> up!
>
>
> Some questions about docs, and making [ developing any language ] more
> approachable -->
>
> I wonder whether we have learned enough from this for a guide of sorts for
> future language development. Perhaps, since fresh, would be a good time to
> ensure things are noted.
>
> I'm thinking some helpful docs to include somewhere ( if they don't
> already exist ), specifically aimed at making it more approachable for
> someone to consider starting to work on another language ( ex: I'm thinking
> of dart, since I've been writing a lot of that lately :-), though there are
> plenty of candidate languages )
> * What were the tricky points ( language specific? or model specific? )?
> * What would be needed to be considered a real MVP? Ex: could be
> considered suggestions - rather than requirements. Ex: suggested to start
> with XXX, and YYY can be more tricky [ at least in ZZZ context ] so
> potentially save that for later.
> * I'd also like to get a sense of the minimum level of features needed for
> something to be accepted into main? Ex: this one is easier since it was
> led by a bunch of well-known-and-core-community members. But, somewhat
> outlining a process would potentially be helpful for people to see there is
> a route to making something happen.
> * etc...
>
>
>
> Also, at what point do we think things marked @Experimental should get on
> the website?
I'm thinking about getting on the sdks/language page -- > https://beam.apache.org/documentation/sdks/python/ Naturally, is a > function of when someone is willing to do the work, but I also don't know > whether we'd overly want to highlight something that is still > rather-early/experimental on the general website. > > > > > > > > > > > On Wed, May 4, 2022 at 3:43 PM Robert Burke <[email protected]> wrote: > >> +1 (non-PMC) >> >> On Wed, May 4, 2022, 3:37 PM Ahmet Altay <[email protected]> wrote: >> >>> Thank you! >>> >>> On Wed, May 4, 2022 at 4:22 PM Sachin Agarwal <[email protected]> >>> wrote: >>> >>>> Wow - great work y'all! >>>> >>>> On Wed, May 4, 2022 at 3:21 PM Robert Bradshaw <[email protected]> >>>> wrote: >>>> >>>>> The entire SDK has now been reviewed and all outstanding issues >>>>> addressed. https://github.com/apache/beam/pull/17341 (Big shout out to >>>>> Danny McCormick for his tireless work here!) This does not mean the >>>>> SDK is done, but it's marked as experimental (and isolated) and IMHO >>>>> to a point where we can continue to iterate on the main branch similar >>>>> to how we do our other development. >>>>> >>>>> Any objections or other thoughts on merging? >>>>> >>>> >>> +1 to merging to the main branch. >>> >>> >>>> >>>>> On Mon, Feb 7, 2022 at 9:21 AM Robert Bradshaw <[email protected]> >>>>> wrote: >>>>> > >>>>> > +1 to separating things out if bundling them together becomes too >>>>> > burdensome, though I agree we're not at that point yet (and there is >>>>> a >>>>> > non-trivial amount of overhead in just doing a release--speaking of >>>>> > which I encourage everyone to look at and vote on the pending RC). >>>>> > >>>>> > That being said, the portability API, and the ability to evolve it in >>>>> > a backwards compatible way with capabilities and requirements, makes >>>>> > it easy to evolve each SDK and Runner independently and not have to >>>>> > worry about which subset of the cross product is actually supported. 
>>>>> > >>>>> > On Mon, Feb 7, 2022 at 1:44 AM Jan Lukavský <[email protected]> wrote: >>>>> > > >>>>> > > I'll add one note from a different perspective. I think that >>>>> long-term we should consider having separate release cycles for core, >>>>> SDKs, >>>>> DSLs and runners. It feels releasing all parts as a single "monolith" will >>>>> gradually cause the core parts (e.g. model, runners-core, ...) to be more >>>>> and more expensive to modify, because each modification to these core >>>>> parts, might affect more and more other components. Enabling all SDKs and >>>>> runners to "choose" the supported SDK-core or runner-core (while >>>>> encouraging them to support the most recent!) is more maintainable for the >>>>> future. >>>>> > > >>>>> > > I'm not saying we need to do something right now before merging >>>>> the JS SDK, but on the other hand adding like 10 more SDKs would start to >>>>> be an issue. We probably could talk about if (and how) we could make some >>>>> sort of separation. >>>>> > > >>>>> > > Jan >>>>> > > >>>>> > > On 2/4/22 18:42, Robert Burke wrote: >>>>> > > >>>>> > > I imagine by the nature of the Apache 2.0 license, the quality of >>>>> the code in a given release is not a given without some other statement by >>>>> the maintainers. We should clear and present warning signs. Erm. >>>>> Experimental labeling. >>>>> > > >>>>> > > On Fri, Feb 4, 2022 at 8:27 AM Kenneth Knowles <[email protected]> >>>>> wrote: >>>>> > >> >>>>> > >> >>>>> > >> >>>>> > >> On Thu, Feb 3, 2022 at 3:43 PM Robert Burke <[email protected]> >>>>> wrote: >>>>> > >>> >>>>> > >>> Personally, if it gets added to the repo at all I'd rather we >>>>> rip off the band-aid and at least have all the tests regularly run, and >>>>> various GitHub actions. Even if we aren't doing the container release >>>>> activities, because it's experimental, that's much better than bit rot and >>>>> being part of the main repo has a simpler contribution convention. 
>>>>> > >> >>>>> > >> >>>>> > >> 100% agree about bit rot and it also makes it more accessible for >>>>> contribution and experiment. This is a strong motivation for me to get it >>>>> right onto master. Some contributions to branches are probably just >>>>> unknown >>>>> to a lot of contributors (or adventurous users), for example >>>>> https://github.com/apache/beam/tree/tez-runner >>>>> https://github.com/apache/beam/tree/jstorm-runner >>>>> https://github.com/apache/beam/tree/mr-runner >>>>> > >> >>>>> > >> I'm guessing since node has a distribution via npm if we do >>>>> nothing it is essentially "not released". I don't see it as a big problem >>>>> having it in the archived ASF source releases, as long as licenses and >>>>> whatnot are good, though I may be overlooking something. >>>>> > >> >>>>> > >> Kenn >>>>> > >> >>>>> > >>> >>>>> > >>> >>>>> > >>> Those are my 2 cents. >>>>> > >>> Robert B >>>>> > >>> Beam Go Busybody >>>>> > >>> >>>>> > >>> On Thu, Feb 3, 2022, 3:29 PM Kenneth Knowles <[email protected]> >>>>> wrote: >>>>> > >>>> >>>>> > >>>> We did the same for the Go SDK for some time. I imagine just >>>>> "not doing the work to release it" suffices? Maybe +Robert Burke has some >>>>> other memories of how to not release. >>>>> > >>>> >>>>> > >>>> Kenn >>>>> > >>>> >>>>> > >>>> On Mon, Jan 31, 2022 at 1:05 PM Kerry Donny-Clark < >>>>> [email protected]> wrote: >>>>> > >>>>> >>>>> > >>>>> This project was a great way to kickstart a new SDK. I'd like >>>>> to bring this into Beam and start cleanup. Are there any steps to take >>>>> before making a PR? Is there a way to mark this as experimental/not for >>>>> release? >>>>> > >>>>> Kerry >>>>> > >>>>> >>>>> > >>>>> On Mon, Jan 17, 2022 at 1:22 AM Pablo Estrada < >>>>> [email protected]> wrote: >>>>> > >>>>>> >>>>> > >>>>>> This project was fun, and I learned a lot putting some time >>>>> into it. I'd love for it to be brought into the main repository and worked >>>>> over some time to be fully supported. 
>>>>> > >>>>>> Best >>>>> > >>>>>> -P. >>>>> > >>>>>> >>>>> > >>>>>> On Fri, Jan 14, 2022 at 4:46 PM Ahmet Altay <[email protected]> >>>>> wrote: >>>>> > >>>>>>> >>>>> > >>>>>>> Really nice! Congratulations to all who worked on this >>>>> project. >>>>> > >>>>>>> >>>>> > >>>>>>> On Fri, Jan 14, 2022 at 4:41 PM Kenneth Knowles < >>>>> [email protected]> wrote: >>>>> > >>>>>>>> >>>>> > >>>>>>>> This was super fun, and I really hope it can be an >>>>> inspiration to others that you can build a working Beam SDK in a week! >>>>> > >>>>>>>> >>>>> > >>>>>>>> (hint hint https://issues.apache.org/jira/browse/BEAM-4010 >>>>> and https://issues.apache.org/jira/browse/BEAM-12658 :-) >>>>> > >>>>>>>> >>>>> > >>>>>>>> On Fri, Jan 14, 2022 at 11:38 AM Robert Bradshaw < >>>>> [email protected]> wrote: >>>>> > >>>>>>>>> >>>>> > >>>>>>>>> And, of course, an example: >>>>> > >>>>>>>>> >>>>> > >>>>>>>>> >>>>> https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/src/apache_beam/examples/wordcount.ts >>>>> > >>>>>>>>> >>>>> > >>>>>>>>> On Fri, Jan 14, 2022 at 11:35 AM Robert Bradshaw < >>>>> [email protected]> wrote: >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > Last week at Google we had a hackathon to kick off the >>>>> new year, and >>>>> > >>>>>>>>> > one of the projects we came up with was seeing how far >>>>> we could get in >>>>> > >>>>>>>>> > putting together a typescript SDK. Starting from nothing >>>>> we were able >>>>> > >>>>>>>>> > to make a lot of progress and I wanted to share the >>>>> results here. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > >>>>> https://github.com/robertwb/beam-javascript/blob/javascript/sdks/node-ts/README.md >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > I think this is an exciting project and look forward to >>>>> officially >>>>> > >>>>>>>>> > supporting a new language. 
Clearly there is still a fair >>>>> amount to do, >>>>> > >>>>>>>>> > and we also need to figure out the best way to get this >>>>> reviewed (we'd >>>>> > >>>>>>>>> > especially welcome feedback (and contributions) from >>>>> those, if any, in >>>>> > >>>>>>>>> > the know about javascript/typescript/node even if >>>>> they're not beam or >>>>> > >>>>>>>>> > distributed computing experts) and into the main >>>>> repository (assuming >>>>> > >>>>>>>>> > the community is as interested in this as I am). >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > The above link is a decent overview, but copying below >>>>> for posterity >>>>> > >>>>>>>>> > as that will likely evolve over time (e.g. as decisions >>>>> get made and >>>>> > >>>>>>>>> > TODOs get resolved). >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > - Robert >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > -------------------- >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > # Node Beam SDK >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > This is the start of a fully functioning Javascript >>>>> (actually, >>>>> > >>>>>>>>> > Typescript) SDK. There are two distinct aims with this >>>>> SDK >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > 1. Tap into the large (and relatively underserved, by >>>>> existing data >>>>> > >>>>>>>>> > processing frameworks) community of javascript >>>>> developers with a >>>>> > >>>>>>>>> > native SDK targeting this language. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > 1. Develop a new SDK which can serve both as a proof of >>>>> concept and >>>>> > >>>>>>>>> > reference that highlights the (relative) ease of porting >>>>> Beam to new >>>>> > >>>>>>>>> > languages, a differentiating feature of Beam and >>>>> Dataflow. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > To accomplish this, we lean heavily on the portability >>>>> framework. For >>>>> > >>>>>>>>> > example, we make heavy use of cross-language transforms, >>>>> in particular >>>>> > >>>>>>>>> > for IOs (as a full SDF implementation may not fit into >>>>> the week). 
In >>>>> > >>>>>>>>> > addition, the direct runner is simply an extension of >>>>> the worker >>>>> > >>>>>>>>> > suitable for running on portable runners such as the >>>>> ULR, which will >>>>> > >>>>>>>>> > directly transfer to running on production runners such >>>>> as Dataflow >>>>> > >>>>>>>>> > and Flink. The target audience should hopefully not be >>>>> put off by >>>>> > >>>>>>>>> > running other language code encapsulated in docker >>>>> images. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > ## API >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > We generally try to apply the concepts from the Beam API >>>>> in a >>>>> > >>>>>>>>> > Typescript idiomatic way, but it should be noted that >>>>> few of the >>>>> > >>>>>>>>> > initial developers have extensive (if any) >>>>> Javascript/Typescript >>>>> > >>>>>>>>> > development experience, so feedback is greatly >>>>> appreciated. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > In addition, some notable departures are taken from the >>>>> traditional SDKs: >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * We take a "relational foundations" approach, where >>>>> [schema'd >>>>> > >>>>>>>>> > data]( >>>>> https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#heading=h.puuotbien1gf >>>>> ) >>>>> > >>>>>>>>> > is the primary way to interact with data, and we >>>>> generally eschew the >>>>> > >>>>>>>>> > key-value requiring transforms in favor of a more >>>>> flexible approach >>>>> > >>>>>>>>> > naming fields or expressions. Javascript's native Object >>>>> is used as >>>>> > >>>>>>>>> > the row type. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * As part of being schema-first we also de-emphasize >>>>> Coders as a >>>>> > >>>>>>>>> > first-class concept in the SDK, relegating it to an >>>>> advanced feature >>>>> > >>>>>>>>> > used for interop.
Though we can infer schemas from >>>>> individual >>>>> > >>>>>>>>> > elements, it is still TBD to >>>>> > >>>>>>>>> > figure out if/how we can leverage the type system and/or >>>>> function >>>>> > >>>>>>>>> > introspection to regularly infer schemas at construction >>>>> time. A >>>>> > >>>>>>>>> > fallback coder using BSON encoding is used when we don't >>>>> have >>>>> > >>>>>>>>> > sufficient type information. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * We have added additional methods to the PCollection >>>>> object, notably >>>>> > >>>>>>>>> > `map` and `flatMap`, [rather than only allowing >>>>> > >>>>>>>>> > apply]( >>>>> https://www.mail-archive.com/[email protected]/msg06035.html). >>>>> > >>>>>>>>> > In addition, `apply` can accept a function argument >>>>> `(PCollection) => >>>>> > >>>>>>>>> > ...` as well as a PTransform subclass, which treats this >>>>> callable as >>>>> > >>>>>>>>> > if it were a PTransform's expand. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * In the other direction, we have eliminated the >>>>> [problematic Pipeline >>>>> > >>>>>>>>> > object](https://s.apache.org/no-beam-pipeline) from the >>>>> API, instead >>>>> > >>>>>>>>> > providing a `Root` PValue on which pipelines are built, >>>>> and invoking >>>>> > >>>>>>>>> > run() on a Runner. We offer a less error-prone >>>>> `Runner.run` which >>>>> > >>>>>>>>> > finishes only when the pipeline is completely finished >>>>> as well as >>>>> > >>>>>>>>> > `Runner.runAsync` which returns a handle to the running >>>>> pipeline. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Rather than introduce PCollectionTuple, >>>>> PCollectionList, etc. we let >>>>> > >>>>>>>>> > PValue literally be an [array or object with PValue >>>>> > >>>>>>>>> > values]( >>>>> https://github.com/robertwb/beam-javascript/blob/de4390dd767f046903ac23fead5db333290462db/sdks/node-ts/src/apache_beam/pvalue.ts#L116 >>>>> ) >>>>> > >>>>>>>>> > which transforms can consume or produce.
These are >>>>> applied by wrapping >>>>> > >>>>>>>>> > them with the `P` operator, e.g. `P([pc1, pc2, >>>>> pc3]).apply(new >>>>> > >>>>>>>>> > Flatten())`. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Like Python, `flatMap` and `ParDo.process` return >>>>> multiple elements >>>>> > >>>>>>>>> > by yielding them from a generator, rather than invoking >>>>> a passed-in >>>>> > >>>>>>>>> > callback. TBD how to output to multiple distinct >>>>> PCollections. There >>>>> > >>>>>>>>> > is currently an operation to split a PCollection into >>>>> multiple >>>>> > >>>>>>>>> > PCollections based on the properties of the elements, >>>>> and we may >>>>> > >>>>>>>>> > consider using a callback for side outputs. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * The `map`, `flatMap`, and `ParDo.process` methods >>>>> take an >>>>> > >>>>>>>>> > additional (optional) context argument, which is similar >>>>> to the >>>>> > >>>>>>>>> > keyword arguments used in Python. These can be >>>>> "ordinary" javascript >>>>> > >>>>>>>>> > objects (which are passed as is) or special DoFnParam >>>>> objects which >>>>> > >>>>>>>>> > provide getters to element-specific information (such as >>>>> the current >>>>> > >>>>>>>>> > timestamp, window, or side input) at runtime. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Javascript supports (and encourages) an asynchronous >>>>> programming >>>>> > >>>>>>>>> > model, with many libraries requiring use of the >>>>> async/await paradigm. >>>>> > >>>>>>>>> > As there is no way (by design) to go from the >>>>> asynchronous style back >>>>> > >>>>>>>>> > to the synchronous style, this needs to be taken into >>>>> account when >>>>> > >>>>>>>>> > designing the API. We currently offer asynchronous >>>>> variants of >>>>> > >>>>>>>>> > `PValue.apply(...)` (in addition to the synchronous >>>>> ones, as they are >>>>> > >>>>>>>>> > easier to chain) as well as making `Runner.run` >>>>> asynchronous. TBD to >>>>> > >>>>>>>>> > do this for all user callbacks as well.
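A small illustration of two of the API points above: `apply` accepting either a plain function or a PTransform-like object, and `flatMap` taking a generator that yields its outputs rather than calling a callback. This is only an in-memory sketch. `ToyPTransform`, `ToyPCollection`, `splitWords`, and `Lowercase` are hypothetical names made up for the example; they are not the SDK's actual classes or signatures.

```typescript
// Toy, in-memory stand-ins (hypothetical; NOT the SDK's real classes).
abstract class ToyPTransform<T, O> {
  abstract expand(input: ToyPCollection<T>): O;
}

class ToyPCollection<T> {
  constructor(readonly elements: T[]) {}

  // Accepts a transform object, or a bare function that is treated
  // as if it were that transform's expand().
  apply<O>(
    transform: ToyPTransform<T, O> | ((pc: ToyPCollection<T>) => O)
  ): O {
    return typeof transform === "function"
      ? transform(this)
      : transform.expand(this);
  }

  map<O>(fn: (element: T) => O): ToyPCollection<O> {
    return new ToyPCollection(this.elements.map(fn));
  }

  // fn yields zero or more outputs per input element.
  flatMap<O>(fn: (element: T) => Iterable<O>): ToyPCollection<O> {
    const out: O[] = [];
    for (const element of this.elements) {
      for (const produced of fn(element)) {
        out.push(produced);
      }
    }
    return new ToyPCollection(out);
  }
}

// A generator: each non-empty word is yielded as its own output element.
function* splitWords(line: string): IterableIterator<string> {
  for (const word of line.split(" ")) {
    if (word.length > 0) {
      yield word;
    }
  }
}

// A transform written as a class with expand().
class Lowercase extends ToyPTransform<string, ToyPCollection<string>> {
  expand(input: ToyPCollection<string>): ToyPCollection<string> {
    return input.map((s) => s.toLowerCase());
  }
}

const lines = new ToyPCollection(["To be or", "not to BE"]);
// Class-based transform passed to apply.
const words = lines.flatMap(splitWords).apply(new Lowercase());
// Plain function passed to apply, treated like an expand().
const lengths = words.apply((pc: ToyPCollection<string>) =>
  pc.map((w) => w.length)
);
console.log(words.elements, lengths.elements);
```

The real SDK builds a pipeline graph rather than evaluating eagerly, so the only thing this is meant to show is the calling convention, not the execution model.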
>>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > ## TODO >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > This SDK is a work in progress. In January 2022 we >>>>> developed the >>>>> > >>>>>>>>> > ability to construct and run basic pipelines (including >>>>> external >>>>> > >>>>>>>>> > transforms and running on a portable runner) but the >>>>> following >>>>> > >>>>>>>>> > big-ticket items remain. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Containerization >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Function and object serialization: we currently only >>>>> support >>>>> > >>>>>>>>> > "loopback" mode; to be able to run in a remote, >>>>> distributed manner we >>>>> > >>>>>>>>> > need to finish up the work in pickling closures and DoFn >>>>> objects. Some >>>>> > >>>>>>>>> > investigation has been started here, but all existing >>>>> libraries have >>>>> > >>>>>>>>> > non-trivial drawbacks. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Finish the work in building a full SDK container >>>>> image that starts >>>>> > >>>>>>>>> > the worker. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * External transforms >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Using external transforms requires that the external >>>>> expansion >>>>> > >>>>>>>>> > service already be started and its address provided. We >>>>> would like to >>>>> > >>>>>>>>> > automatically start it as we do in Python. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Artifacts are not currently supported, which will be >>>>> essential for >>>>> > >>>>>>>>> > using Java transforms. (All tests use Python.) >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * API >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Side inputs are not yet supported. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * There are several TODOs of minor features or design >>>>> decisions to finalize. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Advanced features like metrics, state, timers, and >>>>> SDF. Possibly >>>>> > >>>>>>>>> > some of these can wait.
>>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Infrastructure >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Gradle and Jenkins integration for tests and style >>>>> enforcement. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Other >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Standardize on a way for users to pass PTransform >>>>> names, and >>>>> > >>>>>>>>> > enforce unique names for pipeline update. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Use a Javascript Object rather than proto Struct for >>>>> pipeline options. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Though Dataflow Runner v2 supports portability, >>>>> submission is >>>>> > >>>>>>>>> > still done via v1beta3 and interaction with GCS rather >>>>> than the job >>>>> > >>>>>>>>> > submission API. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > * Properly wait for bundle completion. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > There is probably more; there are many TODOs littered >>>>> throughout the code. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > This code has also not yet been fully peer reviewed (it >>>>> was the result >>>>> > >>>>>>>>> > of a hackathon) which needs to be done before putting it >>>>> into the main >>>>> > >>>>>>>>> > repository. >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > ## Development >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > ### Getting started >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > Install node.js, and then, from within `sdks/node-ts`, run >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > ``` >>>>> > >>>>>>>>> > npm install >>>>> > >>>>>>>>> > ``` >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > ### Running tests >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > ``` >>>>> > >>>>>>>>> > $ npm test >>>>> > >>>>>>>>> > ``` >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > ### Style >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > We have adopted prettier which can be run with >>>>> > >>>>>>>>> > >>>>> > >>>>>>>>> > ``` >>>>> > >>>>>>>>> > # npx prettier --write . >>>>> > >>>>>>>>> > ``` >>>>> >>>>
