Let me share some of my thoughts on this.

>> - shall we filter out the package name from the release?

Until the new runner is ready to be used in production (or, at least, to be used for beta testing, in which case users should be clearly warned about that), I believe we need to filter out its classes from the published jar to avoid confusion.

>> - should we release 2 jars: one for the old and one for the new?
>> - should we release 3 jars: one for the old, one for the new and one for both?

Once the new runner is released, I think we need to provide only one single jar and allow users to switch between the different Spark runners with a CLI option.

>> - should we create a special entry in the capability matrix?
Sure, since it has its own unique characteristics and implementation, but again, only once the new runner is "officially released".

> On 28 Oct 2019, at 10:27, Etienne Chauchot <[email protected]> wrote:
>
> Hi guys,
>
> Any opinions on point 2, the communication to users?
>
> Etienne
>
> On 24/10/2019 15:44, Etienne Chauchot wrote:
>> Hi guys,
>>
>> I'm glad to announce that the PR for the merge to master of the new runner based on the Spark Structured Streaming framework is submitted:
>>
>> https://github.com/apache/beam/pull/9866
>>
>> 1. Regarding the status of the runner:
>>
>> - The runner passes 93% of the validates runner tests in batch mode.
>> - Streaming mode is barely started (waiting for the multi-aggregations support in the Spark Structured Streaming framework from the Spark community).
>> - The runner can execute Nexmark.
>> - Some things are not wired up yet:
>>   - Beam schemas are not wired with Spark schemas.
>>   - Optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>>
>> 2. Regarding the communication to users:
>>
>> - For the reasons explained by Ismaël, the runner is in the same module as the "older" one, but it is in a different sub-package and both runners share the same build.
>> - How should we communicate to users:
>>   - shall we filter out the package name from the release?
>>   - should we release 2 jars: one for the old and one for the new?
>>   - should we release 3 jars: one for the old, one for the new and one for both?
>>   - should we create a special entry in the capability matrix?
>>
>> WDYT?
>>
>> Best
>> Etienne
>>
>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>>> +1 to merge.
>>>
>>> It is worth keeping things in master with an explicitly marked status. It will make the effort more visible to users and easier to get feedback upon.
>>>
>>> --Mikhail
>>>
>>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <[email protected]> wrote:
>>> Hi guys,
>>>
>>> The new Spark runner now supports Beam coders and passes 93% of the batch validates runner tests (+4%). I think it is time to merge it to master. I will submit a PR in the coming days.
>>>
>>> Next steps: support schemas and thus better leverage the Catalyst optimizer (among other things, optimizations based on data), and port the performance optimizations that were done in the current runner.
>>>
>>> Best
>>> Etienne
>>>
>>> On 11/10/2019 22:48, Pablo Estrada wrote:
>>>> +1 for merging :)
>>>>
>>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <[email protected]> wrote:
>>>> Sounds like a good plan to me.
>>>>
>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <[email protected]> wrote:
>>>> Comments inline
>>>>
>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>> +1
>>>>>
>>>>> The earlier we get to master the better, to encourage not only code contributions but, just as important, to get early user feedback.
>>>>>
>>>>>> Question is: do we keep the "old" Spark runner for a while or not (or just keep the previous version/tag on git)?
>>>>>
>>>>> It is still too early to even start discussing when to remove the classical runner, given that the new runner is still a WIP. However, the overall goal is that this runner becomes the de-facto one once the VR tests and the performance become at least equal to the classical runner's. In the meantime, the best for users is that they co-exist; let's not forget that the other runner has already been battle tested for more than 3 years and has had lots of improvements in the last year.
>>>>
>>>> +1 on what Ismaël says: no removal soon.
>>>> The plan I had in mind at first (that I showed at ApacheCon) was this, but I'm proposing moving the first gray label to before the red box.
>>>> [image attachment: beogijnhpieapoll.png]
>>>>
>>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years' worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would have normally been squashed away in the normal review process we do for the main branch, then, yes, some cleanup could be nice.)
>>>>>
>>>>> About the commits: we should encourage a clear history, but we also have to remove useless commits that are still present in the branch, commits of the "Fix errorprone" / "Cleaning" kind; even commits that make better narrative sense together should probably be squashed, because they do not bring much to the history. It is not about more or fewer commits, it is about their relevance, as Robert mentions.
>>>>>
>>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>
>>>>> In its current form we cannot exclude it, but this relates to the other question, so better to explain a bit of history: the new runner used to live in its own module and subdirectory because it is a full blank-page rewrite, and the decision was not to use any of the classical runner's classes, so as not to be constrained by its evolution.
>>>>>
>>>>> However, the reason to put it back in the same module, as a subdirectory, was to encourage early use. In more detail: the way you deploy Spark jobs today is usually by packaging and staging an uber jar (~200MB of pure dependency joy) that contains the user pipeline classes, the Spark runner module and its dependencies.
>>>>> If we have two Spark runners in separate modules, the user would need to repackage and redeploy their pipelines every time they want to switch from the classical Spark runner to the Structured Streaming runner, which is painful and time- and space-consuming compared with the one-module approach, where they just change the name of the runner class and that's it. The idea here is to make it easy for users to test the new runner, but at the same time to make it easy to come back to the classical runner in case of any issue.
>>>>>
>>>>> Ismaël
>>>>>
>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <[email protected]> wrote:
>>>>>> +1
>>>>>>
>>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>>
>>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>>
>>>>>> I did not see any major changes to dependencies, but I will also ask if it has major version differences, such that you might want a separate artifact?
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <[email protected]> wrote:
>>>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <[email protected]> wrote:
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> You probably know that there has been, for several months, work developing a new Spark runner based on the Spark Structured Streaming framework.
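The one-module switch Ismaël describes above can be illustrated with a hypothetical deployment. In the sketch below, the jar name and main class are placeholders, and the option names assume Beam's `--runner` convention for selecting a runner at launch time:

```shell
# Same staged uber jar for both runs; only the pipeline option changes.
spark-submit --class com.example.MyBeamPipeline target/pipeline-bundled.jar \
  --runner=SparkRunner                      # classical RDD-based runner

# Trying the new runner: no repackaging, no redeploying.
spark-submit --class com.example.MyBeamPipeline target/pipeline-bundled.jar \
  --runner=SparkStructuredStreamingRunner
```

If anything goes wrong with the new runner, switching back is the same one-option change.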
>>>>>>>> This work is located in a feature branch here:
>>>>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>>>>
>>>>>>>> To attract more contributors and get some user feedback, we think it is time to merge it to master. Before doing so, some steps need to be achieved:
>>>>>>>>
>>>>>>>> - finish the work on Spark Encoders (which allow calling Beam coders), because right now the runner is in an unstable state (some transforms use the new way of doing ser/de and some use the old one, making a pipeline incoherent with regard to serialization)
>>>>>>>>
>>>>>>>> - clean history: the history contains commits since November 2018, so there is a good amount of work, and thus a consequent number of commits. They were already squashed, but not since September 2019.
>>>>>>>
>>>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years' worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would have normally been squashed away in the normal review process we do for the main branch, then, yes, some cleanup could be nice.)
>>>>>>>
>>>>>>>> Regarding status:
>>>>>>>>
>>>>>>>> - the runner passes 89% of the validates runner tests in batch mode. We hope to pass more with the new Encoders.
>>>>>>>> - Streaming mode is barely started (waiting for the multi-aggregations support in the Spark Structured Streaming framework from the Spark community).
>>>>>>>> - The runner can execute Nexmark.
>>>>>>>> - Some things are not wired up yet:
>>>>>>>>   - Beam schemas are not wired with Spark schemas.
>>>>>>>>   - Optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>>>>>>>>
>>>>>>>> WDYT, can we merge it to master once the 2 steps are done?
>>>>>>> I think that as long as it sits parallel to the existing runner, and is clearly marked with its status, it makes sense to me. How many changes does it make to the existing codebase (as opposed to adding new code)?
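On the suggestion earlier in the thread to filter the new runner's classes out of the published jar until it is production-ready: since the new runner lives in its own sub-package of the shared module, a build-time exclude would be one way to do it. A rough Gradle sketch, assuming the runner's package is `org.apache.beam.runners.spark.structuredstreaming` (the actual path may differ):

```groovy
// Hypothetical fragment: drop the experimental runner's classes from the
// published jar so users cannot depend on it by accident before it is ready.
jar {
    exclude 'org/apache/beam/runners/spark/structuredstreaming/**'
}
```

Removing the exclude later would make the runner visible again without any module or artifact reshuffling.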
