Oh, I mean that we ship just 2 jars. And since Spark users always build an uber jar, they can depend on both of ours and still switch runners with a flag.
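(For concreteness, a minimal sketch of the switch-with-a-flag idea from the user side. It assumes the new runner is exposed as SparkStructuredStreamingRunner next to the classical SparkRunner in the same artifact; the class and flag names here illustrate the mechanism rather than a confirmed API.)

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerToggle {
      public static void main(String[] args) {
        // Passing e.g. --runner=SparkRunner or
        // --runner=SparkStructuredStreamingRunner on the command line picks
        // the runner at submission time; the uber jar containing both
        // runners stays unchanged.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

        Pipeline pipeline = Pipeline.create(options);
        // ... build the same pipeline regardless of which runner was chosen ...
        pipeline.run().waitUntilFinish();
      }
    }

The point of the one-module approach is exactly that this code, and the jar built from it, never changes when toggling runners.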
I really dislike projects shipping overlapping jars. It is confusing and causes major diamond dependency problems.

Kenn

On Wed, Oct 30, 2019 at 11:12 AM Alexey Romanenko <[email protected]> wrote:

> Yes, agree, two jars included in an uber jar will work in a similar way. Though having 3 jars still looks quite confusing to me.
>
> On 29 Oct 2019, at 23:54, Kenneth Knowles <[email protected]> wrote:
>
> Is it just as easy to have two jars and build an uber jar with both included? Then the runner can still be toggled with a flag.
>
> Kenn
>
> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <[email protected]> wrote:
>
>> Hmm, I don't think that jar size should play a big role compared to the whole size of the shaded jar of a user's job. Even more, I think it will be quite confusing for users to choose which jar to use if we have 3 different ones for similar purposes. Though, let's see what others think.
>>
>> On 29 Oct 2019, at 15:32, Etienne Chauchot <[email protected]> wrote:
>>
>> Hi Alexey,
>>
>> Thanks for your opinion!
>>
>> Comments inline
>>
>> Etienne
>>
>> On 28/10/2019 17:34, Alexey Romanenko wrote:
>>
>> Let me share some of my thoughts on this.
>>
>> - shall we filter out the package name from the release?
>>
>> Until the new runner is ready to be used in production (or, at least, for beta testing, in which case users should be clearly warned about its status), I believe we need to filter out its classes from the published jar to avoid confusion.
>>
>> Yes, that is what I think also.
>>
>> - should we release 2 jars: one for the old and one for the new?
>>
>> - should we release 3 jars: one for the old, one for the new and one for both?
>>
>> Once the new runner is released, I think we need to provide only one single jar and allow users to switch between the different Spark runners with a CLI option.
>>
>> I would vote for 3 jars: one for the new, one for the old, and one for both. Indeed, in some cases, users look very closely at the size of jars. This solution covers all use cases.
>>
>> - should we create a special entry in the capability matrix?
>>
>> Sure, since it has its own unique characteristics and implementation, but again, only once the new runner is "officially released".
>>
>> +1
>>
>> On 28 Oct 2019, at 10:27, Etienne Chauchot <[email protected]> wrote:
>>
>> Hi guys,
>>
>> Any opinions on point 2, communication to users?
>>
>> Etienne
>>
>> On 24/10/2019 15:44, Etienne Chauchot wrote:
>>
>> Hi guys,
>>
>> I'm glad to announce that the PR for the merge to master of the new runner based on the Spark Structured Streaming framework is submitted:
>>
>> https://github.com/apache/beam/pull/9866
>>
>> 1. Regarding the status of the runner:
>>
>> - the runner passes 93% of the validates runner tests in batch mode.
>>
>> - Streaming mode is barely started (waiting for multi-aggregations support in the Spark Structured Streaming framework from the Spark community)
>>
>> - Runner can execute Nexmark
>>
>> - Some things are not wired up yet:
>>
>>   - Beam schemas are not wired with Spark schemas
>>
>>   - Optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>>
>> 2. Regarding the communication to users:
>>
>> - for the reasons explained by Ismaël, the runner is in the same module as the "older" one. But it is in a different sub-package and both runners share the same build.
>>
>> - How should we communicate to users:
>>
>> - shall we filter out the package name from the release?
>>
>> - should we release 2 jars: one for the old and one for the new?
>>
>> - should we release 3 jars: one for the old, one for the new and one for both?
>>
>> - should we create a special entry in the capability matrix?
>>
>> WDYT?
>>
>> Best
>>
>> Etienne
>>
>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>>
>> +1 to merge.
>>
>> It is worth keeping things in master with an explicitly marked status. It will make the effort more visible to users and easier to get feedback upon.
>>
>> --Mikhail
>>
>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <[email protected]> wrote:
>>
>>> Hi guys,
>>>
>>> The new Spark runner now supports Beam coders and passes 93% of the batch validates runner tests (+4%). I think it is time to merge it to master. I will submit a PR in the coming days.
>>>
>>> Next steps: support schemas and thus better leverage the Catalyst optimizer (among other things, optimizations based on data), and port the performance optimizations that were done in the current runner.
>>>
>>> Best
>>>
>>> Etienne
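(To make "supports Beam coders" concrete: below is a minimal, illustrative round trip through a Beam coder using the SDK's CoderUtils. The actual runner work wraps this kind of encoding inside Spark's Encoder machinery so that Spark's ser/de goes through Beam coders rather than Java/Kryo serialization; that wiring lives in the branch and is not shown here.)

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.util.CoderUtils;

    public class BeamCoderRoundTrip {
      public static void main(String[] args) throws Exception {
        // A Beam coder defines the byte-level format of pipeline elements.
        Coder<String> coder = StringUtf8Coder.of();

        // Encode an element to bytes with the Beam coder...
        byte[] bytes = CoderUtils.encodeToByteArray(coder, "element");

        // ...and decode it back. The runner's Encoder work performs exactly
        // this kind of round trip wherever Spark needs to serialize data.
        String back = CoderUtils.decodeFromByteArray(coder, bytes);
        System.out.println(back); // prints "element"
      }
    }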
>>> On 11/10/2019 22:48, Pablo Estrada wrote:
>>>
>>> +1 for merging : )
>>>
>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <[email protected]> wrote:
>>>
>>>> Sounds like a good plan to me.
>>>>
>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <[email protected]> wrote:
>>>>
>>>>> Comments inline
>>>>>
>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>>
>>>>> +1
>>>>>
>>>>> The earlier we get to master the better, to encourage not only code contributions but, as importantly, early user feedback.
>>>>>
>>>>> Question is: do we keep the "old" spark runner for a while or not (or just keep on previous version/tag on git)?
>>>>>
>>>>> It is still too early to even start discussing when to remove the classical runner given that the new runner is still a WIP. However, the overall goal is that this runner becomes the de-facto one once the VR tests and the performance become at least equal to the classical runner. In the meantime, the best for users is that they co-exist; let's not forget that the other runner has already been battle-tested for more than 3 years and has had lots of improvements in the last year.
>>>>>
>>>>> +1 on what Ismael says: no soon removal.
>>>>>
>>>>> The plan I had in mind at first (that I showed at ApacheCon) was this, but I'm proposing moving the first gray label to before the red box.
>>>>>
>>>>> <beogijnhpieapoll.png>
>>>>>
>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years' worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would have normally been squashed away in the normal review process we do for the main branch, then, yes, some cleanup could be nice.)
>>>>>
>>>>> About the commits, we should encourage a clear history, but we also have to remove useless commits that are still present in the branch: commits of the "Fix errorprone" / "Cleaning" kind, and even commits that make better narrative sense together, should probably be squashed, because they do not bring much to the history. It is not about more or fewer commits; it is about their relevance, as Robert mentions.
>>>>>
>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>
>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>
>>>>> In its current form we cannot exclude it, but this relates to the other question, so better to explain a bit of history: the new runner used to live in its own module and subdirectory because it is a full blank-page rewrite, and the decision was not to use any of the classical runner's classes so as not to be constrained by its evolution.
>>>>>
>>>>> However, the reason to put it back in the same module, as a subdirectory, was to encourage early use. In more detail: the way you deploy Spark jobs today is usually by packaging and staging an uber jar (~200MB of pure dependency joy) that contains the user pipeline classes, the Spark runner module and its dependencies. If we had two Spark runners in separate modules, the user would need to repackage and redeploy their pipelines every time they want to switch from the classical Spark runner to the structured streaming runner, which is painful and time- and space-consuming compared with the one-module approach, where they just change the name of the runner class and that's it. The idea here is to make it easy for users to test the new runner, but at the same time make it easy to come back to the classical runner in case of any issue.
>>>>>
>>>>> Ismaël
>>>>>
>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <[email protected]> wrote:
>>>>>
>>>>> +1
>>>>>
>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>
>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>
>>>>> I did not see any major changes to dependencies, but I will also ask if it has major version differences so that you might want a separate artifact?
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <[email protected]> wrote:
>>>>>
>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <[email protected]> wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> You probably know that there has been, for several months, work on developing a new Spark runner based on the Spark Structured Streaming framework. This work is located in a feature branch here: https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>
>>>>> To attract more contributors and get some user feedback, we think it is time to merge it to master. Before doing so, some steps need to be achieved:
>>>>>
>>>>> - finish the work on Spark Encoders (which allow calling Beam coders), because right now the runner is in an unstable state (some transforms use the new way of doing ser/de and some use the old one, making a pipeline incoherent with respect to serialization)
>>>>>
>>>>> - clean history: the history contains commits from November 2018, so there is a good amount of work, and thus a consequent number of commits. They were already squashed, but not those from September 2019 on.
>>>>>
>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years' worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would have normally been squashed away in the normal review process we do for the main branch, then, yes, some cleanup could be nice.)
>>>>>
>>>>> Regarding status:
>>>>>
>>>>> - the runner passes 89% of the validates runner tests in batch mode. We hope to pass more with the new Encoders
>>>>>
>>>>> - Streaming mode is barely started (waiting for multi-aggregations support in the Spark SS framework from the Spark community)
>>>>>
>>>>> - Runner can execute Nexmark
>>>>>
>>>>> - Some things are not wired up yet:
>>>>>
>>>>>   - Beam schemas are not wired with Spark schemas
>>>>>
>>>>>   - Optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>>>>>
>>>>> WDYT, can we merge it to master once the 2 steps are done?
>>>>>
>>>>> I think that as long as it sits parallel to the existing runner, and is clearly marked with its status, it makes sense to me. How many changes does it make to the existing codebase (as opposed to adding new code)?
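(For context on the "validates runner" percentages quoted throughout the thread: a minimal example of the kind of runner-agnostic test involved. TestPipeline, PAssert, and the ValidatesRunner category are from the Beam Java SDK's testing package; the test body itself is illustrative only. Each runner module runs this shared test suite against its own runner, which is where the pass percentages come from.)

    import org.apache.beam.sdk.testing.PAssert;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.testing.ValidatesRunner;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    public class SimpleValidatesRunnerTest {
      // The runner under test is injected via pipeline options, so the same
      // test exercises the classical or the structured streaming runner.
      @Rule public final transient TestPipeline p = TestPipeline.create();

      @Test
      @Category(ValidatesRunner.class)
      public void testGlobalCount() {
        PAssert.thatSingleton(
                p.apply(Create.of("a", "b", "c")).apply(Count.globally()))
            .isEqualTo(3L);
        p.run().waitUntilFinish();
      }
    }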
