+1 to "one uber jar to rule them all." As Ryan said, we snuck the portable runner into master at a much less usable state than the structured streaming runner, and it didn't seem to cause any issues (although in retrospect we probably should have tagged it as @Experimental, I wasn't aware that was a thing at the time).
In general, until we add visible documentation on the website, I suspect most users will be completely unaware the new runner is even there.

On Fri, Nov 8, 2019 at 9:50 AM Kenneth Knowles <[email protected]> wrote:

> On Thu, Nov 7, 2019 at 5:32 PM Etienne Chauchot <[email protected]> wrote:
> >
> > Hi guys,
> >
> > @Kenn,
> >
> > I just wanted to mention that I did answer your question on dependencies here:
> > https://lists.apache.org/thread.html/5a85caac41e796c2aa351d835b3483808ebbbd4512b480940d494439@%3Cdev.beam.apache.org%3E
>
> Ah, sorry! In that case there is no problem at all.
>
> > I'm not in favor of having the 2 runners in one jar; the point of having 2 jars was to:
> >
> > - avoid making promises to users about a work-in-progress runner (make it explicit with a different jar)
> > - avoid confusion for them (why are there 2 pipeline options? etc.)
> >
> > If the community believes that there is no confusion or wrong promises with the one-jar solution, we could leave the 2 runners in one jar.
> >
> > Maybe we could start a vote on that?
>
> It seems unanimous among others to have one jar. There were some suggestions of how to avoid promises and confusion, like Ryan's most recent email. Did any of the ideas sound good to you?
>
> Kenn
>
>> I have no objection to putting the experimental runner alongside the stable, mature runner. We have some precedent with the portable Spark runner, and that's worked out pretty well -- at least, I haven't heard any complaints from confused users!
>>
>> That being said:
>>
>> 1. It really should be marked @Experimental in the code *and* clearly warned about in the API (javadoc) and documentation.
>>
>> 2. Ideally, I'd like to see a warning banner in the logs when it's used, pointing to the stable SparkRunner and/or documentation on the current known issues.
>>
>> All my best, Ryan
>>
>> > Regarding jars:
>> >
>> > I don't like 3 jars either.
>> >
>> > Etienne
>> >
>> > On 31/10/2019 02:06, Kenneth Knowles wrote:
>> >
>> > Very good points. We definitely ship a lot of code/features in very early stages, and there seems to be no problem.
>> >
>> > I intend mostly to leave this judgment to people like you who know Spark users better.
>> >
>> > But I do think 1 or 2 jars is better than 3. I really don't like "3 jars", and I did give two reasons:
>> >
>> > 1. diamond deps where things overlap
>> > 2. figuring out which thing to depend on
>> >
>> > Both are annoying for users. I am not certain whether it could lead to a truly unsolvable situation. This is just a Java ecosystem problem, so I feel qualified to comment.
>> >
>> > I did also ask if there were major dependency differences between the two that could cause problems for users. This question was dropped and no one cared to comment, so I assume it is not an issue. So then I favor having just 1 jar with both runners.
>> >
>> > Kenn
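A minimal sketch of the warning banner Ryan suggests in point 2, assuming an SLF4J logger as used elsewhere in Beam; the class name and message text are illustrative, not the actual runner code:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class ExperimentalRunnerBanner {
      private static final Logger LOG = LoggerFactory.getLogger(ExperimentalRunnerBanner.class);

      // Emitted once when the experimental runner starts a pipeline.
      static void warnExperimental() {
        LOG.warn(
            "The structured streaming Spark runner is EXPERIMENTAL: streaming mode, the state API,"
                + " the timer API and splittable DoFn are not supported yet."
                + " For production use, prefer the stable SparkRunner;"
                + " see the Spark runner page on the Beam website for known issues.");
      }
    }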
>> > On Wed, Oct 30, 2019 at 2:46 PM Ismaël Mejía <[email protected]> wrote:
>> >>
>> >> I am still a bit lost about why we are discussing options without giving any arguments or reasons for them. Why is 2 modules better than 3, or 3 better than 2, or, even better, what forces us to have anything other than a single module?
>> >>
>> >> What are the reasons for wanting separate jars? If the issue is that the code is unfinished or not passing the tests, the impact on end users is minimal because they cannot accidentally end up running the new runner, and if they decide to do so we can warn them, in the documentation and in the runner itself, that it is at their own risk and not ready for production.
>> >>
>> >> If the fear is that new code may end up intertwined with the classic and portable runners and have side effects: we have ValidatesRunner + Nexmark in the CI to cover this, so again I do not see a problem that requires the modules to be separate.
>> >>
>> >> If the issue is discomfort about having in-progress code in released artifacts, we have been doing this in Beam forever, for example with most of the work on portability and Schema/SQL, all of which was part of released artifacts long before it was ready for prime time, so I still don't see why this case is different enough to require separate artifacts.
>> >>
>> >> I have the impression we are trying to solve a non-issue by adding a lot of artificial complexity (in particular for the users), or am I missing something?
>> >>
>> >> On Wed, Oct 30, 2019 at 7:40 PM Kenneth Knowles <[email protected]> wrote:
>> >> >
>> >> > Oh, I mean that we ship just 2 jars.
>> >> >
>> >> > And since Spark users always build an uber jar, they can still depend on both of ours and be able to switch runners with a flag.
>> >> >
>> >> > I really dislike projects shipping overlapping jars. It is confusing and causes major diamond dependency problems.
>> >> >
>> >> > Kenn
>> >> >
>> >> > On Wed, Oct 30, 2019 at 11:12 AM Alexey Romanenko <[email protected]> wrote:
>> >> >>
>> >> >> Yes, agreed, two jars included in an uber jar will work in a similar way, though having 3 jars still looks quite confusing to me.
>> >> >>
>> >> >> On 29 Oct 2019, at 23:54, Kenneth Knowles <[email protected]> wrote:
>> >> >>
>> >> >> Is it just as easy to have two jars and build an uber jar with both included? Then the runner can still be toggled with a flag.
>> >> >>
>> >> >> Kenn
>> >> >>
>> >> >> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <[email protected]> wrote:
>> >> >>>
>> >> >>> Hmm, I don't think that jar size should play a big role compared to the whole size of the shaded jar of a user's job. Moreover, I think it will be quite confusing for users to choose which jar to use if we have 3 different ones for similar purposes. Though, let's see what others think.
>> >> >>>
>> >> >>> On 29 Oct 2019, at 15:32, Etienne Chauchot <[email protected]> wrote:
>> >> >>>
>> >> >>> Hi Alexey,
>> >> >>>
>> >> >>> Thanks for your opinion!
>> >> >>>
>> >> >>> Comments inline
>> >> >>>
>> >> >>> Etienne
>> >> >>>
>> >> >>> On 28/10/2019 17:34, Alexey Romanenko wrote:
>> >> >>>
>> >> >>> Let me share some of my thoughts on this.
>> >> >>>
>> >> >>> - shall we filter out the package name from the release?
>> >> >>>
>> >> >>> Until the new runner is ready to be used in production (or, at least, for beta testing, with users clearly warned about that), I believe we need to filter out its classes from the published jar to avoid confusion.
>> >> >>>
>> >> >>> Yes, that is what I think also.
>> >> >>>
>> >> >>> - should we release 2 jars: one for the old and one for the new?
>> >> >>> - should we release 3 jars: one for the old, one for the new, and one for both?
>> >> >>>
>> >> >>> Once the new runner is released, I think we need to provide only one single jar and allow users to switch between the different Spark runners with a CLI option.
>> >> >>>
>> >> >>> I would vote for 3 jars: one for the new, one for the old, and one for both. Indeed, in some cases users look very closely at the size of jars. This solution meets all use cases.
>> >> >>>
>> >> >>> - should we create a special entry in the capability matrix?
>> >> >>>
>> >> >>> Sure, since it has its own unique characteristics and implementation, but again, only once the new runner is "officially released".
>> >> >>>
>> >> >>> +1
>> >> >>>
>> >> >>> On 28 Oct 2019, at 10:27, Etienne Chauchot <[email protected]> wrote:
>> >> >>>
>> >> >>> Hi guys,
>> >> >>>
>> >> >>> Any opinions on the point 2, communication to users?
>> >> >>>
>> >> >>> Etienne
>> >> >>>
>> >> >>> On 24/10/2019 15:44, Etienne Chauchot wrote:
>> >> >>>
>> >> >>> Hi guys,
>> >> >>>
>> >> >>> I'm glad to announce that the PR for the merge to master of the new runner based on the Spark Structured Streaming framework is submitted:
>> >> >>>
>> >> >>> https://github.com/apache/beam/pull/9866
>> >> >>>
>> >> >>> 1. Regarding the status of the runner:
>> >> >>>
>> >> >>> - the runner passes 93% of the validates runner tests in batch mode
>> >> >>>
>> >> >>> - streaming mode is barely started (waiting for multi-aggregation support in the Spark Structured Streaming framework from the Spark community)
>> >> >>>
>> >> >>> - the runner can execute Nexmark
>> >> >>>
>> >> >>> - some things are not wired up yet:
>> >> >>>
>> >> >>>   - Beam schemas are not wired to Spark schemas
>> >> >>>
>> >> >>>   - optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>> >> >>>
>> >> >>> 2. Regarding the communication to users:
>> >> >>>
>> >> >>> - for the reasons explained by Ismaël, the runner is in the same module as the "older" one, but in a different sub-package, and both runners share the same build.
>> >> >>>
>> >> >>> - How should we communicate to users:
>> >> >>>
>> >> >>>   - shall we filter out the package name from the release?
>> >> >>>
>> >> >>>   - should we release 2 jars: one for the old and one for the new?
>> >> >>>
>> >> >>>   - should we release 3 jars: one for the old, one for the new, and one for both?
>> >> >>>
>> >> >>>   - should we create a special entry in the capability matrix?
>> >> >>>
>> >> >>> WDYT?
>> >> >>>
>> >> >>> Best,
>> >> >>>
>> >> >>> Etienne
>> >> >>>
>> >> >>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>> >> >>>
>> >> >>> +1 to merge.
>> >> >>>
>> >> >>> It is worth keeping things in master with an explicitly marked status. It will make the effort more visible to users and easier to get feedback on.
>> >> >>>
>> >> >>> --Mikhail
>> >> >>>
>> >> >>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <[email protected]> wrote:
>> >> >>>>
>> >> >>>> Hi guys,
>> >> >>>>
>> >> >>>> The new Spark runner now supports Beam coders and passes 93% of the batch validates runner tests (+4%). I think it is time to merge it to master. I will submit a PR in the coming days.
>> >> >>>>
>> >> >>>> Next steps: support schemas and thus better leverage the Catalyst optimizer (among other things, optimizations based on data), and port the perf optimizations that were done in the current runner.
>> >> >>>>
>> >> >>>> Best,
>> >> >>>>
>> >> >>>> Etienne
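To illustrate the CLI-option idea mentioned above: with both runners in one jar, switching runners is just a matter of the --runner pipeline option, with no repackaging or redeployment. A sketch; the structured streaming runner's registered name is an assumption here:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerSwitchExample {
      public static void main(String[] args) {
        // Pass --runner=SparkRunner for the classic runner, or (assumed name)
        // --runner=SparkStructuredStreamingRunner for the new one.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);
        // ... build the pipeline here ...
        pipeline.run().waitUntilFinish();
      }
    }

With the uber jar staged once, the same artifact serves both runners; only the flag changes between runs.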
>> >> >>>>
>> >> >>>> On 11/10/2019 22:48, Pablo Estrada wrote:
>> >> >>>>
>> >> >>>> +1 for merging : )
>> >> >>>>
>> >> >>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <[email protected]> wrote:
>> >> >>>>>
>> >> >>>>> Sounds like a good plan to me.
>> >> >>>>>
>> >> >>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <[email protected]> wrote:
>> >> >>>>>>
>> >> >>>>>> Comments inline
>> >> >>>>>>
>> >> >>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>> >> >>>>>>
>> >> >>>>>> +1
>> >> >>>>>>
>> >> >>>>>> The earlier we get to master the better, to encourage not only code contributions but, just as importantly, early user feedback.
>> >> >>>>>>
>> >> >>>>>> The question is: do we keep the "old" Spark runner for a while or not (or just keep it at a previous version/tag in git)?
>> >> >>>>>>
>> >> >>>>>> It is still too early to even start discussing when to remove the classical runner, given that the new runner is still a WIP. However, the overall goal is that this runner becomes the de facto one once the VR tests and the performance are at least equal to the classical runner's. In the meantime, the best thing for users is that they co-exist; let's not forget that the other runner has already been battle-tested for more than 3 years and has had lots of improvements in the last year.
>> >> >>>>>>
>> >> >>>>>> +1 on what Ismaël says: no removal soon.
>> >> >>>>>>
>> >> >>>>>> The plan I had in mind at first (which I showed at ApacheCon) was this, but I'm proposing to move the first gray label to before the red box.
>> >> >>>>>>
>> >> >>>>>> <beogijnhpieapoll.png>
>> >> >>>>>>
>> >> >>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years' worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would normally have been squashed away in the review process we do for the main branch, then, yes, some cleanup could be nice.)
>> >> >>>>>>
>> >> >>>>>> About the commits: we should encourage a clear history, but we also have to remove useless commits that are still present in the branch. Commits of the "Fix errorprone" / "Cleaning" kind, and even commits that make better narrative sense together, should probably be squashed, because they do not bring much to the history. It is not about more or fewer commits; it is about their relevance, as Robert mentions.
>> >> >>>>>>
>> >> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>> >> >>>>>>
>> >> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>> >> >>>>>>
>> >> >>>>>> In its current form we cannot exclude it, but this relates to the other question, so it is better to explain a bit of history: the new runner used to live in its own module and subdirectory because it is a full blank-page rewrite, and the decision was not to use any of the classical runner's classes so as not to be constrained by its evolution.
>> >> >>>>>>
>> >> >>>>>> However, the reason to put it back into the same module, as a subdirectory, was to encourage early use. In more detail: the way you deploy Spark jobs today is usually by packaging and staging an uber jar (~200MB of pure dependency joy) that contains the user pipeline classes, the Spark runner module, and its dependencies. If we had two Spark runners in separate modules, users would need to repackage and redeploy their pipelines every time they wanted to switch from the classical Spark runner to the structured streaming runner, which is painful and time- and space-consuming compared with the one-module approach, where they just change the name of the runner class and that's it. The idea here is to make it easy for users to test the new runner, but at the same time easy to come back to the classical runner in case of any issue.
>> >> >>>>>>
>> >> >>>>>> Ismaël
>> >> >>>>>>
>> >> >>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <[email protected]> wrote:
>> >> >>>>>>
>> >> >>>>>> +1
>> >> >>>>>>
>> >> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>> >> >>>>>>
>> >> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>> >> >>>>>>
>> >> >>>>>> I did not see any major changes to dependencies, but I will also ask: does it have major version differences such that you might want a separate artifact?
>> >> >>>>>>
>> >> >>>>>> Kenn
>> >> >>>>>>
>> >> >>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <[email protected]> wrote:
>> >> >>>>>>
>> >> >>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <[email protected]> wrote:
>> >> >>>>>>
>> >> >>>>>> Hi guys,
>> >> >>>>>>
>> >> >>>>>> You probably know that for several months there has been work on a new Spark runner based on the Spark Structured Streaming framework. This work is located in a feature branch here:
>> >> >>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>> >> >>>>>>
>> >> >>>>>> To attract more contributors and get some user feedback, we think it is time to merge it to master. Before doing so, some steps need to be achieved:
>> >> >>>>>>
>> >> >>>>>> - finish the work on Spark Encoders (which allow calling Beam coders) because, right now, the runner is in an unstable state (some transforms use the new way of doing ser/de and some use the old one, making a pipeline inconsistent with regard to serialization)
>> >> >>>>>>
>> >> >>>>>> - clean the history: it contains commits from November 2018, so there is a good amount of work and thus a considerable number of commits. They were already squashed, but not those from September 2019 on.
>> >> >>>>>>
>> >> >>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years' worth of history away.
>> >> >>>>>> (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would normally have been squashed away in the review process we do for the main branch, then, yes, some cleanup could be nice.)
>> >> >>>>>>
>> >> >>>>>> Regarding status:
>> >> >>>>>>
>> >> >>>>>> - the runner passes 89% of the validates runner tests in batch mode. We hope to pass more with the new Encoders
>> >> >>>>>>
>> >> >>>>>> - streaming mode is barely started (waiting for multi-aggregation support in the Spark SS framework from the Spark community)
>> >> >>>>>>
>> >> >>>>>> - the runner can execute Nexmark
>> >> >>>>>>
>> >> >>>>>> - some things are not wired up yet:
>> >> >>>>>>
>> >> >>>>>>   - Beam schemas are not wired to Spark schemas
>> >> >>>>>>
>> >> >>>>>>   - optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>> >> >>>>>>
>> >> >>>>>> WDYT, can we merge it to master once the 2 steps are done?
>> >> >>>>>>
>> >> >>>>>> I think that as long as it sits parallel to the existing runner, and is clearly marked with its status, it makes sense to me. How many changes does it make to the existing codebase (as opposed to adding new code)?
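On the Encoders point above: the underlying idea is that every value crossing a Spark ser/de boundary should round-trip through its Beam coder rather than through Java/Kryo serialization. A rough, self-contained sketch of that round-trip using Beam's CoderUtils; the runner's real Encoder bridge into Spark's Catalyst is more involved:

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.CoderException;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.util.CoderUtils;

    public class BeamCoderRoundTrip {
      // Encode a value to bytes with its Beam coder (what the serializer side of a
      // Spark Encoder would do), then decode it back (the deserializer side).
      static <T> T roundTrip(Coder<T> coder, T value) throws CoderException {
        byte[] bytes = CoderUtils.encodeToByteArray(coder, value);
        return CoderUtils.decodeFromByteArray(coder, bytes);
      }

      public static void main(String[] args) throws CoderException {
        System.out.println(roundTrip(StringUtf8Coder.of(), "hello beam-on-spark"));
      }
    }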
