Very good points. We definitely ship a lot of code/features in very early stages, and there seems to be no problem.
I intend mostly to leave this judgment to people like you who know better about Spark users. But I do think 1 or 2 jars is better than 3. I really don't like "3 jars" and I did give two reasons:

1. diamond deps where things overlap
2. figuring out which thing to depend on

Both are annoying for users. I am not certain whether it could lead to a truly unsolvable situation. This is just a Java ecosystem problem, so I feel qualified to comment.

I did also ask if there were major dependency differences between the two that could cause problems for users. That question was dropped and no one cared to comment, so I assume it is not an issue. So then I favor having just 1 jar with both runners.

Kenn

On Wed, Oct 30, 2019 at 2:46 PM Ismaël Mejía <ieme...@gmail.com> wrote:

> I am still a bit lost about why we are discussing options without giving any arguments or reasons for the options? Why is 2 modules better than 3, or 3 better than 2, or, even better, what forces us to have something different than a single module?
>
> What are the reasons for wanting to have separate jars? If the issue is that the code is unfinished or not passing the tests, the impact for end users is minimal because they cannot accidentally end up running the new runner, and if they decide to do so we can warn them, in the documentation + runner, that it is at their own risk and not ready for production.
>
> If the fear is that new code may end up being intertwined with the classic and portable runners and have some side effects: we have the ValidatesRunner + Nexmark suites in the CI to cover this, so again I do not see what problem requires the modules to be separate.
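The "1 jar with both runners, toggled by a flag" idea above can be sketched in plain Java. This is an illustrative toy, not Beam's actual API; all names here (`PipelineRunner`, `fromFlag`, the flag values) are hypothetical:

```java
// Illustrative toy, not Beam's API: two runner implementations behind one
// interface, chosen by a flag string at launch time. With both on the
// classpath (a single jar), switching runners needs no repackaging.
interface PipelineRunner {
    String run();
}

class ClassicRunner implements PipelineRunner {
    public String run() { return "ran on the classic RDD-based runner"; }
}

class StructuredStreamingRunner implements PipelineRunner {
    public String run() { return "ran on the structured-streaming runner"; }
}

public class RunnerToggle {
    // Map a flag value to a runner implementation; flag names are hypothetical.
    static PipelineRunner fromFlag(String flag) {
        switch (flag) {
            case "SparkRunner":
                return new ClassicRunner();
            case "SparkStructuredStreamingRunner":
                return new StructuredStreamingRunner();
            default:
                throw new IllegalArgumentException("unknown runner: " + flag);
        }
    }

    public static void main(String[] args) {
        String flag = args.length > 0 ? args[0] : "SparkRunner";
        System.out.println(fromFlag(flag).run());
    }
}
```

With 3 overlapping jars, by contrast, the same classes can arrive on the classpath twice via different dependency paths, which is the diamond-dependency annoyance mentioned above.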
> If the issue is being uncomfortable about having in-progress code in released artifacts: we have been doing this in Beam forever, for example most of the work on portability and Schema/SQL, and all of those were part of the artifacts a long time before they were ready for prime use, so I still don't see why this case is different enough to require different artifacts.
>
> I have the impression we are trying to solve a non-issue by adding a lot of artificial complexity (in particular for the users), or am I missing something else?
>
> On Wed, Oct 30, 2019 at 7:40 PM Kenneth Knowles <k...@apache.org> wrote:
> >
> > Oh, I mean that we ship just 2 jars.
> >
> > And since Spark users always build an uber jar, they can still depend on both of ours and be able to switch runners with a flag.
> >
> > I really dislike projects shipping overlapping jars. It is confusing and causes major diamond dependency problems.
> >
> > Kenn
> >
> > On Wed, Oct 30, 2019 at 11:12 AM Alexey Romanenko <aromanenko....@gmail.com> wrote:
> >>
> >> Yes, agree, two jars included in an uber jar will work in a similar way. Though having 3 jars still looks quite confusing to me.
> >>
> >> On 29 Oct 2019, at 23:54, Kenneth Knowles <k...@apache.org> wrote:
> >>
> >> Is it just as easy to have two jars and build an uber jar with both included? Then the runner can still be toggled with a flag.
> >>
> >> Kenn
> >>
> >> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <aromanenko....@gmail.com> wrote:
> >>>
> >>> Hmm, I don't think that jar size should play a big role compared to the whole size of the shaded jar of a user's job. Even more, I think it will be quite confusing for users to choose which jar to use if we have 3 different ones for similar purposes. Though, let's see what others think.
> >>>
> >>> On 29 Oct 2019, at 15:32, Etienne Chauchot <echauc...@apache.org> wrote:
> >>>
> >>> Hi Alexey,
> >>>
> >>> Thanks for your opinion !
> >>>
> >>> Comments inline
> >>>
> >>> Etienne
> >>>
> >>> On 28/10/2019 17:34, Alexey Romanenko wrote:
> >>>
> >>> Let me share some of my thoughts on this.
> >>>
> >>> - shall we filter out the package name from the release?
> >>>
> >>> Until the new runner is ready to be used in production (or, at least, to be used for beta testing, in which case users should be clearly warned about that), I believe we need to filter out its classes from the published jar to avoid confusion.
> >>>
> >>> Yes, that is what I think also
> >>>
> >>> - should we release 2 jars: one for the old and one for the new ?
> >>>
> >>> - should we release 3 jars: one for the old, one for the new and one for both ?
> >>>
> >>> Once the new runner is released, I think we need to provide only one single jar and allow users to switch between the different Spark runners with a CLI option.
> >>>
> >>> I would vote for 3 jars: one for new, one for old, and one for both. Indeed, in some cases, users are looking very closely at the size of jars. This solution meets all use cases.
> >>>
> >>> - should we create a special entry in the capability matrix ?
> >>>
> >>> Sure, since it has its own unique characteristics and implementation, but again, only once the new runner is "officially released".
> >>>
> >>> +1
> >>>
> >>>
> >>> On 28 Oct 2019, at 10:27, Etienne Chauchot <echauc...@apache.org> wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> Any opinions on point 2, the communication to users ?
> >>>
> >>> Etienne
> >>>
> >>> On 24/10/2019 15:44, Etienne Chauchot wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> I'm glad to announce that the PR for the merge to master of the new runner based on the Spark Structured Streaming framework is submitted:
> >>>
> >>> https://github.com/apache/beam/pull/9866
> >>>
> >>>
> >>> 1. Regarding the status of the runner:
> >>>
> >>> - the runner passes 93% of the validates runner tests in batch mode.
> >>>
> >>> - Streaming mode is barely started (waiting for the multi-aggregations support in the Spark Structured Streaming framework from the Spark community)
> >>>
> >>> - Runner can execute Nexmark
> >>>
> >>> - Some things are not wired up yet:
> >>>
> >>>   - Beam Schemas not wired with Spark Schemas
> >>>
> >>>   - Optional features of the model not implemented: state api, timer api, splittable doFn api, …
> >>>
> >>>
> >>> 2. Regarding the communication to users:
> >>>
> >>> - for the reasons explained by Ismael, the runner is in the same module as the "older" one. But it is in a different sub-package, and both runners share the same build.
> >>>
> >>> - How should we communicate to users:
> >>>
> >>>   - shall we filter out the package name from the release?
> >>>
> >>>   - should we release 2 jars: one for the old and one for the new ?
> >>>
> >>>   - should we release 3 jars: one for the old, one for the new and one for both ?
> >>>
> >>>   - should we create a special entry in the capability matrix ?
> >>>
> >>> WDYT ?
> >>>
> >>> Best
> >>>
> >>> Etienne
> >>>
> >>>
> >>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
> >>>
> >>> +1 to merge.
> >>>
> >>> It is worth keeping things in master with an explicitly marked status. It will make the effort more visible to users and easier to get feedback upon.
> >>>
> >>> --Mikhail
> >>>
> >>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <echauc...@apache.org> wrote:
> >>>>
> >>>> Hi guys,
> >>>>
> >>>> The new Spark runner now supports Beam coders and passes 93% of the batch validates runner tests (+4%). I think it is time to merge it to master. I will submit a PR in the coming days.
> >>>>
> >>>> Next steps: support schemas and thus better leverage the Catalyst optimizer (among other things, optimizations based on data), and port the performance optimizations that were done in the current runner.
> >>>>
> >>>> Best
> >>>>
> >>>> Etienne
> >>>>
> >>>> On 11/10/2019 22:48, Pablo Estrada wrote:
> >>>>
> >>>> +1 for merging : )
> >>>>
> >>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <rober...@google.com> wrote:
> >>>>>
> >>>>> Sounds like a good plan to me.
> >>>>>
> >>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <echauc...@apache.org> wrote:
> >>>>>>
> >>>>>> Comments inline
> >>>>>>
> >>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
> >>>>>>
> >>>>>> +1
> >>>>>>
> >>>>>> The earlier we get to master the better, to encourage not only code contributions but, as importantly, early user feedback.
> >>>>>>
> >>>>>> The question is: do we keep the "old" Spark runner for a while or not (or just keep the previous version/tag on git) ?
> >>>>>>
> >>>>>> It is still too early to even start discussing when to remove the classical runner, given that the new runner is still a WIP. However, the overall goal is that this runner becomes the de-facto one once the VR tests and the performance become at least equal to the classical runner's. In the meantime the best for users is that they co-exist; let's not forget that the other runner has already been battle tested for more than 3 years and has had lots of improvements in the last year.
> >>>>>>
> >>>>>> +1 on what Ismael says: no imminent removal.
> >>>>>>
> >>>>>> The plan I had in mind at first (that I showed at ApacheCon) was this, but I'm proposing moving the first gray label to before the red box.
> >>>>>>
> >>>>>> <beogijnhpieapoll.png>
> >>>>>>
> >>>>>>
> >>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years worth of history away.
> >>>>>> (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would have normally been squashed away in the normal review process we do for the main branch, then, yes, some cleanup could be nice.)
> >>>>>>
> >>>>>> About the commits: we should encourage a clear history, but we also have to remove useless commits that are still present in the branch. Commits of the “Fix errorprone” / “Cleaning” kind, and even commits that make better narrative sense together, should probably be squashed, because they do not bring much to the history. It is not about more or fewer commits; it is about their relevance, as Robert mentions.
> >>>>>>
> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
> >>>>>>
> >>>>>> In its current form we cannot exclude it, but this relates to the other question, so better to explain a bit of history: the new runner used to live in its own module and subdirectory because it is a full blank-page rewrite, and the decision was not to use any of the classical runner classes so as not to be constrained by its evolution.
> >>>>>>
> >>>>>> However, the reason to put it back in the same module, as a subdirectory, was to encourage early use. In more detail: the way you deploy Spark jobs today is usually by packaging and staging an uber jar (~200MB of pure dependency joy) that contains the user pipeline classes, the Spark runner module and its dependencies.
> >>>>>> If we have two Spark runners in separate modules, the user would need to repackage and redeploy their pipelines every time they want to switch from the classical Spark runner to the structured streaming runner, which is painful and time- and space-consuming compared with the one-module approach, where they just change the name of the runner class and that's it. The idea here is to make it easy for users to test the new runner, but at the same time to make it easy to come back to the classical runner in case of any issue.
> >>>>>>
> >>>>>> Ismaël
> >>>>>>
> >>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <k...@apache.org> wrote:
> >>>>>>
> >>>>>> +1
> >>>>>>
> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
> >>>>>>
> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
> >>>>>>
> >>>>>> I did not see any major changes to dependencies, but I will also ask if it has major version differences such that you might want a separate artifact?
> >>>>>>
> >>>>>> Kenn
> >>>>>>
> >>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <rober...@google.com> wrote:
> >>>>>>
> >>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <echauc...@apache.org> wrote:
> >>>>>>
> >>>>>> Hi guys,
> >>>>>>
> >>>>>> You probably know that for several months there has been work developing a new Spark runner based on the Spark Structured Streaming framework. This work is located in a feature branch here:
> >>>>>>
> >>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
> >>>>>>
> >>>>>> To attract more contributors and get some user feedback, we think it is time to merge it to master.
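The "just change the name of the runner class" workflow described above can be sketched in plain Java: resolve the runner by (fully qualified) class name at launch time, which works as long as both runners are present in the uber jar. This is a hypothetical illustration, not Beam's actual mechanism; all class names here are invented:

```java
// Hypothetical sketch: select a runner by class name at launch time via
// reflection. Switching runners = changing one string (e.g. a CLI flag),
// with no repackaging, since both classes already sit in the uber jar.
public class RunnerByName {
    public static class ClassicRunner {
        @Override public String toString() { return "classic"; }
    }

    public static class StructuredStreamingRunner {
        @Override public String toString() { return "structured-streaming"; }
    }

    // Load and instantiate a runner class by its fully qualified name.
    static Object instantiate(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("could not load runner: " + className, e);
        }
    }

    public static void main(String[] args) {
        // In a real deployment this string would come from the command line.
        String name = RunnerByName.class.getName() + "$ClassicRunner";
        System.out.println(instantiate(name)); // prints "classic"
    }
}
```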
> >>>>>> Before doing so, some steps need to be achieved:
> >>>>>>
> >>>>>> - finish the work on Spark Encoders (that allow calling Beam coders), because right now the runner is in an unstable state (some transforms use the new way of doing ser/de and some use the old one, making a pipeline incoherent with regard to serialization)
> >>>>>>
> >>>>>> - clean the history: the history contains commits from November 2018 on, so there is a good amount of work and thus a substantial number of commits. They were already squashed, but not those from September 2019 onward
> >>>>>>
> >>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would have normally been squashed away in the normal review process we do for the main branch, then, yes, some cleanup could be nice.)
> >>>>>>
> >>>>>> Regarding status:
> >>>>>>
> >>>>>> - the runner passes 89% of the validates runner tests in batch mode. We hope to pass more with the new Encoders
> >>>>>>
> >>>>>> - Streaming mode is barely started (waiting for the multi-aggregations support in the Spark SS framework from the Spark community)
> >>>>>>
> >>>>>> - Runner can execute Nexmark
> >>>>>>
> >>>>>> - Some things are not wired up yet:
> >>>>>>
> >>>>>>   - Beam Schemas not wired with Spark Schemas
> >>>>>>
> >>>>>>   - Optional features of the model not implemented: state api, timer api, splittable doFn api, …
> >>>>>>
> >>>>>> WDYT, can we merge it to master once the 2 steps are done ?
> >>>>>>
> >>>>>> I think that as long as it sits parallel to the existing runner, and is clearly marked with its status, it makes sense to me. How many changes does it make to the existing codebase (as opposed to adding new code)?