Yes, agreed: two jars included in an uber jar will work in a similar way. Though having 3 jars still looks quite confusing to me.
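To make the "toggle with a flag" idea concrete, here is a minimal sketch of how a single uber jar containing both runners could pick one at launch time. `RunnerToggle` and `resolve` are hypothetical names used only for illustration, and the two fully qualified class names are assumptions about where the runners live in the jar; in a real pipeline Beam's own pipeline options would handle the `--runner` flag.

```java
import java.util.Map;

// Hypothetical helper, not part of Beam: maps a --runner flag to a runner class name.
final class RunnerToggle {
    // Assumed class names for illustration; verify them against the published artifact.
    private static final Map<String, String> RUNNERS = Map.of(
        "SparkRunner",
        "org.apache.beam.runners.spark.SparkRunner",
        "SparkStructuredStreamingRunner",
        "org.apache.beam.runners.spark.structuredstreaming.SparkStructuredStreamingRunner");

    // Returns the runner class name selected by --runner=<name>.
    // Defaults to the classical runner; the last --runner flag wins.
    static String resolve(String[] args) {
        String name = "SparkRunner";
        for (String arg : args) {
            if (arg.startsWith("--runner=")) {
                name = arg.substring("--runner=".length());
            }
        }
        String className = RUNNERS.get(name);
        if (className == null) {
            throw new IllegalArgumentException("Unknown runner: " + name);
        }
        return className;
    }

    public static void main(String[] args) {
        // Prints the class name that the given flags would select.
        System.out.println(resolve(args));
    }
}
```

With both runners in the same jar, switching back to the classical runner after a problem only means changing the flag value, with no repackaging or redeploying.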
> On 29 Oct 2019, at 23:54, Kenneth Knowles <k...@apache.org> wrote:
>
> Is it just as easy to have two jars and build an uber jar with both included?
> Then the runner can still be toggled with a flag.
>
> Kenn
>
> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <aromanenko....@gmail.com> wrote:
> Hmm, I don’t think that jar size should play a big role comparing to the
> whole size of shaded jar of users job. Even more, I think it will be quite
> confusing for users to choose which jar to use if we will have 3 different
> ones for similar purposes. Though, let’s see what others think.
>
>> On 29 Oct 2019, at 15:32, Etienne Chauchot <echauc...@apache.org> wrote:
>>
>> Hi Alexey,
>> Thanks for your opinion !
>>
>> Comments inline
>>
>> Etienne
>> On 28/10/2019 17:34, Alexey Romanenko wrote:
>>> Let me share some of my thoughts on this.
>>>>> - shall we filter out the package name from the release?
>>> Until new runner is not ready to be used in production (or, at least, be
>>> used for beta testing but users should be clearly warned about that in this
>>> case), I believe we need to filter out its classes from published jar to
>>> avoid a confusion.
>> Yes that is what I think also
>>>>> - should we release 2 jars: one for the old and one for the new ?
>>>>> - should we release 3 jars: one for the new, one for the new and one
>>>>> for both ?
>>>>>
>>> Once new runner will be released, then I think we need to provide only one
>>> single jar and allow user to switch between different Spark runners with
>>> CLI option.
>> I would vote for 3 jars: one for new, one for old, and one for both. Indeed,
>> in some cases, users are looking very closely at the size of jars. This
>> solution meets all use cases
>>>>> - should we create a special entry to the capability matrix ?
>>>>>
>>>
>>> Sure, since it has its own uniq characteristics and implementation, but
>>> again, only once new runner will be "officially released".
>> +1
>>>
>>>
>>>> On 28 Oct 2019, at 10:27, Etienne Chauchot <echauc...@apache.org> wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> Any opinions on the point2 communication to users ?
>>>>
>>>> Etienne
>>>> On 24/10/2019 15:44, Etienne Chauchot wrote:
>>>>> Hi guys,
>>>>>
>>>>> I'm glad to announce that the PR for the merge to master of the new
>>>>> runner based on Spark Structured Streaming framework is submitted:
>>>>>
>>>>> https://github.com/apache/beam/pull/9866
>>>>>
>>>>> 1. Regarding the status of the runner:
>>>>> -the runner passes 93% of the validates runner tests in batch mode.
>>>>>
>>>>> -Streaming mode is barely started (waiting for the multi-aggregations
>>>>> support in spark Structured Streaming framework from the Spark community)
>>>>>
>>>>> -Runner can execute Nexmark
>>>>>
>>>>> -Some things are not wired up yet
>>>>>
>>>>> -Beam Schemas not wired with Spark Schemas
>>>>>
>>>>> -Optional features of the model not implemented: state api, timer api,
>>>>> splittable doFn api, …
>>>>>
>>>>> 2. Regarding the communication to users:
>>>>>
>>>>> - for reasons explained by Ismael: the runner is in the same module as
>>>>> the "older" one. But it is in a different sub-package and both runners
>>>>> share the same build.
>>>>> - How should we communicate to users:
>>>>> - shall we filter out the package name from the release?
>>>>> - should we release 2 jars: one for the old and one for the new ?
>>>>> - should we release 3 jars: one for the new, one for the new and one
>>>>> for both ?
>>>>>
>>>>> - should we create a special entry to the capability matrix ?
>>>>>
>>>>> WDYT ?
>>>>> Best
>>>>>
>>>>> Etienne
>>>>>
>>>>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>>>>>> +1 to merge.
>>>>>>
>>>>>> It is worth keeping things in master with explicitly marked status. It
>>>>>> will make effort more visible to users and easier to get feedback upon.
>>>>>>
>>>>>> --Mikhail
>>>>>>
>>>>>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <echauc...@apache.org> wrote:
>>>>>> Hi guys,
>>>>>>
>>>>>> The new spark runner now supports beam coders and passes 93% of the
>>>>>> batch validates runner tests (+4%). I think it is time to merge it to
>>>>>> master. I will submit a PR in the coming days.
>>>>>>
>>>>>> next steps: support schemas and thus better leverage catalyst optimizer
>>>>>> (among other things optims based on data), port perfs optims that were
>>>>>> done in the current runner.
>>>>>> Best
>>>>>> Etienne
>>>>>> On 11/10/2019 22:48, Pablo Estrada wrote:
>>>>>>> +1 for merging : )
>>>>>>>
>>>>>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>> Sounds like a good plan to me.
>>>>>>>
>>>>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <echauc...@apache.org> wrote:
>>>>>>> Comments inline
>>>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>>>>> +1
>>>>>>>>
>>>>>>>> The earlier we get to master the better to encourage not only code
>>>>>>>> contributions but as important to have early user feedback.
>>>>>>>>
>>>>>>>>> Question is: do we keep the "old" spark runner for a while or not (or
>>>>>>>>> just keep on previous version/tag on git) ?
>>>>>>>> It is still too early to even start discussing when to remove the
>>>>>>>> classical runner given that the new runner is still a WIP. However the
>>>>>>>> overall goal is that this runner becomes the de-facto one once the VR
>>>>>>>> tests and the performance become at least equal to the classical
>>>>>>>> runner, in the meantime the best for users is that they co-exist,
>>>>>>>> let’s not forget that the other runner has been already battle tested
>>>>>>>> for more than 3 years and has had lots of improvements in the last
>>>>>>>> year.
>>>>>>> +1 on what Ismael says: no soon removal,
>>>>>>> The plan I had in mind at first (that I showed at the apacheCon) was
>>>>>>> this but I'm proposing moving the first gray label to before the red
>>>>>>> box.
>>>>>>> <beogijnhpieapoll.png>
>>>>>>>
>>>>>>>
>>>>>>>>> I don't think the number of commits should be an issue--we shouldn't
>>>>>>>>> just squash years worth of history away. (OTOH, if this is a case of
>>>>>>>>> this branch containing lots of little, irrelevant commits that would
>>>>>>>>> have normally been squashed away in the normal review process we do
>>>>>>>>> for the main branch, then, yes, some cleanup could be nice.)
>>>>>>>> About the commits we should encourage a clear history but we have also
>>>>>>>> to remove useless commits that are still present in the branch,
>>>>>>>> commits of the “Fix errorprone” / “Cleaning” kind and even commits
>>>>>>>> that make a better narrative sense together should be probably
>>>>>>>> squashed, because they do not bring much to the history. It is not
>>>>>>>> about more or less commits it is about its relevance as Robert
>>>>>>>> mentions.
>>>>>>>>
>>>>>>>>> I think our experiences with things that go to master early have been
>>>>>>>>> very good. So I am in favor ASAP. We can exclude it from releases
>>>>>>>>> easily until it is ready for end users.
>>>>>>>>> I have the same question as Robert - how much is modifications and
>>>>>>>>> how much is new? I notice it is in a subdirectory of the
>>>>>>>>> beam-runners-spark module.
>>>>>>>> In its current form we cannot exclude it but this relates to the other
>>>>>>>> question, so better to explain a bit of history: The new runner used
>>>>>>>> to live in its own module and subdirectory because it is a full blank
>>>>>>>> page rewrite and the decision was not to use any of the classical
>>>>>>>> runner classes to not be constrained by its evolution.
>>>>>>>>
>>>>>>>> However the reason to put it back in the same module as a subdirectory
>>>>>>>> was to encourage early use, in more detail: The way you deploy spark
>>>>>>>> jobs today is usually by packaging and staging an uber jar (~200MB of
>>>>>>>> pure dependency joy) that contains the user pipeline classes, the
>>>>>>>> spark runner module and its dependencies. If we have two spark runners
>>>>>>>> in separate modules the user would need to repackage and redeploy
>>>>>>>> their pipelines every time they want to switch from the classical
>>>>>>>> Spark runner to the structured streaming runner which is painful and
>>>>>>>> time and space consuming compared with the one module approach where
>>>>>>>> they just change the name of the runner class and that’s it. The idea
>>>>>>>> here is to make easy for users to test the new runner, but at the same
>>>>>>>> time to make easy to come back to the classical runner in case of any
>>>>>>>> issue.
>>>>>>>>
>>>>>>>> Ismaël
>>>>>>>>
>>>>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> I think our experiences with things that go to master early have been
>>>>>>>>> very good. So I am in favor ASAP. We can exclude it from releases
>>>>>>>>> easily until it is ready for end users.
>>>>>>>>>
>>>>>>>>> I have the same question as Robert - how much is modifications and
>>>>>>>>> how much is new? I notice it is in a subdirectory of the
>>>>>>>>> beam-runners-spark module.
>>>>>>>>>
>>>>>>>>> I did not see any major changes to dependencies but I will also ask
>>>>>>>>> if it has major version differences so that you might want a separate
>>>>>>>>> artifact?
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw
>>>>>>>>> <rober...@google.com> wrote:
>>>>>>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot
>>>>>>>>>> <echauc...@apache.org> wrote:
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> You probably know that there has been for several months an work
>>>>>>>>>>> developing a new Spark runner based on Spark Structured Streaming
>>>>>>>>>>> framework. This work is located in a feature branch here:
>>>>>>>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>>>>>>>
>>>>>>>>>>> To attract more contributors and get some user feedback, we think it is
>>>>>>>>>>> time to merge it to master. Before doing so, some steps need to be
>>>>>>>>>>> achieved:
>>>>>>>>>>>
>>>>>>>>>>> - finish the work on spark Encoders (that allow to call Beam coders)
>>>>>>>>>>> because, right now, the runner is in an unstable state (some transforms
>>>>>>>>>>> use the new way of doing ser/de and some use the old one, making a
>>>>>>>>>>> pipeline incoherent toward serialization)
>>>>>>>>>>>
>>>>>>>>>>> - clean history: The history contains commits from November 2018, so
>>>>>>>>>>> there is a good amount of work, thus a consequent number of commits.
>>>>>>>>>>> They were already squashed but not from September 2019
>>>>>>>>>> I don't think the number of commits should be an issue--we shouldn't
>>>>>>>>>> just squash years worth of history away. (OTOH, if this is a case of
>>>>>>>>>> this branch containing lots of little, irrelevant commits that would
>>>>>>>>>> have normally been squashed away in the normal review process we do
>>>>>>>>>> for the main branch, then, yes, some cleanup could be nice.)
>>>>>>>>>>
>>>>>>>>>>> Regarding status:
>>>>>>>>>>>
>>>>>>>>>>> - the runner passes 89% of the validates runner tests in batch mode. We
>>>>>>>>>>> hope to pass more with the new Encoders
>>>>>>>>>>>
>>>>>>>>>>> - Streaming mode is barely started (waiting for the multi-aggregations
>>>>>>>>>>> support in spark SS framework from the Spark community)
>>>>>>>>>>>
>>>>>>>>>>> - Runner can execute Nexmark
>>>>>>>>>>>
>>>>>>>>>>> - Some things are not wired up yet
>>>>>>>>>>>
>>>>>>>>>>> - Beam Schemas not wired with Spark Schemas
>>>>>>>>>>>
>>>>>>>>>>> - Optional features of the model not implemented: state api, timer
>>>>>>>>>>> api, splittable doFn api, …
>>>>>>>>>>>
>>>>>>>>>>> WDYT, can we merge it to master once the 2 steps are done ?
>>>>>>>>>> I think that as long as it sits parallel to the existing runner, and
>>>>>>>>>> is clearly marked with its status, it makes sense to me. How many
>>>>>>>>>> changes does it make to the existing codebase (as opposed to add new
>>>>>>>>>> code)?
>>>
>