Just a personal opinion! I would prefer ONE jar including all Spark runners, and I think the new Spark runner should be present in the release artifacts even if it's in an incomplete state.
I have no objection to putting the experimental runner alongside the stable, mature runner. We have some precedent with the portable Spark runner, and that's worked out pretty well -- at least, I haven't heard any complaints from confused users!

That being said:

1. It really should be marked @Experimental in the code *and* clearly warned about in the API (javadoc) and documentation.

2. Ideally, I'd like to see a warning banner in the logs when it's used, pointing to the stable SparkRunner and/or documentation on the current known issues.

All my best,
Ryan
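A minimal sketch of what points (1) and (2) above could look like on the new runner class. The class name, factory method, and warning text are illustrative only, not the code in the branch; only `@Experimental`, `PipelineRunner`, and SLF4J are existing Beam/Java APIs here.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.PipelineRunner;
import org.apache.beam.sdk.annotations.Experimental;
import org.apache.beam.sdk.options.PipelineOptions;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Experimental Spark runner based on Structured Streaming (illustrative sketch).
 *
 * <p><b>Warning:</b> this runner is experimental: streaming mode and the state, timer and
 * splittable DoFn APIs are not supported yet. For production workloads use the stable
 * SparkRunner.
 */
@Experimental
public class SparkStructuredStreamingRunner extends PipelineRunner<PipelineResult> {

  private static final Logger LOG =
      LoggerFactory.getLogger(SparkStructuredStreamingRunner.class);

  private final PipelineOptions options;

  private SparkStructuredStreamingRunner(PipelineOptions options) {
    this.options = options;
  }

  /** Conventional factory method used by Beam to construct a runner from options. */
  public static SparkStructuredStreamingRunner fromOptions(PipelineOptions options) {
    return new SparkStructuredStreamingRunner(options);
  }

  @Override
  public PipelineResult run(Pipeline pipeline) {
    // Point (2): warning banner pointing users back to the stable runner.
    LOG.warn(
        "*** The Structured Streaming Spark runner is EXPERIMENTAL: streaming mode and the"
            + " state/timer/splittable DoFn APIs are not supported yet. For production use"
            + " the stable SparkRunner; see the documentation for current known issues. ***");
    // Pipeline translation and submission are omitted from this sketch.
    throw new UnsupportedOperationException("sketch only -- no translation implemented");
  }
}
```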
On Thu, Nov 7, 2019 at 5:32 PM Etienne Chauchot <[email protected]> wrote:

> Hi guys,

> @Kenn,

> I just wanted to mention that I did answer your question on dependencies here:
> https://lists.apache.org/thread.html/5a85caac41e796c2aa351d835b3483808ebbbd4512b480940d494439@%3Cdev.beam.apache.org%3E

> Regarding jars:

> I don't like 3 jars either.

> I'm not in favor of having the 2 runners in one jar; the point of having 2 jars was to:

> - avoid making promises to users about a work-in-progress runner (make it explicit with a different jar)

> - avoid confusion for them (why are there 2 pipeline options? etc.)

> If the community believes that there is no confusion or false promises with the one-jar solution, we could leave the 2 runners in one jar.

> Maybe we could start a vote on that?

> Etienne

> On 31/10/2019 02:06, Kenneth Knowles wrote:

> Very good points. We definitely ship a lot of code/features in very early stages, and there seems to be no problem.

> I intend mostly to leave this judgment to people like you who know better about Spark users.

> But I do think 1 or 2 jars is better than 3. I really don't like "3 jars" and I did give two reasons:

> 1. diamond deps where things overlap
> 2. figuring out which thing to depend on

> Both are annoying for users. I am not certain whether it could lead to a truly unsolvable situation. This is just a Java ecosystem problem, so I feel qualified to comment.

> I did also ask if there were major dependency differences between the two that could cause problems for users. This question was dropped and no one cared to comment, so I assume it is not an issue. So then I favor having just 1 jar with both runners.

> Kenn

> On Wed, Oct 30, 2019 at 2:46 PM Ismaël Mejía <[email protected]> wrote:

>> I am still a bit lost about why we are discussing options without giving any arguments or reasons for the options. Why are 2 modules better than 3, or 3 better than 2, or, even better, what forces us to have something different from a single module?

>> What are the reasons for wanting separate jars? If the issue is that the code is unfinished or not passing the tests, the impact for end users is minimal because they cannot accidentally end up running the new runner, and if they decide to do so we can warn them, in the documentation and in the runner, that it is at their own risk and not ready for production.

>> If the fear is that new code may end up being intertwined with the classic and portable runners and have some side effects: we have the ValidatesRunner + Nexmark suites in the CI to cover this, so again I do not see what problem requires the modules to be separate.

>> If the issue is being uncomfortable about having in-progress code in released artifacts, we have been doing this in Beam forever; for example, most of the work on portability and Schema/SQL was part of released artifacts long before it was ready for prime time, so I still don't see why this case is different enough to require separate artifacts.

>> I have the impression we are trying to solve a non-issue by adding a lot of artificial complexity (in particular for the users), or am I missing something else?

>> On Wed, Oct 30, 2019 at 7:40 PM Kenneth Knowles <[email protected]> wrote:

>> > Oh, I mean that we ship just 2 jars.

>> > And since Spark users always build an uber jar, they can still depend on both of ours and be able to switch runners with a flag.

>> > I really dislike projects shipping overlapping jars. It is confusing and causes major diamond dependency problems.

>> > Kenn

>> > On Wed, Oct 30, 2019 at 11:12 AM Alexey Romanenko <[email protected]> wrote:

>> >> Yes, I agree, two jars included in an uber jar will work in a similar way. Though having 3 jars still looks quite confusing to me.

>> >> On 29 Oct 2019, at 23:54, Kenneth Knowles <[email protected]> wrote:

>> >> Is it just as easy to have two jars and build an uber jar with both included? Then the runner can still be toggled with a flag.

>> >> Kenn

>> >> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <[email protected]> wrote:

>> >>> Hmm, I don't think that jar size should play a big role compared to the whole size of the shaded jar of a user's job. Even more, I think it will be quite confusing for users to choose which jar to use if we have 3 different ones for similar purposes. Though, let's see what others think.

>> >>> On 29 Oct 2019, at 15:32, Etienne Chauchot <[email protected]> wrote:

>> >>> Hi Alexey,

>> >>> Thanks for your opinion!

>> >>> Comments inline

>> >>> Etienne

>> >>> On 28/10/2019 17:34, Alexey Romanenko wrote:

>> >>> Let me share some of my thoughts on this.

>> >>> - shall we filter out the package name from the release?

>> >>> Until the new runner is ready to be used in production (or, at least, for beta testing, in which case users should be clearly warned about that), I believe we need to filter out its classes from the published jar to avoid confusion.

>> >>> Yes, that is what I think also.

>> >>> - should we release 2 jars: one for the old and one for the new?

>> >>> - should we release 3 jars: one for the old, one for the new and one for both?

>> >>> Once the new runner is released, I think we need to provide only one single jar and allow users to switch between the different Spark runners with a CLI option.

>> >>> I would vote for 3 jars: one for the new, one for the old, and one for both. Indeed, in some cases users look very closely at the size of jars. This solution meets all use cases.

>> >>> - should we create a special entry in the capability matrix?

>> >>> Sure, since it has its own unique characteristics and implementation, but again, only once the new runner is "officially released".
>> >>> +1

>> >>> On 28 Oct 2019, at 10:27, Etienne Chauchot <[email protected]> wrote:

>> >>> Hi guys,

>> >>> Any opinions on point 2, the communication to users?

>> >>> Etienne

>> >>> On 24/10/2019 15:44, Etienne Chauchot wrote:

>> >>> Hi guys,

>> >>> I'm glad to announce that the PR for the merge to master of the new runner based on the Spark Structured Streaming framework is submitted:

>> >>> https://github.com/apache/beam/pull/9866

>> >>> 1. Regarding the status of the runner:

>> >>> - the runner passes 93% of the ValidatesRunner tests in batch mode.

>> >>> - Streaming mode is barely started (waiting for multi-aggregation support in the Spark Structured Streaming framework from the Spark community).

>> >>> - The runner can execute Nexmark.

>> >>> - Some things are not wired up yet:

>> >>>   - Beam schemas are not wired to Spark schemas.

>> >>>   - Optional features of the model are not implemented: state API, timer API, splittable DoFn API, …

>> >>> 2. Regarding the communication to users:

>> >>> - For the reasons explained by Ismaël, the runner is in the same module as the "older" one. But it is in a different sub-package and both runners share the same build.

>> >>> - How should we communicate to users:

>> >>>   - shall we filter out the package name from the release?

>> >>>   - should we release 2 jars: one for the old and one for the new?

>> >>>   - should we release 3 jars: one for the old, one for the new and one for both?

>> >>>   - should we create a special entry in the capability matrix?

>> >>> WDYT?

>> >>> Best

>> >>> Etienne

>> >>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:

>> >>> +1 to merge.

>> >>> It is worth keeping things in master with an explicitly marked status. It will make the effort more visible to users and easier to get feedback on.

>> >>> --Mikhail

>> >>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <[email protected]> wrote:

>> >>>> Hi guys,

>> >>>> The new Spark runner now supports Beam coders and passes 93% of the batch ValidatesRunner tests (+4%). I think it is time to merge it to master. I will submit a PR in the coming days.

>> >>>> Next steps: support schemas and thus better leverage the Catalyst optimizer (among other things, optimizations based on the data), and port the performance optimizations that were done in the current runner.

>> >>>> Best

>> >>>> Etienne

>> >>>> On 11/10/2019 22:48, Pablo Estrada wrote:

>> >>>> +1 for merging : )

>> >>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <[email protected]> wrote:

>> >>>>> Sounds like a good plan to me.

>> >>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <[email protected]> wrote:

>> >>>>>> Comments inline

>> >>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:

>> >>>>>> +1

>> >>>>>> The earlier we get to master the better, to encourage not only code contributions but, just as importantly, early user feedback.

>> >>>>>> The question is: do we keep the "old" Spark runner for a while or not (or just keep it on a previous version/tag in git)?
>> >>>>>> It is still too early to even start discussing when to remove the classical runner, given that the new runner is still a WIP. However, the overall goal is that this runner becomes the de-facto one once the VR tests and the performance are at least equal to the classical runner's. In the meantime, the best for users is that they co-exist; let's not forget that the other runner has already been battle-tested for more than 3 years and has had lots of improvements in the last year.

>> >>>>>> +1 on what Ismaël says: no removal soon.

>> >>>>>> The plan I had in mind at first (the one I showed at ApacheCon) was this, but I'm proposing to move the first gray label to before the red box.

>> >>>>>> [attached image: beogijnhpieapoll.png]

>> >>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years' worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would normally have been squashed away in the review process we do for the main branch, then, yes, some cleanup could be nice.)

>> >>>>>> About the commits: we should encourage a clear history, but we also have to remove useless commits that are still present in the branch. Commits of the "Fix errorprone" / "Cleaning" kind, and even commits that make better narrative sense together, should probably be squashed, because they do not bring much to the history. It is not about more or fewer commits, it is about their relevance, as Robert mentions.

>> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.

>> >>>>>> In its current form we cannot exclude it, but this relates to the other question, so it is better to explain a bit of history: the new runner used to live in its own module and subdirectory because it is a full blank-page rewrite, and the decision was not to use any of the classical runner classes so as not to be constrained by their evolution.

>> >>>>>> However, the reason to put it back in the same module, as a subdirectory, was to encourage early use. In more detail: the way you deploy Spark jobs today is usually by packaging and staging an uber jar (~200 MB of pure dependency joy) that contains the user pipeline classes, the Spark runner module and its dependencies. If we had two Spark runners in separate modules, the user would need to repackage and redeploy their pipeline every time they wanted to switch from the classical Spark runner to the Structured Streaming runner, which is painful and time- and space-consuming compared with the one-module approach, where they just change the name of the runner class and that's it. The idea here is to make it easy for users to test the new runner, but at the same time to make it easy to come back to the classical runner in case of any issue.

>> >>>>>> Ismaël
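As a user-side illustration of the "switch runners with a flag" idea discussed above: with both runners in one artifact (and therefore in one uber jar), toggling between them is only a matter of the --runner flag, with no repackaging or redeployment. A minimal sketch, assuming the new runner is registered under a name like SparkStructuredStreamingRunner (illustrative here; SparkRunner is the existing classical runner):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSwitchExample {
  public static void main(String[] args) {
    // Switching runners from the same uber jar is just a command-line flag, e.g.:
    //   --runner=SparkRunner                     (classical RDD/DStream-based runner)
    //   --runner=SparkStructuredStreamingRunner  (new Structured Streaming runner, name illustrative)
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline as usual ...
    pipeline.run().waitUntilFinish();
  }
}
```

Coming back to the classical runner after hitting an issue is then just re-running the same jar with --runner=SparkRunner.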
>> >>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <[email protected]> wrote:

>> >>>>>> +1

>> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.

>> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.

>> >>>>>> I did not see any major changes to dependencies, but I will also ask whether it has major version differences such that you might want a separate artifact?

>> >>>>>> Kenn

>> >>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <[email protected]> wrote:

>> >>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <[email protected]> wrote:

>> >>>>>> Hi guys,

>> >>>>>> You probably know that for several months there has been work on developing a new Spark runner based on the Spark Structured Streaming framework. This work is located in a feature branch here:
>> >>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming

>> >>>>>> To attract more contributors and get some user feedback, we think it is time to merge it to master. Before doing so, some steps need to be completed:

>> >>>>>> - finish the work on Spark Encoders (which allow calling Beam coders), because right now the runner is in an unstable state (some transforms use the new way of doing ser/de and some use the old one, making a pipeline inconsistent with regard to serialization)

>> >>>>>> - clean the history: the history contains commits from November 2018, so there is a good amount of work and thus a large number of commits. They were already squashed, but not those since September 2019.

>> >>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years' worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would normally have been squashed away in the review process we do for the main branch, then, yes, some cleanup could be nice.)

>> >>>>>> Regarding status:

>> >>>>>> - the runner passes 89% of the ValidatesRunner tests in batch mode. We hope to pass more with the new Encoders.

>> >>>>>> - Streaming mode is barely started (waiting for multi-aggregation support in the Spark Structured Streaming framework from the Spark community).

>> >>>>>> - The runner can execute Nexmark.

>> >>>>>> - Some things are not wired up yet:

>> >>>>>>   - Beam schemas are not wired to Spark schemas.

>> >>>>>>   - Optional features of the model are not implemented: state API, timer API, splittable DoFn API, …

>> >>>>>> WDYT, can we merge it to master once these 2 steps are done?

>> >>>>>> I think that as long as it sits parallel to the existing runner, and is clearly marked with its status, it makes sense to me. How many changes does it make to the existing codebase (as opposed to adding new code)?
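To make the "Spark Encoders (which allow calling Beam coders)" step above concrete: the goal is that element ser/de inside the runner goes through the pipeline's Beam Coders. Below is a naive, self-contained sketch of round-tripping an element through a Beam Coder; the actual work in the branch is to expose this kind of call as real Spark Encoders, which this sketch does not attempt. The KV<String, Long> element type is an arbitrary example.

```java
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.values.KV;

public class BeamCoderRoundTrip {
  public static void main(String[] args) throws Exception {
    // A Beam coder for KV<String, Long> elements, as a pipeline would declare it.
    Coder<KV<String, Long>> coder = KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of());

    // Encode the element to bytes with the Beam coder...
    byte[] bytes = CoderUtils.encodeToByteArray(coder, KV.of("user42", 3L));

    // ...and decode it back. In a Spark Dataset the byte[] column could be carried with
    // org.apache.spark.sql.Encoders.BINARY(); wrapping this in a first-class Spark
    // Encoder is the part this sketch leaves out.
    KV<String, Long> roundTripped = CoderUtils.decodeFromByteArray(coder, bytes);
    System.out.println(roundTripped);
  }
}
```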
