+1 to "one uber jar to rule them all." As Ryan said, we snuck the portable runner into master at a much less usable state than the structured streaming runner, and it didn't seem to cause any issues (although in retrospect we probably should have tagged it as @Experimental, I wasn't aware that was a thing at the time).
In general, until we add visible documentation on the website, I suspect most users will be completely unaware the new runner is even there.

On Fri, Nov 8, 2019 at 9:50 AM Kenneth Knowles <[email protected]> wrote:

> On Thu, Nov 7, 2019 at 5:32 PM Etienne Chauchot <[email protected]> wrote:
> >
> > Hi guys,
> >
> > @Kenn,
> >
> > I just wanted to mention that I did answer your question on dependencies here:
> > https://lists.apache.org/thread.html/5a85caac41e796c2aa351d835b3483808ebbbd4512b480940d494439@%3Cdev.beam.apache.org%3E
>
> Ah, sorry! In that case there is no problem at all.
>
> > I'm not in favor of having the 2 runners in one jar; the point of having 2 jars was to:
> >
> > - avoid making promises to users about a work-in-progress runner (make it explicit with a different jar)
> > - avoid confusion for them (why are there 2 pipeline options? etc.)
> >
> > If the community believes that there is no confusion or wrong promises with the one-jar solution, we could leave the 2 runners in one jar.
> >
> > Maybe we could start a vote on that?
>
> It seems unanimous among others to have one jar. There were some suggestions of how to avoid promises and confusion, like Ryan's most recent email. Did any of the ideas sound good to you?
>
> Kenn
>
>> I have no objection to putting the experimental runner alongside the stable, mature runner. We have some precedent with the portable Spark runner, and that's worked out pretty well -- at least, I haven't heard any complaints from confused users!
>>
>> That being said:
>>
>> 1. It really should be marked @Experimental in the code *and* clearly warned about in the API (javadoc) and documentation.
>>
>> 2. Ideally, I'd like to see a warning banner in the logs when it's used, pointing to the stable SparkRunner and/or documentation on the current known issues.
>>
>> All my best, Ryan
>>
>> > Regarding jars:
>> >
>> > I don't like 3 jars either.
>> >
>> > Etienne
>> >
>> > On 31/10/2019 02:06, Kenneth Knowles wrote:
>> >
>> > Very good points. We definitely ship a lot of code/features in very early stages, and there seems to be no problem.
>> >
>> > I intend mostly to leave this judgment to people like you who know Spark users better.
>> >
>> > But I do think 1 or 2 jars is better than 3. I really don't like "3 jars", and I did give two reasons:
>> >
>> > 1. diamond deps where things overlap
>> > 2. figuring out which thing to depend on
>> >
>> > Both are annoying for users. I am not certain whether it could lead to a truly unsolvable situation. This is just a Java ecosystem problem, so I feel qualified to comment.
>> >
>> > I did also ask if there were major dependency differences between the two that could cause problems for users. This question was dropped and no one cared to comment, so I assume it is not an issue. So then I favor having just 1 jar with both runners.
>> >
>> > Kenn
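A minimal sketch of the warning banner Ryan suggests in point 2, assuming an SLF4J logger as used elsewhere in Beam; the class name and message text are illustrative, not the actual runner code:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class ExperimentalRunnerBanner {
      private static final Logger LOG = LoggerFactory.getLogger(ExperimentalRunnerBanner.class);

      // Emitted once when the experimental runner starts a pipeline.
      static void warnExperimental() {
        LOG.warn(
            "The structured streaming Spark runner is EXPERIMENTAL: streaming mode, the state API,"
                + " the timer API and splittable DoFn are not supported yet."
                + " For production use, prefer the stable SparkRunner;"
                + " see the Spark runner page on the Beam website for known issues.");
      }
    }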
>> > On Wed, Oct 30, 2019 at 2:46 PM Ismaël Mejía <[email protected]> wrote:
>> >>
>> >> I am still a bit lost about why we are discussing options without giving any arguments or reasons for them. Why is 2 modules better than 3, or 3 better than 2, or, even better, what forces us to have anything other than a single module?
>> >>
>> >> What are the reasons for wanting separate jars? If the issue is that the code is unfinished or not passing the tests, the impact on end users is minimal because they cannot accidentally end up running the new runner, and if they decide to do so we can warn them, in the documentation and in the runner itself, that it is at their own risk and not ready for production.
>> >>
>> >> If the fear is that new code may end up intertwined with the classic and portable runners and have side effects: we have ValidatesRunner + Nexmark in the CI to cover this, so again I do not see a problem that requires the modules to be separate.
>> >>
>> >> If the issue is discomfort about having in-progress code in released artifacts, we have been doing this in Beam forever, for example with most of the work on portability and Schema/SQL, all of which was part of released artifacts long before it was ready for prime time, so I still don't see why this case is different enough to require separate artifacts.
>> >>
>> >> I have the impression we are trying to solve a non-issue by adding a lot of artificial complexity (in particular for the users), or am I missing something?
>> >>
>> >> On Wed, Oct 30, 2019 at 7:40 PM Kenneth Knowles <[email protected]> wrote:
>> >> >
>> >> > Oh, I mean that we ship just 2 jars.
>> >> >
>> >> > And since Spark users always build an uber jar, they can still depend on both of ours and be able to switch runners with a flag.
>> >> >
>> >> > I really dislike projects shipping overlapping jars. It is confusing and causes major diamond dependency problems.
>> >> >
>> >> > Kenn
>> >> >
>> >> > On Wed, Oct 30, 2019 at 11:12 AM Alexey Romanenko <[email protected]> wrote:
>> >> >>
>> >> >> Yes, agreed, two jars included in an uber jar will work in a similar way, though having 3 jars still looks quite confusing to me.
>> >> >>
>> >> >> On 29 Oct 2019, at 23:54, Kenneth Knowles <[email protected]> wrote:
>> >> >>
>> >> >> Is it just as easy to have two jars and build an uber jar with both included? Then the runner can still be toggled with a flag.
>> >> >>
>> >> >> Kenn
>> >> >>
>> >> >> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <[email protected]> wrote:
>> >> >>>
>> >> >>> Hmm, I don't think that jar size should play a big role compared to the whole size of the shaded jar of a user's job. Moreover, I think it will be quite confusing for users to choose which jar to use if we have 3 different ones for similar purposes. Though, let's see what others think.
>> >> >>>
>> >> >>> On 29 Oct 2019, at 15:32, Etienne Chauchot <[email protected]> wrote:
>> >> >>>
>> >> >>> Hi Alexey,
>> >> >>>
>> >> >>> Thanks for your opinion!
>> >> >>>
>> >> >>> Comments inline
>> >> >>>
>> >> >>> Etienne
>> >> >>>
>> >> >>> On 28/10/2019 17:34, Alexey Romanenko wrote:
>> >> >>>
>> >> >>> Let me share some of my thoughts on this.
>> >> >>>
>> >> >>> - shall we filter out the package name from the release?
>> >> >>>
>> >> >>> Until the new runner is ready to be used in production (or, at least, for beta testing, with users clearly warned about that), I believe we need to filter out its classes from the published jar to avoid confusion.
>> >> >>>
>> >> >>> Yes, that is what I think also.
>> >> >>>
>> >> >>> - should we release 2 jars: one for the old and one for the new?
>> >> >>> - should we release 3 jars: one for the old, one for the new, and one for both?
>> >> >>>
>> >> >>> Once the new runner is released, I think we need to provide only one single jar and allow users to switch between the different Spark runners with a CLI option.
>> >> >>>
>> >> >>> I would vote for 3 jars: one for the new, one for the old, and one for both. Indeed, in some cases users look very closely at the size of jars. This solution meets all use cases.
>> >> >>>
>> >> >>> - should we create a special entry in the capability matrix?
>> >> >>>
>> >> >>> Sure, since it has its own unique characteristics and implementation, but again, only once the new runner is "officially released".
>> >> >>>
>> >> >>> +1
>> >> >>>
>> >> >>> On 28 Oct 2019, at 10:27, Etienne Chauchot <[email protected]> wrote:
>> >> >>>
>> >> >>> Hi guys,
>> >> >>>
>> >> >>> Any opinions on the point 2, communication to users?
>> >> >>>
>> >> >>> Etienne
>> >> >>>
>> >> >>> On 24/10/2019 15:44, Etienne Chauchot wrote:
>> >> >>>
>> >> >>> Hi guys,
>> >> >>>
>> >> >>> I'm glad to announce that the PR for the merge to master of the new runner based on the Spark Structured Streaming framework is submitted:
>> >> >>>
>> >> >>> https://github.com/apache/beam/pull/9866
>> >> >>>
>> >> >>> 1. Regarding the status of the runner:
>> >> >>>
>> >> >>> - the runner passes 93% of the validates runner tests in batch mode
>> >> >>>
>> >> >>> - streaming mode is barely started (waiting for multi-aggregation support in the Spark Structured Streaming framework from the Spark community)
>> >> >>>
>> >> >>> - the runner can execute Nexmark
>> >> >>>
>> >> >>> - some things are not wired up yet:
>> >> >>>
>> >> >>>   - Beam schemas are not wired to Spark schemas
>> >> >>>
>> >> >>>   - optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>> >> >>>
>> >> >>> 2. Regarding the communication to users:
>> >> >>>
>> >> >>> - for the reasons explained by Ismaël, the runner is in the same module as the "older" one, but in a different sub-package, and both runners share the same build.
>> >> >>>
>> >> >>> - How should we communicate to users:
>> >> >>>
>> >> >>>   - shall we filter out the package name from the release?
>> >> >>>
>> >> >>>   - should we release 2 jars: one for the old and one for the new?
>> >> >>>
>> >> >>>   - should we release 3 jars: one for the old, one for the new, and one for both?
>> >> >>>
>> >> >>>   - should we create a special entry in the capability matrix?
>> >> >>>
>> >> >>> WDYT?
>> >> >>>
>> >> >>> Best,
>> >> >>>
>> >> >>> Etienne
>> >> >>>
>> >> >>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>> >> >>>
>> >> >>> +1 to merge.
>> >> >>>
>> >> >>> It is worth keeping things in master with an explicitly marked status. It will make the effort more visible to users and easier to get feedback on.
>> >> >>>
>> >> >>> --Mikhail
>> >> >>>
>> >> >>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <[email protected]> wrote:
>> >> >>>>
>> >> >>>> Hi guys,
>> >> >>>>
>> >> >>>> The new Spark runner now supports Beam coders and passes 93% of the batch validates runner tests (+4%). I think it is time to merge it to master. I will submit a PR in the coming days.
>> >> >>>>
>> >> >>>> Next steps: support schemas and thus better leverage the Catalyst optimizer (among other things, optimizations based on data), and port the perf optimizations that were done in the current runner.
>> >> >>>>
>> >> >>>> Best,
>> >> >>>>
>> >> >>>> Etienne
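To illustrate the CLI-option idea mentioned above: with both runners in one jar, switching runners is just a matter of the --runner pipeline option, with no repackaging or redeployment. A sketch; the structured streaming runner's registered name is an assumption here:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerSwitchExample {
      public static void main(String[] args) {
        // Pass --runner=SparkRunner for the classic runner, or (assumed name)
        // --runner=SparkStructuredStreamingRunner for the new one.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);
        // ... build the pipeline here ...
        pipeline.run().waitUntilFinish();
      }
    }

With the uber jar staged once, the same artifact serves both runners; only the flag changes between runs.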
>> >> >>>>
>> >> >>>> On 11/10/2019 22:48, Pablo Estrada wrote:
>> >> >>>>
>> >> >>>> +1 for merging : )
>> >> >>>>
>> >> >>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <[email protected]> wrote:
>> >> >>>>>
>> >> >>>>> Sounds like a good plan to me.
>> >> >>>>>
>> >> >>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <[email protected]> wrote:
>> >> >>>>>>
>> >> >>>>>> Comments inline
>> >> >>>>>>
>> >> >>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>> >> >>>>>>
>> >> >>>>>> +1
>> >> >>>>>>
>> >> >>>>>> The earlier we get to master the better, to encourage not only code contributions but, just as importantly, early user feedback.
>> >> >>>>>>
>> >> >>>>>> The question is: do we keep the "old" Spark runner for a while or not (or just keep it at a previous version/tag in git)?
>> >> >>>>>>
>> >> >>>>>> It is still too early to even start discussing when to remove the classical runner, given that the new runner is still a WIP. However, the overall goal is that this runner becomes the de facto one once the VR tests and the performance are at least equal to the classical runner's. In the meantime, the best thing for users is that they co-exist; let's not forget that the other runner has already been battle-tested for more than 3 years and has had lots of improvements in the last year.
>> >> >>>>>>
>> >> >>>>>> +1 on what Ismaël says: no removal soon.
>> >> >>>>>>
>> >> >>>>>> The plan I had in mind at first (which I showed at ApacheCon) was this, but I'm proposing to move the first gray label to before the red box.
>> >> >>>>>>
>> >> >>>>>> <beogijnhpieapoll.png>
>> >> >>>>>>
>> >> >>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years' worth of history away. (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would normally have been squashed away in the review process we do for the main branch, then, yes, some cleanup could be nice.)
>> >> >>>>>>
>> >> >>>>>> About the commits: we should encourage a clear history, but we also have to remove useless commits that are still present in the branch. Commits of the "Fix errorprone" / "Cleaning" kind, and even commits that make better narrative sense together, should probably be squashed, because they do not bring much to the history. It is not about more or fewer commits; it is about their relevance, as Robert mentions.
>> >> >>>>>>
>> >> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>> >> >>>>>>
>> >> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>> >> >>>>>>
>> >> >>>>>> In its current form we cannot exclude it, but this relates to the other question, so it is better to explain a bit of history: the new runner used to live in its own module and subdirectory because it is a full blank-page rewrite, and the decision was not to use any of the classical runner's classes so as not to be constrained by its evolution.
>> >> >>>>>>
>> >> >>>>>> However, the reason to put it back into the same module, as a subdirectory, was to encourage early use. In more detail: the way you deploy Spark jobs today is usually by packaging and staging an uber jar (~200MB of pure dependency joy) that contains the user pipeline classes, the Spark runner module, and its dependencies. If we had two Spark runners in separate modules, users would need to repackage and redeploy their pipelines every time they wanted to switch from the classical Spark runner to the structured streaming runner, which is painful and time- and space-consuming compared with the one-module approach, where they just change the name of the runner class and that's it. The idea here is to make it easy for users to test the new runner, but at the same time easy to come back to the classical runner in case of any issue.
>> >> >>>>>>
>> >> >>>>>> Ismaël
>> >> >>>>>>
>> >> >>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <[email protected]> wrote:
>> >> >>>>>>
>> >> >>>>>> +1
>> >> >>>>>>
>> >> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>> >> >>>>>>
>> >> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>> >> >>>>>>
>> >> >>>>>> I did not see any major changes to dependencies, but I will also ask: does it have major version differences such that you might want a separate artifact?
>> >> >>>>>>
>> >> >>>>>> Kenn
>> >> >>>>>>
>> >> >>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <[email protected]> wrote:
>> >> >>>>>>
>> >> >>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <[email protected]> wrote:
>> >> >>>>>>
>> >> >>>>>> Hi guys,
>> >> >>>>>>
>> >> >>>>>> You probably know that for several months there has been work on a new Spark runner based on the Spark Structured Streaming framework. This work is located in a feature branch here:
>> >> >>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>> >> >>>>>>
>> >> >>>>>> To attract more contributors and get some user feedback, we think it is time to merge it to master. Before doing so, some steps need to be achieved:
>> >> >>>>>>
>> >> >>>>>> - finish the work on Spark Encoders (which allow calling Beam coders) because, right now, the runner is in an unstable state (some transforms use the new way of doing ser/de and some use the old one, making a pipeline inconsistent with regard to serialization)
>> >> >>>>>>
>> >> >>>>>> - clean the history: it contains commits from November 2018, so there is a good amount of work and thus a considerable number of commits. They were already squashed, but not those from September 2019 on.
>> >> >>>>>>
>> >> >>>>>> I don't think the number of commits should be an issue--we shouldn't just squash years' worth of history away.
>> >> >>>>>> (OTOH, if this is a case of this branch containing lots of little, irrelevant commits that would normally have been squashed away in the review process we do for the main branch, then, yes, some cleanup could be nice.)
>> >> >>>>>>
>> >> >>>>>> Regarding status:
>> >> >>>>>>
>> >> >>>>>> - the runner passes 89% of the validates runner tests in batch mode. We hope to pass more with the new Encoders
>> >> >>>>>>
>> >> >>>>>> - streaming mode is barely started (waiting for multi-aggregation support in the Spark SS framework from the Spark community)
>> >> >>>>>>
>> >> >>>>>> - the runner can execute Nexmark
>> >> >>>>>>
>> >> >>>>>> - some things are not wired up yet:
>> >> >>>>>>
>> >> >>>>>>   - Beam schemas are not wired to Spark schemas
>> >> >>>>>>
>> >> >>>>>>   - optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
>> >> >>>>>>
>> >> >>>>>> WDYT, can we merge it to master once the 2 steps are done?
>> >> >>>>>>
>> >> >>>>>> I think that as long as it sits parallel to the existing runner, and is clearly marked with its status, it makes sense to me. How many changes does it make to the existing codebase (as opposed to adding new code)?
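On the Encoders point above: the underlying idea is that every value crossing a Spark ser/de boundary should round-trip through its Beam coder rather than through Java/Kryo serialization. A rough, self-contained sketch of that round-trip using Beam's CoderUtils; the runner's real Encoder bridge into Spark's Catalyst is more involved:

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.CoderException;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.util.CoderUtils;

    public class BeamCoderRoundTrip {
      // Encode a value to bytes with its Beam coder (what the serializer side of a
      // Spark Encoder would do), then decode it back (the deserializer side).
      static <T> T roundTrip(Coder<T> coder, T value) throws CoderException {
        byte[] bytes = CoderUtils.encodeToByteArray(coder, value);
        return CoderUtils.decodeFromByteArray(coder, bytes);
      }

      public static void main(String[] args) throws CoderException {
        System.out.println(roundTrip(StringUtf8Coder.of(), "hello beam-on-spark"));
      }
    }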
