Re: [spark structured streaming runner] merge to master?

Etienne Chauchot Wed, 13 Nov 2019 06:16:46 -0800

Ok for 1 jar with the 2 runners then.

I'll add the banner to the logs and the Experimental in the code and inin the javadocs.


Thanks for your opinions guys !

Etienne

On 08/11/2019 18:50, Kenneth Knowles wrote:

On Thu, Nov 7, 2019 at 5:32 PM Etienne Chauchot <[email protected]<mailto:[email protected]>> wrote:

>
> Hi guys
>
> @Kenn,
>

> I just wanted to mention that I did answered your question ondependencies here:https://lists.apache.org/thread.html/5a85caac41e796c2aa351d835b3483808ebbbd4512b480940d494439@%3Cdev.beam.apache.org%3E


Ah, sorry! In that case there is no problem at all.

> I'm not in favor of having the 2 runners in one jar, the point abouthaving 2 jars was to:

> - avoid making promises to users on a work in progress runner (makeit explicit with a different jar)

> - avoid confusion for them (why are there 2 pipeline options? etc....)
>

> If the community believes that there is no confusion or wrongpromises with the one jar solution, we could leave the 2 runners inone jar.

>
> Maybe we could start a vote on that?

It seems unanimous among others to have one jar. There were somesuggestions of how to avoid promises and confusion, like Ryan's mostrecent email. Did any of the ideas sound good to you?


Kenn


    I have no objection to putting the experimental runner alongside the
    stable, mature runner.  We have some precedence with the portable
    spark runner, and that's worked out pretty well -- at least, I haven't
    heard any complaints from confused users!

    That being said:

    1.  It really should be marked @Experimental in the code *and* clearly
    warned in API (javadoc) and documentation.

    2.  Ideally, I'd like to see a warning banner in the logs when it's
    used, pointing to the stable SparkRunner and/or documentation on the
    current known issues.

    All my best, Ryan






    > regarding jars:
    >
    > I don't like 3 jars either.
    >
    >
    > Etienne
    >
    > On 31/10/2019 02:06, Kenneth Knowles wrote:
    >
    > Very good points. We definitely ship a lot of code/features in
    very early stages, and there seems to be no problem.
    >
    > I intend mostly to leave this judgment to people like you who
    know better about Spark users.
    >
    > But I do think 1 or 2 jars is better than 3. I really don't like
    "3 jars" and I did give two reasons:
    >
    > 1. diamond deps where things overlap
    > 2. figuring out which thing to depend on
    >
    > Both are annoying for users. I am not certain if it could lead
    to a real unsolvable situation. This is just a Java ecosystem
    problem so I feel qualified to comment.
    >
    > I did also ask if there were major dependency differences
    between the two that could cause problems for users. This question
    was dropped and no one cares to comment so I assume it is not an
    issue. So then I favor having just 1 jar with both runners.
    >
    > Kenn
    >
    > On Wed, Oct 30, 2019 at 2:46 PM Ismaël Mejía <[email protected]
    <mailto:[email protected]>> wrote:
    >>
    >> I am still a bit lost about why we are discussing options
    without giving any
    >> arguments or reasons for the options? Why is 2 modules better
    than 3 or 3 better
    >> than 2, or even better, what forces us to have something
    different than a single
    >> module?
    >>
    >> What are the reasons for wanting to have separate jars? If the
    issue is that the
    >> code is unfinished or not passing the tests, the impact for end
    users is minimal
    >> because they cannot accidentally end up running the new runner,
    and if they
    >> decide to do so we can warn them it is at their own risk and
    not ready for
    >> production in the documentation + runner.
    >>
    >> If the fear is that new code may end up being intertwined with
    the classic and
    >> portable runners and have some side effects. We have the
    ValidatesRunner +
    >> Nexmark in the CI to cover this so again I do not see what is
    the problem that
    >> requires modules to be separate.
    >>
    >> If the issue is being uncomfortable about having in-progress
    code in released
    >> artifacts we have been doing this in Beam forever, for example
    most of the work
    >> on portability and Schema/SQL, and all of those were still part
    of artifacts
    >> long time before they were ready for prime use, so I still
    don't see why this
    >> case is different to require different artifacts.
    >>
    >> I have the impression we are trying to solve a non-issue by
    adding a lot of
    >> artificial complexity (in particular to the users), or am I
    missing something
    >> else?
    >>
    >> On Wed, Oct 30, 2019 at 7:40 PM Kenneth Knowles
    <[email protected] <mailto:[email protected]>> wrote:
    >> >
    >> > Oh, I mean that we ship just 2 jars.
    >> >
    >> > And since Spark users always build an uber jar, they can
    still depend on both of ours and be able to switch runners with a
    flag.
    >> >
    >> > I really dislike projects shipping overlapping jars. It is
    confusing and causes major diamond dependency problems.
    >> >
    >> > Kenn
    >> >
    >> > On Wed, Oct 30, 2019 at 11:12 AM Alexey Romanenko
    <[email protected] <mailto:[email protected]>> wrote:
    >> >>
    >> >> Yes, agree, two jars included in uber jar will work in the
    similar way. Though having 3 jars looks still quite confusing for me.
    >> >>
    >> >> On 29 Oct 2019, at 23:54, Kenneth Knowles <[email protected]
    <mailto:[email protected]>> wrote:
    >> >>
    >> >> Is it just as easy to have two jars and build an uber jar
    with both included? Then the runner can still be toggled with a flag.
    >> >>
    >> >> Kenn
    >> >>
    >> >> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko
    <[email protected] <mailto:[email protected]>> wrote:
    >> >>>
    >> >>> Hmm, I don’t think that jar size should play a big role
    comparing to the whole size of shaded jar of users job. Even more,
    I think it will be quite confusing for users to choose which jar
    to use if we will have 3 different ones for similar purposes.
    Though, let’s see what others think.
    >> >>>
    >> >>> On 29 Oct 2019, at 15:32, Etienne Chauchot
    <[email protected] <mailto:[email protected]>> wrote:
    >> >>>
    >> >>> Hi Alexey,
    >> >>>
    >> >>> Thanks for your opinion !
    >> >>>
    >> >>> Comments inline
    >> >>>
    >> >>> Etienne
    >> >>>
    >> >>> On 28/10/2019 17:34, Alexey Romanenko wrote:
    >> >>>
    >> >>> Let me share some of my thoughts on this.
    >> >>>
    >> >>>     - shall we filter out the package name from the release?
    >> >>>
    >> >>> Until new runner is not ready to be used in production (or,
    at least, be used for beta testing but users should be clearly
    warned about that in this case), I believe we need to filter out
    its classes from published jar to avoid a confusion.
    >> >>>
    >> >>> Yes that is what I think also
    >> >>>
    >> >>>     - should we release 2 jars: one for the old and one for
    the new ?
    >> >>>
    >> >>>     - should we release 3 jars: one for the new, one for
    the new and one for both ?
    >> >>>
    >> >>> Once new runner will be released, then I think we need to
    provide only one single jar and allow user to switch between
    different Spark runners with CLI option.
    >> >>>
    >> >>> I would vote for 3 jars: one for new, one for old, and one
    for both. Indeed, in some cases, users are looking very closely at
    the size of jars. This solution meets all use cases
    >> >>>
    >> >>>     - should we create a special entry to the capability
    matrix ?
    >> >>>
    >> >>> Sure, since it has its own uniq characteristics and
    implementation, but again, only once new runner will be
    "officially released".
    >> >>>
    >> >>> +1
    >> >>>
    >> >>>
    >> >>>
    >> >>> On 28 Oct 2019, at 10:27, Etienne Chauchot
    <[email protected] <mailto:[email protected]>> wrote:
    >> >>>
    >> >>> Hi guys,
    >> >>>
    >> >>> Any opinions on the point2 communication to users ?
    >> >>>
    >> >>> Etienne
    >> >>>
    >> >>> On 24/10/2019 15:44, Etienne Chauchot wrote:
    >> >>>
    >> >>> Hi guys,
    >> >>>
    >> >>> I'm glad to announce that the PR for the merge to master of
    the new runner based on Spark Structured Streaming framework is
    submitted:
    >> >>>
    >> >>> https://github.com/apache/beam/pull/9866
    >> >>>
    >> >>>
    >> >>> 1. Regarding the status of the runner:
    >> >>>
    >> >>> -the runner passes 93% of the validates runner tests in
    batch mode.
    >> >>>
    >> >>> -Streaming mode is barely started (waiting for the
    multi-aggregations support in spark Structured Streaming framework
    from the Spark community)
    >> >>>
    >> >>> -Runner can execute Nexmark
    >> >>>
    >> >>> -Some things are not wired up yet
    >> >>>
    >> >>>   -Beam Schemas not wired with Spark Schemas
    >> >>>
    >> >>>   -Optional features of the model not implemented: state
    api, timer api, splittable doFn api, …
    >> >>>
    >> >>>
    >> >>> 2. Regarding the communication to users:
    >> >>>
    >> >>> - for reasons explained by Ismael: the runner is in the
    same module as the "older" one. But it is in a different
    sub-package and both runners share the same build.
    >> >>>
    >> >>> - How should we communicate to users:
    >> >>>
    >> >>>     - shall we filter out the package name from the release?
    >> >>>
    >> >>>     - should we release 2 jars: one for the old and one for
    the new ?
    >> >>>
    >> >>>     - should we release 3 jars: one for the new, one for
    the new and one for both ?
    >> >>>
    >> >>>     - should we create a special entry to the capability
    matrix ?
    >> >>>
    >> >>> WDYT ?
    >> >>>
    >> >>> Best
    >> >>>
    >> >>> Etienne
    >> >>>
    >> >>>
    >> >>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
    >> >>>
    >> >>> +1 to merge.
    >> >>>
    >> >>> It is worth keeping things in master with explicitly marked
    status. It will make effort more visible to users and easier to
    get feedback upon.
    >> >>>
    >> >>> --Mikhail
    >> >>>
    >> >>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot
    <[email protected] <mailto:[email protected]>> wrote:
    >> >>>>
    >> >>>> Hi guys,
    >> >>>>
    >> >>>> The new spark runner now supports beam coders and passes
    93% of the batch validates runner tests (+4%). I think it is time
    to merge it to master. I will submit a PR in the coming days.
    >> >>>>
    >> >>>> next steps: support schemas and thus better leverage
    catalyst optimizer (among other things optims based on data), port
    perfs optims that were done in the current runner.
    >> >>>>
    >> >>>> Best
    >> >>>>
    >> >>>> Etienne
    >> >>>>
    >> >>>> On 11/10/2019 22:48, Pablo Estrada wrote:
    >> >>>>
    >> >>>> +1 for merging : )
    >> >>>>
    >> >>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw
    <[email protected] <mailto:[email protected]>> wrote:
    >> >>>>>
    >> >>>>> Sounds like a good plan to me.
    >> >>>>>
    >> >>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot
    <[email protected] <mailto:[email protected]>> wrote:
    >> >>>>>>
    >> >>>>>> Comments inline
    >> >>>>>>
    >> >>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
    >> >>>>>>
    >> >>>>>> +1
    >> >>>>>>
    >> >>>>>> The earlier we get to master the better to encourage not
    only code
    >> >>>>>> contributions but as important to have early user feedback.
    >> >>>>>>
    >> >>>>>> Question is: do we keep the "old" spark runner for a
    while or not (or just keep on previous version/tag on git) ?
    >> >>>>>>
    >> >>>>>> It is still too early to even start discussing when to
    remove the
    >> >>>>>> classical runner given that the new runner is still a
    WIP. However the
    >> >>>>>> overall goal is that this runner becomes the de-facto
    one once the VR
    >> >>>>>> tests and the performance become at least equal to the
    classical
    >> >>>>>> runner, in the meantime the best for users is that they
    co-exist,
    >> >>>>>> let’s not forget that the other runner has been already
    battle tested
    >> >>>>>> for more than 3 years and has had lots of improvements
    in the last
    >> >>>>>> year.
    >> >>>>>>
    >> >>>>>> +1 on what Ismael says: no soon removal,
    >> >>>>>>
    >> >>>>>> The plan I had in mind at first (that I showed at the
    apacheCon) was this but I'm proposing moving the first gray label
    to before the red box.
    >> >>>>>>
    >> >>>>>> <beogijnhpieapoll.png>
    >> >>>>>>
    >> >>>>>>
    >> >>>>>> I don't think the number of commits should be an
    issue--we shouldn't
    >> >>>>>> just squash years worth of history away. (OTOH, if this
    is a case of
    >> >>>>>> this branch containing lots of little, irrelevant
    commits that would
    >> >>>>>> have normally been squashed away in the normal review
    process we do
    >> >>>>>> for the main branch, then, yes, some cleanup could be nice.)
    >> >>>>>>
    >> >>>>>> About the commits we should encourage a clear history
    but we have also
    >> >>>>>> to remove useless commits that are still present in the
    branch,
    >> >>>>>> commits of the “Fix errorprone” / “Cleaning” kind and
    even commits
    >> >>>>>> that make a better narrative sense together should be
    probably
    >> >>>>>> squashed, because they do not bring much to the history.
    It is not
    >> >>>>>> about more or less commits it is about its relevance as
    Robert
    >> >>>>>> mentions.
    >> >>>>>>
    >> >>>>>> I think our experiences with things that go to master
    early have been very good. So I am in favor ASAP. We can exclude
    it from releases easily until it is ready for end users.
    >> >>>>>> I have the same question as Robert - how much is
    modifications and how much is new? I notice it is in a
    subdirectory of the beam-runners-spark module.
    >> >>>>>>
    >> >>>>>> In its current form we cannot exclude it but this
    relates to the other
    >> >>>>>> question, so better to explain a bit of history: The new
    runner used
    >> >>>>>> to live in its own module and subdirectory because it is
    a full blank
    >> >>>>>> page rewrite and the decision was not to use any of the
    classical
    >> >>>>>> runner classes to not be constrained by its evolution.
    >> >>>>>>
    >> >>>>>> However the reason to put it back in the same module as
    a subdirectory
    >> >>>>>> was to encourage early use, in more detail: The way you
    deploy spark
    >> >>>>>> jobs today is usually by packaging and staging an uber
    jar (~200MB of
    >> >>>>>> pure dependency joy) that contains the user pipeline
    classes, the
    >> >>>>>> spark runner module and its dependencies. If we have two
    spark runners
    >> >>>>>> in separate modules the user would need to repackage and
    redeploy
    >> >>>>>> their pipelines every time they want to switch from the
    classical
    >> >>>>>> Spark runner to the structured streaming runner which is
    painful and
    >> >>>>>> time and space consuming compared with the one module
    approach where
    >> >>>>>> they just change the name of the runner class and that’s
    it. The idea
    >> >>>>>> here is to make easy for users to test the new runner,
    but at the same
    >> >>>>>> time to make easy to come back to the classical runner
    in case of any
    >> >>>>>> issue.
    >> >>>>>>
    >> >>>>>> Ismaël
    >> >>>>>>
    >> >>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles
    <[email protected] <mailto:[email protected]>> wrote:
    >> >>>>>>
    >> >>>>>> +1
    >> >>>>>>
    >> >>>>>> I think our experiences with things that go to master
    early have been very good. So I am in favor ASAP. We can exclude
    it from releases easily until it is ready for end users.
    >> >>>>>>
    >> >>>>>> I have the same question as Robert - how much is
    modifications and how much is new? I notice it is in a
    subdirectory of the beam-runners-spark module.
    >> >>>>>>
    >> >>>>>> I did not see any major changes to dependencies but I
    will also ask if it has major version differences so that you
    might want a separate artifact?
    >> >>>>>>
    >> >>>>>> Kenn
    >> >>>>>>
    >> >>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw
    <[email protected] <mailto:[email protected]>> wrote:
    >> >>>>>>
    >> >>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot
    <[email protected] <mailto:[email protected]>> wrote:
    >> >>>>>>
    >> >>>>>> Hi guys,
    >> >>>>>>
    >> >>>>>> You probably know that there has been for several months
    an work
    >> >>>>>> developing a new Spark runner based on Spark Structured
    Streaming
    >> >>>>>> framework. This work is located in a feature branch here:
    >> >>>>>>
    https://github.com/apache/beam/tree/spark-runner_structured-streaming
    >> >>>>>>
    >> >>>>>> To attract more contributors and get some user feedback,
    we think it is
    >> >>>>>> time to merge it to master. Before doing so, some steps
    need to be achieved:
    >> >>>>>>
    >> >>>>>> - finish the work on spark Encoders (that allow to call
    Beam coders)
    >> >>>>>> because, right now, the runner is in an unstable state
    (some transforms
    >> >>>>>> use the new way of doing ser/de and some use the old
    one, making a
    >> >>>>>> pipeline incoherent toward serialization)
    >> >>>>>>
    >> >>>>>> - clean history: The history contains commits from
    November 2018, so
    >> >>>>>> there is a good amount of work, thus a consequent number
    of commits.
    >> >>>>>> They were already squashed but not from September 2019
    >> >>>>>>
    >> >>>>>> I don't think the number of commits should be an
    issue--we shouldn't
    >> >>>>>> just squash years worth of history away. (OTOH, if this
    is a case of
    >> >>>>>> this branch containing lots of little, irrelevant
    commits that would
    >> >>>>>> have normally been squashed away in the normal review
    process we do
    >> >>>>>> for the main branch, then, yes, some cleanup could be nice.)
    >> >>>>>>
    >> >>>>>> Regarding status:
    >> >>>>>>
    >> >>>>>> - the runner passes 89% of the validates runner tests in
    batch mode. We
    >> >>>>>> hope to pass more with the new Encoders
    >> >>>>>>
    >> >>>>>> - Streaming mode is barely started (waiting for the
    multi-aggregations
    >> >>>>>> support in spark SS framework from the Spark community)
    >> >>>>>>
    >> >>>>>> - Runner can execute Nexmark
    >> >>>>>>
    >> >>>>>> - Some things are not wired up yet
    >> >>>>>>
    >> >>>>>>      - Beam Schemas not wired with Spark Schemas
    >> >>>>>>
    >> >>>>>>      - Optional features of the model not implemented: 
    state api, timer
    >> >>>>>> api, splittable doFn api, …
    >> >>>>>>
    >> >>>>>> WDYT, can we merge it to master once the 2 steps are done ?
    >> >>>>>>
    >> >>>>>> I think that as long as it sits parallel to the existing
    runner, and
    >> >>>>>> is clearly marked with its status, it makes sense to me.
    How many
    >> >>>>>> changes does it make to the existing codebase (as
    opposed to add new
    >> >>>>>> code)?
    >> >>>
    >> >>>
    >> >>>
    >> >>

Re: [spark structured streaming runner] merge to master?

Reply via email to