Re: [spark structured streaming runner] merge to master?

Etienne Chauchot Tue, 29 Oct 2019 07:33:07 -0700

Hi Alexey,

Thanks for your opinion !


Comments inline

Etienne

On 28/10/2019 17:34, Alexey Romanenko wrote:

Let me share some of my thoughts on this.
    - shall we filter out the package name from the release?
Until the new runner is ready to be used in production (or at least for beta testing, with users clearly warned in that case), I believe we need to filter out its classes from the published jar to avoid confusion.

Yes that is what I think also

    - should we release 2 jars: one for the old and one for the new ?
    - should we release 3 jars: one for the old, one for the new and one for both ?
Once the new runner is released, I think we need to provide only one single jar and allow users to switch between the different Spark runners with a CLI option.

I would vote for 3 jars: one for new, one for old, and one for both. Indeed, in some cases, users look very closely at the size of jars. This solution meets all use cases.

    - should we create a special entry to the capability matrix ?
Sure, since it has its own unique characteristics and implementation, but again, only once the new runner is "officially released".

+1

On 28 Oct 2019, at 10:27, Etienne Chauchot <echauc...@apache.org> wrote:


Hi guys,

Any opinions on point 2, the communication to users?

Etienne

On 24/10/2019 15:44, Etienne Chauchot wrote:


Hi guys,

I'm glad to announce that the PR to merge the new runner, based on the Spark Structured Streaming framework, to master has been submitted:


https://github.com/apache/beam/pull/9866


1. Regarding the status of the runner:

- The runner passes 93% of the validates runner tests in batch mode.

- Streaming mode is barely started (waiting for multi-aggregation support in the Spark Structured Streaming framework from the Spark community).

- The runner can execute Nexmark.

- Some things are not wired up yet:

    - Beam Schemas are not wired with Spark Schemas

    - Optional features of the model are not implemented: state api, timer api, splittable doFn api, …



2. Regarding the communication to users:

- For reasons explained by Ismaël, the runner is in the same module as the "older" one, but in a different sub-package, and both runners share the same build.


- How should we communicate to users:

    - shall we filter out the package name from the release?

    - should we release 2 jars: one for the old and one for the new ?

    - should we release 3 jars: one for the old, one for the new and one for both ?


    - should we create a special entry to the capability matrix ?

WDYT ?

Best

Etienne


On 23/10/2019 19:11, Mikhail Gryzykhin wrote:

+1 to merge.

It is worth keeping things in master with an explicitly marked status. It will make the effort more visible to users and make it easier to get feedback.


--Mikhail

On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <echauc...@apache.org> wrote:


    Hi guys,

    The new spark runner now supports beam coders and passes 93% of
    the batch validates runner tests (+4%). I think it is time to
    merge it to master. I will submit a PR in the coming days.

    Next steps: support schemas and thus better leverage the Catalyst
    optimizer (among other things, optimizations based on data), and
    port the performance optimizations that were done in the current
    runner.

    Best

    Etienne

    On 11/10/2019 22:48, Pablo Estrada wrote:

    +1 for merging : )

    On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw
    <rober...@google.com> wrote:

        Sounds like a good plan to me.

        On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot
        <echauc...@apache.org> wrote:

            Comments inline

            On 10/10/2019 23:44, Ismaël Mejía wrote:

            +1

            The earlier we get to master the better, to encourage not only code
            contributions but, just as important, early user feedback.

            Question is: do we keep the "old" spark runner for a while or not
            (or just keep it in a previous version/tag on git)?

            It is still too early to even start discussing when to remove the
            classical runner given that the new runner is still a WIP. However,
            the overall goal is that this runner becomes the de facto one once
            the VR tests and the performance become at least equal to the
            classical runner. In the meantime, the best for users is that they
            co-exist. Let’s not forget that the other runner has already been
            battle tested for more than 3 years and has had lots of
            improvements in the last year.


            +1 on what Ismaël says: no removal any time soon.

            The plan I had in mind at first (which I showed at
            ApacheCon) was this one, but I'm proposing moving the first
            gray label to before the red box.

            <beogijnhpieapoll.png>

            I don't think the number of commits should be an issue--we shouldn't
            just squash years worth of history away. (OTOH, if this is a case of
            this branch containing lots of little, irrelevant commits that would
            have normally been squashed away in the normal review process we do
            for the main branch, then, yes, some cleanup could be nice.)

            About the commits: we should encourage a clear history, but we also
            have to remove useless commits that are still present in the branch,
            such as commits of the “Fix errorprone” / “Cleaning” kind. Even
            commits that make better narrative sense together should probably be
            squashed, because they do not bring much to the history. It is not
            about more or fewer commits, it is about their relevance, as Robert
            mentions.

            I think our experiences with things that go to master early have
            been very good. So I am in favor ASAP. We can exclude it from
            releases easily until it is ready for end users.
            I have the same question as Robert - how much is modifications and
            how much is new? I notice it is in a subdirectory of the
            beam-runners-spark module.

            In its current form we cannot exclude it, but this relates to the
            other question, so it is better to explain a bit of history: the new
            runner used to live in its own module and subdirectory because it is
            a full blank-page rewrite, and the decision was not to use any of
            the classical runner classes so as not to be constrained by its
            evolution.

            However, the reason to put it back in the same module as a
            subdirectory was to encourage early use. In more detail: the way you
            deploy Spark jobs today is usually by packaging and staging an uber
            jar (~200MB of pure dependency joy) that contains the user pipeline
            classes, the Spark runner module and its dependencies. If we had two
            Spark runners in separate modules, users would need to repackage and
            redeploy their pipelines every time they wanted to switch from the
            classical Spark runner to the structured streaming runner, which is
            painful and time- and space-consuming compared with the one-module
            approach, where they just change the name of the runner class and
            that’s it. The idea here is to make it easy for users to test the
            new runner, but at the same time to make it easy to come back to the
            classical runner in case of any issue.
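
            The one-module approach described above can be pictured with a
            small sketch. This is only an illustration under assumptions: the
            helper function, the jar name and the main class are hypothetical,
            and the thread does not name the new runner class, so
            `SparkStructuredStreamingRunner` is assumed here. The point is that
            the staged uber jar stays the same and only the `--runner` option
            changes.

```python
# Hypothetical helper: builds a spark-submit command line for a Beam pipeline
# packaged as a single uber jar. Only the --runner option differs per runner.
def spark_submit_command(uber_jar, main_class, runner):
    return [
        "spark-submit",
        "--class", main_class,   # entry point of the user pipeline
        uber_jar,                # same jar for both runners
        f"--runner={runner}",    # Beam pipeline option selecting the runner
    ]

# Assumed names for illustration only.
JAR = "my-pipeline-bundled.jar"
MAIN = "com.example.MyPipeline"

classic = spark_submit_command(JAR, MAIN, "SparkRunner")
structured = spark_submit_command(JAR, MAIN, "SparkStructuredStreamingRunner")

print(" ".join(classic))
print(" ".join(structured))
```

            With two separate modules, the jar argument would also have to
            change between the two commands, which means rebuilding and
            restaging the uber jar for every switch.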

            Ismaël

            On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <k...@apache.org> wrote:

            +1

            I think our experiences with things that go to master early have
            been very good. So I am in favor ASAP. We can exclude it from
            releases easily until it is ready for end users.

            I have the same question as Robert - how much is modifications and
            how much is new? I notice it is in a subdirectory of the
            beam-runners-spark module.

            I did not see any major changes to dependencies, but I will also
            ask: does it have major version differences, such that you might
            want a separate artifact?

            Kenn

            On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <rober...@google.com> wrote:

            On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <echauc...@apache.org> wrote:

            Hi guys,

            You probably know that work has been under way for several months
            on a new Spark runner based on the Spark Structured Streaming
            framework. This work is located in a feature branch here:
            https://github.com/apache/beam/tree/spark-runner_structured-streaming

            To attract more contributors and get some user feedback, we think it
            is time to merge it to master. Before doing so, some steps need to
            be completed:

            - finish the work on Spark Encoders (which allow calling Beam
            coders) because, right now, the runner is in an unstable state (some
            transforms use the new way of doing ser/de and some use the old one,
            making a pipeline inconsistent with respect to serialization)

            - clean history: the history contains commits from November 2018, so
            there is a good amount of work and thus a substantial number of
            commits. They have already been squashed, except for those since
            September 2019

            I don't think the number of commits should be an issue--we shouldn't
            just squash years worth of history away. (OTOH, if this is a case of
            this branch containing lots of little, irrelevant commits that would
            have normally been squashed away in the normal review process we do
            for the main branch, then, yes, some cleanup could be nice.)

            Regarding status:

            - the runner passes 89% of the validates runner tests in batch mode.
            We hope to pass more with the new Encoders

            - Streaming mode is barely started (waiting for the
            multi-aggregations support in spark SS framework from the Spark
            community)

            - Runner can execute Nexmark

            - Some things are not wired up yet

                  - Beam Schemas not wired with Spark Schemas

                  - Optional features of the model not implemented: state api,
                  timer api, splittable doFn api, …

            WDYT, can we merge it to master once the 2 steps are done ?

            I think that as long as it sits parallel to the existing runner, and
            is clearly marked with its status, it makes sense to me. How many
            changes does it make to the existing codebase (as opposed to adding
            new code)?
