Re: [spark structured streaming runner] merge to master?

Jean-Baptiste Onofré Tue, 29 Oct 2019 21:59:51 -0700

I agree, it would be the easiest way and allow users to switch easily aswell using a single artifact.


Regards
JB


On 29/10/2019 23:54, Kenneth Knowles wrote:

Is it just as easy to have two jars and build an uber jar with bothincluded? Then the runner can still be toggled with a flag.


Kenn

On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko<[email protected] <mailto:[email protected]>> wrote:


    Hmm, I don’t think that jar size should play a big role comparing
    to the whole size of shaded jar of users job. Even more, I think
    it will be quite confusing for users to choose which jar to use if
    we will have 3 different ones for similar purposes. Though, let’s
    see what others think.

    On 29 Oct 2019, at 15:32, Etienne Chauchot <[email protected]
    <mailto:[email protected]>> wrote:

    Hi Alexey,

    Thanks for your opinion !

    Comments inline

    Etienne

    On 28/10/2019 17:34, Alexey Romanenko wrote:

    Let me share some of my thoughts on this.


        - shall we filter out the package name from the release?

    Until new runner is not ready to be used in production (or, at
    least, be used for beta testing but users should be clearly
    warned about that in this case), I believe we need to filter out
    its classes from published jar to avoid a confusion.

    Yes that is what I think also


        - should we release 2 jars: one for the old and one for
    the new ?

        - should we release 3 jars: one for the new, one for the
    new and one for both ?

    Once new runner will be released, then I think we need to
    provide only one single jar and allow user to switch between
    different Spark runners with CLI option.


    I would vote for 3 jars: one for new, one for old, and one for
    both. Indeed, in some cases, users are looking very closely at
    the size of jars. This solution meets all use cases

        - should we create a special entry to the capability matrix ?

    Sure, since it has its own uniq characteristics and
    implementation, but again, only once new runner will be
    "officially released".

+1

    On 28 Oct 2019, at 10:27, Etienne Chauchot
    <[email protected] <mailto:[email protected]>> wrote:

    Hi guys,

    Any opinions on the point2 communication to users ?

    Etienne

    On 24/10/2019 15:44, Etienne Chauchot wrote:


    Hi guys,

    I'm glad to announce that the PR for the merge to master of
    the new runner based on Spark Structured Streaming framework
    is submitted:

    https://github.com/apache/beam/pull/9866


    1. Regarding the status of the runner:

    -the runner passes 93% of the validates runner tests in batch
    mode.

    -Streaming mode is barely started (waiting for the
    multi-aggregations support in spark Structured Streaming
    framework from the Spark community)

    -Runner can execute Nexmark

    -Some things are not wired up yet

      -Beam Schemas not wired with Spark Schemas

      -Optional features of the model not implemented: state api,
    timer api, splittable doFn api, …


    2. Regarding the communication to users:

    - for reasons explained by Ismael: the runner is in the same
    module as the "older" one. But it is in a different
    sub-package and both runners share the same build.

    - How should we communicate to users:

        - shall we filter out the package name from the release?

        - should we release 2 jars: one for the old and one for
    the new ?

        - should we release 3 jars: one for the new, one for the
    new and one for both ?

        - should we create a special entry to the capability matrix ?

    WDYT ?

    Best

    Etienne


    On 23/10/2019 19:11, Mikhail Gryzykhin wrote:

    +1 to merge.

    It is worth keeping things in master with explicitly marked
    status. It will make effort more visible to users and easier
    to get feedback upon.

    --Mikhail

    On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot
    <[email protected] <mailto:[email protected]>> wrote:

        Hi guys,

        The new spark runner now supports beam coders and passes
        93% of the batch validates runner tests (+4%). I think it
        is time to merge it to master. I will submit a PR in the
        coming days.

        next steps: support schemas and thus better leverage
        catalyst optimizer (among other things optims based on
        data), port perfs optims that were done in the current
        runner.

        Best

        Etienne

        On 11/10/2019 22:48, Pablo Estrada wrote:

        +1 for merging : )

        On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw
        <[email protected] <mailto:[email protected]>> wrote:

            Sounds like a good plan to me.

            On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot
            <[email protected] <mailto:[email protected]>>
            wrote:

                Comments inline

                On 10/10/2019 23:44, Ismaël Mejía wrote:

                +1

                The earlier we get to master the better to encourage not only 
code
                contributions but as important to have early user feedback.

                Question is: do we keep the "old" spark runner for a while or 
not (or just keep on previous version/tag on git) ?

                It is still too early to even start discussing when to remove 
the
                classical runner given that the new runner is still a WIP. 
However the
                overall goal is that this runner becomes the de-facto one once 
the VR
                tests and the performance become at least equal to the classical
                runner, in the meantime the best for users is that they 
co-exist,
                let’s not forget that the other runner has been already battle 
tested
                for more than 3 years and has had lots of improvements in the 
last
                year.


                +1 on what Ismael says: no soon removal,

                The plan I had in mind at first (that I showed
                at the apacheCon) was this but I'm proposing
                moving the first gray label to before the red box.

                <beogijnhpieapoll.png>

                I don't think the number of commits should be an issue--we 
shouldn't
                just squash years worth of history away. (OTOH, if this is a 
case of
                this branch containing lots of little, irrelevant commits that 
would
                have normally been squashed away in the normal review process 
we do
                for the main branch, then, yes, some cleanup could be nice.)

                About the commits we should encourage a clear history but we 
have also
                to remove useless commits that are still present in the branch,
                commits of the “Fix errorprone” / “Cleaning” kind and even 
commits
                that make a better narrative sense together should be probably
                squashed, because they do not bring much to the history. It is 
not
                about more or less commits it is about its relevance as Robert
                mentions.

                I think our experiences with things that go to master early 
have been very good. So I am in favor ASAP. We can exclude it from releases 
easily until it is ready for end users.
                I have the same question as Robert - how much is modifications 
and how much is new? I notice it is in a subdirectory of the beam-runners-spark 
module.

                In its current form we cannot exclude it but this relates to 
the other
                question, so better to explain a bit of history: The new runner 
used
                to live in its own module and subdirectory because it is a full 
blank
                page rewrite and the decision was not to use any of the 
classical
                runner classes to not be constrained by its evolution.

                However the reason to put it back in the same module as a 
subdirectory
                was to encourage early use, in more detail: The way you deploy 
spark
                jobs today is usually by packaging and staging an uber jar 
(~200MB of
                pure dependency joy) that contains the user pipeline classes, 
the
                spark runner module and its dependencies. If we have two spark 
runners
                in separate modules the user would need to repackage and 
redeploy
                their pipelines every time they want to switch from the 
classical
                Spark runner to the structured streaming runner which is 
painful and
                time and space consuming compared with the one module approach 
where
                they just change the name of the runner class and that’s it. 
The idea
                here is to make easy for users to test the new runner, but at 
the same
                time to make easy to come back to the classical runner in case 
of any
                issue.

                Ismaël

                On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles<[email protected]>  
<mailto:[email protected]>  wrote:

                +1

                I think our experiences with things that go to master early 
have been very good. So I am in favor ASAP. We can exclude it from releases 
easily until it is ready for end users.

                I have the same question as Robert - how much is modifications 
and how much is new? I notice it is in a subdirectory of the beam-runners-spark 
module.

                I did not see any major changes to dependencies but I will also 
ask if it has major version differences so that you might want a separate 
artifact?

                Kenn

                On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw<[email protected]> 
 <mailto:[email protected]>  wrote:

                On Thu, Oct 10, 2019 at 12:39 AM Etienne 
Chauchot<[email protected]>  <mailto:[email protected]>  wrote:

                Hi guys,

                You probably know that there has been for several months an work
                developing a new Spark runner based on Spark Structured 
Streaming
                framework. This work is located in a feature branch here:
                
https://github.com/apache/beam/tree/spark-runner_structured-streaming

                To attract more contributors and get some user feedback, we 
think it is
                time to merge it to master. Before doing so, some steps need to 
be achieved:

                - finish the work on spark Encoders (that allow to call Beam 
coders)
                because, right now, the runner is in an unstable state (some 
transforms
                use the new way of doing ser/de and some use the old one, 
making a
                pipeline incoherent toward serialization)

                - clean history: The history contains commits from November 
2018, so
                there is a good amount of work, thus a consequent number of 
commits.
                They were already squashed but not from September 2019

                I don't think the number of commits should be an issue--we 
shouldn't
                just squash years worth of history away. (OTOH, if this is a 
case of
                this branch containing lots of little, irrelevant commits that 
would
                have normally been squashed away in the normal review process 
we do
                for the main branch, then, yes, some cleanup could be nice.)

                Regarding status:

                - the runner passes 89% of the validates runner tests in batch 
mode. We
                hope to pass more with the new Encoders

                - Streaming mode is barely started (waiting for the 
multi-aggregations
                support in spark SS framework from the Spark community)

                - Runner can execute Nexmark

                - Some things are not wired up yet

                      - Beam Schemas not wired with Spark Schemas

                      - Optional features of the model not implemented:  state 
api, timer
                api, splittable doFn api, …

                WDYT, can we merge it to master once the 2 steps are done ?

                I think that as long as it sits parallel to the existing 
runner, and
                is clearly marked with its status, it makes sense to me. How 
many
                changes does it make to the existing codebase (as opposed to 
add new
                code)?

Re: [spark structured streaming runner] merge to master?

Reply via email to