Hi Kenn,
Comments inline
On 10/10/2019 21:02, Kenneth Knowles wrote:
+1
I think our experiences with things that go to master early have been
very good, so I am in favor of doing it ASAP. We can easily exclude it
from releases until it is ready for end users.
I have mixed feelings about exclusion: I'd like users to try it for
feedback, but I don't want to confuse them or make promises about
feature completeness. How do you think we should proceed?
I have the same question as Robert: how much of it is modifications and
how much is new code? I notice it is in a subdirectory of the
beam-runners-spark module.
See my answers to Robert below.
I did not see any major changes to dependencies, but I will also ask:
does it have major version differences such that you might want a
separate artifact?
There is no dependency change at all; the two runners are in sync on
deps/versions. The trick is that Spark 2.4.x provides both the
RDD/DStream APIs and Structured Streaming.
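For illustration, a minimal sketch (just the standard Spark 2.4.x API,
not code from the branch) showing that the same spark-core/spark-sql
artifacts that beam-runners-spark already depends on expose both the RDD
API and the Dataset API that Structured Streaming builds on, so no extra
dependency is needed:

    // Both worlds come from the same Spark 2.4.x artifacts.
    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DualApiSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[2]")
            .appName("dual-api-sketch")
            .getOrCreate();

        // The existing runner's world: the RDD API (DStreams build on
        // the same artifacts).
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        JavaRDD<Integer> rdd = jsc.parallelize(Arrays.asList(1, 2, 3));

        // The new runner's world: the Dataset API used by Structured
        // Streaming.
        Dataset<Row> rows = spark.range(3).toDF();

        System.out.println(rdd.count() + " / " + rows.count());
        spark.stop();
      }
    }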
Etienne
Kenn
On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <[email protected]> wrote:
On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <[email protected]> wrote:
>
> Hi guys,
>
> You probably know that for several months there has been work on
> developing a new Spark runner based on the Spark Structured Streaming
> framework. This work is located in a feature branch here:
> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>
> To attract more contributors and get some user feedback, we think it is
> time to merge it to master. Before doing so, some steps need to be
> completed:
>
> - finish the work on Spark Encoders (which allow calling Beam coders),
> because right now the runner is in an unstable state: some transforms
> use the new way of doing ser/de and some use the old one, making a
> pipeline inconsistent with respect to serialization (see the sketch
> after this list)
>
> - clean up the history: it contains commits going back to November 2018,
> so there is a good amount of work and thus a large number of commits.
> They were already squashed, but not those from September 2019 onward.
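A minimal sketch of the Encoder idea from the first step above, assuming
a hypothetical helper rather than the branch's actual classes: the point
is simply that Spark ser/de delegates to the Beam Coder, so elements
travel as bytes produced and consumed by that coder.

    // Hypothetical helper for illustration (not the branch's Encoder
    // code): Spark serialization delegates to the Beam Coder.
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;

    public class BeamCoderRoundTrip {

      // What a Beam-coder-backed Spark Encoder does on the write side.
      static <T> byte[] encode(Coder<T> coder, T value) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        coder.encode(value, out);
        return out.toByteArray();
      }

      // ...and on the read side.
      static <T> T decode(Coder<T> coder, byte[] bytes) throws IOException {
        return coder.decode(new ByteArrayInputStream(bytes));
      }

      public static void main(String[] args) throws IOException {
        Coder<String> coder = StringUtf8Coder.of();
        byte[] bytes = encode(coder, "hello");
        System.out.println(decode(coder, bytes)); // prints "hello"
      }
    }

However this byte-level path ends up wrapped in a Spark Encoder, finishing
that work would mean every transform takes the same serialization route,
removing the mixed old/new state described above.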
I don't think the number of commits should be an issue; we shouldn't
just squash years' worth of history away. (OTOH, if this is a case of
the branch containing lots of little, irrelevant commits that would
normally have been squashed away in the review process we do for the
main branch, then, yes, some cleanup could be nice.)
> Regarding status:
>
> - the runner passes 89% of the ValidatesRunner tests in batch mode. We
> hope to pass more with the new Encoders
>
> - Streaming mode is barely started (we are waiting for
> multi-aggregation support in the Spark Structured Streaming framework
> from the Spark community)
>
> - Runner can execute Nexmark
>
> - Some things are not wired up yet:
>
>   - Beam Schemas are not wired to Spark Schemas
>
>   - Optional features of the model are not implemented: state API,
>     timer API, splittable DoFn API, …
>
> WDYT, can we merge it to master once the two steps above are done?
As long as it sits parallel to the existing runner and is clearly
marked with its status, this makes sense to me. How many changes does
it make to the existing codebase (as opposed to adding new code)?