Nice summary, Robert! I really like the transparency on the state of the Go SDK and how it's being used.

It would be great to see the streaming mode improve, because only then will we have a full-blown SDK. It looks like we will need a few more resources on the SDK to bring it up to par with Python.

I agree that cross-language transforms would be the most sensible path to solving the IO problem. The state of the Python SDK does not differ much in this regard because it also suffers from a lack of IO. I think you have seen the recent discussions about how to configure cross-language transforms. For the Python side we have Java's GenerateSequence and KafkaIO working with the portable Flink Runner.

Unfortunately, I'm not a Gopher yet but I'd be happy to exchange ideas or go into more detail about the cross-language capabilities.

Cheers,
Max

On 18.04.19 15:13, Thomas Weise wrote:
Hi Robert,

Thanks a bunch for providing this comprehensive update. This is exactly the kind of perspective I was looking for, even if overall it means that, for potential users of the Go SDK, it is even earlier days than I had hoped.

For more context, my interest was primarily on the streaming side. From the list of missing features you listed, State + Timers + Triggers would probably be highest priority. Unfortunately I won't be able to contribute to the Go SDK anytime soon, so this is mostly fyi in case anyone else does.

On improving the IOs, I think it would make a lot of sense to focus on the cross-language route. There has been some work lately to make existing Beam Java IOs available on the Flink runner (Max would be able to share more details on that).

Thanks!
Thomas


On Wed, Apr 17, 2019 at 9:56 PM Robert Burke wrote:

    Oh dang. Thanks for mentioning that! Here's an open copy of the
    versioning thoughts doc, though there shouldn't be any surprises
    from the points I mentioned above.

    
https://docs.google.com/document/d/1ZjP30zNLWTu_WzkWbgY8F_ZXlA_OWAobAD9PuohJxPg/edit#heading=h.drpipq762xi7

    On Wed, 17 Apr 2019 at 21:20, Nathan Fisher wrote:

        Hi Robert,

        Great summary on the current state of play. FYI, the referenced
        Google doc doesn't appear to be accessible to people outside the
        org by default.

        Great to hear the Go SDK is still getting love. I last looked at
        it in September-October of last year.

        Cheers,
        Nathan

        On Wed, 17 Apr 2019 at 20:27, Lukasz Cwik wrote:

            Thanks for the in-depth summary.

            On Mon, Apr 15, 2019 at 4:19 PM Robert Burke wrote:

                Hi Thomas! I'm so glad you asked!

                The status of the Go SDK is complicated, so this email
                can't be brief. There are several dimensions to
                consider: as a Go Open Source Project, User Libraries
                and Experience, and Beam Features.

                I'm going to be updating the roadmap later this month
                when I have a spare moment.

                *tl;dr;*
                I would *love* help in improving the Go SDK, especially
                around interactions with Java/Python/Flink. Java and I
                do not have a good working relationship for operational
                purposes, and the last time I used Python, I had to
                re-image my machine. There's lots to do, but shouting
                out tasks to the void is rarely as productive as it is
                cathartic. If there's an offer to help, and a preference
                for/experience with something to work on, I'm willing
                to find something useful to get started on for you.

                (Note: The following are simply my opinion as someone
                who works with the project weekly as a Go programmer,
                and should not be treated as demands or gospel. I just
                don't have anyone to talk about Go SDK issues with, and
                my previous discussions have largely seemed to fall on
                uninterested ears.)

                *The SDK can be considered Alpha when all of the
                following are true:*
                * The SDK is tested by the Beam project on a ULR and on
                Flink as well as Dataflow.
                * The IOs have received some love to ensure they can
                scale (either through SDF or reshuffles), and be
                portable to different environments (eg. using the Go
                Cloud Development Kit (CDK) libraries).
                    * Cross-Language IO support would also be acceptable.
                * The SDK is using Go Modules for dependency management,
                marking it as version 0.Minor (where Minor should
                probably track the mainline Beam minor version for now).

                *We can move to calling it Beta when all of the
                following are true:*
                * All implemented Beam features are meaningfully
                tested on the portable runners (e.g. a proper "Validates
                Runner" suite exists in Go).
                * The SDK is properly documented on the Beam site, and
                in its Go Docs.

                After this, I'll be more comfortable recommending it as
                something folks can use for production.
                That said, there are happy paths that are useable today
                in batch situations.

                *Intro*
                The Go SDK is a purely Beam Portable SDK. If it runs on
                a distributed system at all, it's being run portably.
                Currently it's regularly tested on Google Cloud Dataflow
                (though Dataflow doesn't officially support the SDK at
                this time), and on its own single bundle Direct Runner
                (intended for unit testing purposes). In addition, it's
                being tested at scale within Google, on an internal
                runner, where it presently satisfies our performance
                benchmarks, and correctness tests.

                I've been working on cases to make the SDK suitable for
                data processing within Google. This unfortunately makes
                my contributions more towards general SDK usability,
                documentation, and performance, rather than "making it
                usable outside Google". Note this also precludes
                necessary work to resolve issues with running Go SDK
                pipelines on Google Cloud Dataflow. I believe that the
                SDK must become a good member of both the Go ecosystem
                and the Beam ecosystem.

                Improved Go Docs are on their way, and Daniel Oliviera
                has been helping me make the "getting started"
                experience better by improving pipeline construction
                time error messages.

                Finally, many of the following issues have JIRAs already;
                some don't. It would take me time I don't have to audit
                and line everything up for this email, so please look
                before you file JIRAs for things mentioned below, should
                the urge strike you.

                *As a Go Open Source Project*
                As an open source project written in Go, the SDK is
                lagging on adopting Go Modules for Dependency Management
                and Versioning.

                Using Go Modules would ensure that what the Beam project
                infrastructure is testing is what users are getting. I'm
                very happy to elaborate on this, and I wrote a bit about
                the topic two months ago [1]. But I loathe sending out
                plans for things that I don't have time to work on, so
                it's only coming to light now.

                The short points are:
                * Go is opinionated about versioning since Go 1.11, when
                Modules were introduced. They allow for reproducible
                builds with versioned deps, supported by the Go language
                tools.
                * Packages at v1 and greater are beholden to not make
                breaking changes. We're not there with the SDK yet
                (certainly not a 2.11 product), so IMO the SDK should be
                considered v0.X.
                * I don't think it's reasonable to move SDK languages in
                lockstep with the project. Eg. The Go language is
                considering adopting Generics, which may necessitate a
                Major Version Change to the SDK user surface as it's
                modified to support them. It's not reasonable to move
                all of beam to a new version due to a single language
                surface.
                    * This isn't an issue since it reads: the Go SDK
                version X, runs against portable beam runners at version Y.
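
A minimal sketch of the versioning rule in the points above, in plain Go with only the standard library. It encodes Go's semantic import versioning: v0 and v1 modules use the bare module path, v2 and later must append a /vN suffix, and only v0 is free to make breaking changes. The function names and the module path `example.com/beam/sdks/go` are hypothetical, chosen for illustration only.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorOf extracts the major number from a version such as "v0.12.3".
// This is a simplified parser for illustration, not a full semver validator.
func majorOf(version string) (int, error) {
	if !strings.HasPrefix(version, "v") {
		return 0, fmt.Errorf("version %q must start with 'v'", version)
	}
	majorStr := strings.SplitN(version[1:], ".", 2)[0]
	major, err := strconv.Atoi(majorStr)
	if err != nil || major < 0 {
		return 0, fmt.Errorf("version %q has no valid major number", version)
	}
	return major, nil
}

// importPath applies semantic import versioning: v0 and v1 share the bare
// module path, while v2 and later carry a "/vN" suffix.
func importPath(modulePath, version string) (string, error) {
	major, err := majorOf(version)
	if err != nil {
		return "", err
	}
	if major < 2 {
		return modulePath, nil
	}
	return modulePath + "/v" + strconv.Itoa(major), nil
}

// mayBreakAPI reports whether a release at this version is allowed to make
// breaking changes without a new import path. Only v0 is.
func mayBreakAPI(version string) (bool, error) {
	major, err := majorOf(version)
	if err != nil {
		return false, err
	}
	return major == 0, nil
}

func main() {
	// Hypothetical module path, for illustration only.
	const module = "example.com/beam/sdks/go"
	for _, v := range []string{"v0.12.0", "v1.0.0", "v2.0.0"} {
		path, err := importPath(module, v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		breaking, _ := mayBreakAPI(v)
		fmt.Printf("%s -> import %q, breaking changes allowed: %t\n", v, path, breaking)
	}
}
```

Under this rule, a v0.X SDK could keep evolving its user surface, and a later major version bump for the Go SDK would change only the Go import path, not the version of the rest of Beam.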

                See a recent email discussion thread [2] for other
                factors relating to Gradle.

                *User Libraries (IOs, Transforms)*
                There's a lack of testing around the IOs and Transforms
                in the SDK. In some cases, not even unit tests. Very
                little time has been spent by anyone to bring these to
                production quality.

                /The best route to production IOs right now would be to
                work on Cross Language IO support with the Go SDK. I
                imagine it would be similar to what Python is doing./

                The Bounded IOs that exist are largely "toys" not
                written for serious production use. For Bounded cases,
                this is largely due to the lack of SDF or using
                reshuffle judiciously, or leveraging other known
                patterns to scalably read data. You'll note they aren't
                meaningfully tested anywhere either.
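
To illustrate the scaling concern above: a bounded source that is read as a single unit cannot be spread across workers, whereas splitting the read into independent sub-ranges lets each piece be processed in parallel. The following is a plain Go sketch of that splitting idea only. It does not use the Beam SDK, and `offsetRange` and `splitEvenly` are hypothetical names, not SDF APIs.

```go
package main

import "fmt"

// offsetRange is a half-open range [Start, End) of positions in a source,
// for example byte offsets in a file.
type offsetRange struct {
	Start, End int64
}

// splitEvenly divides r into at most n contiguous, non-overlapping pieces
// that together cover r exactly. Piece sizes differ by at most one.
func splitEvenly(r offsetRange, n int) []offsetRange {
	size := r.End - r.Start
	if size <= 0 || n <= 0 {
		return nil
	}
	if int64(n) > size {
		n = int(size) // never produce empty pieces
	}
	out := make([]offsetRange, 0, n)
	start := r.Start
	for i := 0; i < n; i++ {
		length := size / int64(n)
		// Spread the remainder over the first size%n pieces.
		if int64(i) < size%int64(n) {
			length++
		}
		out = append(out, offsetRange{Start: start, End: start + length})
		start += length
	}
	return out
}

func main() {
	// Each piece could be handed to a different worker to read.
	for _, piece := range splitEvenly(offsetRange{Start: 0, End: 10}, 3) {
		fmt.Printf("[%d, %d)\n", piece.Start, piece.End)
	}
}
```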

                For Unbounded IOs, there's only 1 presently, and that's
                the Google Cloud PubSub IO. It's not portable. It can't
                be portable until we've implemented State+Timers, or
                SDFs. At present, it only works on Dataflow, and does so
                with runner substitution. As such, it uses the same
                pubsub connector that Streaming Dataflow jobs use.
                Interestingly, this means it can scale properly, and is
                technically the only one that can scale properly.
                Unfortunately, it only works on Dataflow.

                My work on using the Beam Go SDK inside Google uses a
                variant of Cross Language IO. This is one reason why I
                haven't spent any time on the IOs, because they aren't
                necessary inside Google, and there's not been a usecase
                I could contrive to spend the time to fix them up so far.

                *General SDK Code Quality*
                In my opinion the SDK is presently reasonable on general
                code quality. Most critical aspects have tests, and from
                Google internal testing on complex and large amounts of
                data, the SDK is performant, once a bit of code
                generation is done to avoid reflection on the hot path.

                Various combinations of features should be vetted
                together better. Eg. Using composites wrapping various
                other beam primitives. This was an issue resolved
                recently for CoGBKs.

                *Beam Features*
                The SDK is largely usable for Batch Pipelines. I know
                this since that's what I'm ensuring is the case for a
                Google internal runner. I know the following "classes of
                feature" work for the batch use cases, to varying levels
                of documentation and testing.
                * DoFns
                * CombineFns
                   * Combiner Lifting
                * CoGroupByKey (Joins)
                * Side Inputs
                * User Defined Coders
                * Global Windows
                * User Metrics (though they need to move to the new beam
                Metrics protos)
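
As an illustration of why CombineFns and Combiner Lifting appear together in the list above: a combine expressed through an accumulator can be partially applied on each bundle before the shuffle, and the partial accumulators merged afterwards. The sketch below mirrors the usual CombineFn lifecycle (create accumulator, add input, merge accumulators, extract output) in plain Go. It does not use the Beam SDK, and every name in it is hypothetical.

```go
package main

import "fmt"

// meanAccum is the accumulator for computing a mean: it carries enough
// state (sum and count) to be merged with other partial results.
type meanAccum struct {
	Sum   float64
	Count int64
}

// createAccumulator returns the empty accumulator.
func createAccumulator() meanAccum { return meanAccum{} }

// addInput folds one element into an accumulator.
func addInput(a meanAccum, v float64) meanAccum {
	return meanAccum{Sum: a.Sum + v, Count: a.Count + 1}
}

// mergeAccumulators combines two partial results. Because this is
// associative and commutative, partial combining per bundle is safe.
func mergeAccumulators(a, b meanAccum) meanAccum {
	return meanAccum{Sum: a.Sum + b.Sum, Count: a.Count + b.Count}
}

// extractOutput turns the final accumulator into the result.
// An empty accumulator yields 0 in this sketch.
func extractOutput(a meanAccum) float64 {
	if a.Count == 0 {
		return 0
	}
	return a.Sum / float64(a.Count)
}

// combineBundles simulates lifting: each bundle is pre-combined on its
// own, and only the small accumulators are merged afterwards.
func combineBundles(bundles [][]float64) float64 {
	merged := createAccumulator()
	for _, bundle := range bundles {
		partial := createAccumulator()
		for _, v := range bundle {
			partial = addInput(partial, v)
		}
		merged = mergeAccumulators(merged, partial)
	}
	return extractOutput(merged)
}

func main() {
	bundles := [][]float64{{1, 2, 3}, {4, 5}}
	fmt.Println("mean:", combineBundles(bundles))
}
```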

                Streaming is another story. The following aren't implemented
                * State + Timers + Triggers
                   * Necessary for portable pubsub IOs for example.
                * SDFs aren't implemented yet
                    * Necessary for
                * Windows
                    * Session Windows
                    * Custom WindowFns

                I haven't run anything in streaming mode, so there are
                likely other features and considerations I'm missing.

                The following are implemented but not meaningfully tested
                * Windows
                   * Fixed Windowing
                    * Sliding Windows
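
For readers unfamiliar with the two windowing strategies just listed, here is a plain Go sketch of how an element timestamp maps to windows. It follows the common definition (fixed windows tile time without overlap; sliding windows of a given size start every period and may overlap) and is not the SDK's implementation. The names are hypothetical, timestamps are plain integers in arbitrary units, and size and period are assumed to be positive.

```go
package main

import "fmt"

// window is a half-open interval [Start, End) in arbitrary time units.
type window struct {
	Start, End int64
}

// floorMod returns a mod b with a non-negative result for b > 0,
// so that negative timestamps are assigned correctly.
func floorMod(a, b int64) int64 {
	m := a % b
	if m < 0 {
		m += b
	}
	return m
}

// fixedWindow returns the single fixed window of the given size
// that contains ts. Assumes size > 0.
func fixedWindow(ts, size int64) window {
	start := ts - floorMod(ts, size)
	return window{Start: start, End: start + size}
}

// slidingWindows returns every window of the given size, starting at
// multiples of period, that contains ts. Assumes size > 0 and period > 0.
// Windows are returned from the latest start to the earliest.
func slidingWindows(ts, size, period int64) []window {
	var out []window
	lastStart := ts - floorMod(ts, period)
	for start := lastStart; start > ts-size; start -= period {
		out = append(out, window{Start: start, End: start + size})
	}
	return out
}

func main() {
	ts := int64(65)
	fmt.Println("fixed:", fixedWindow(ts, 60))
	fmt.Println("sliding:", slidingWindows(ts, 60, 30))
}
```

With a size of 60 and a period of 30, each element lands in two sliding windows, which is the kind of behavior a windowing test suite would need to assert.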

                Others, like Large Iterables Support or Schemas, are not
                yet implemented either. There are likely others, but I'd
                need to list everything from the compatibility matrix.

                *What I'm spending my time on*
                Documenting, and debugging Google-internal user issues.
                The following artifacts will be produced externally in
                the next few months:
                * Improved user documentation/programming on the Go SDK
                (targeted to folks who know Go, but not Beam, or any
                distributed programming).
                * An SDK contribution guide to be put on the Wiki,
                focusing on "Life of a Pipeline" from the user
                controller to the worker perspective, and mapping each of
                those parts to where the SDK deals with them. This should
                enable others to contribute Beam features to the SDK.
                * The Versioning Issue mentioned above, it's finicky.
                * Large (State Backed) Iterable Support

                *What I'd love help with*
                1. Getting the existing suite of SDK integration tests
                running against a ULR or Flink (there are JIRAs for these).
                2. Improving existing IOs, adding tests for existing
                features over adding new ones.
                    a) Migrate the existing IOs to use the Go CDK where
                possible (needs to wait for the
                Versioning/GoModules/Gradle issue to be resolved though).

                Your friendly neighbourhood Distributed Gopher Wrangler,
                Robert Burke (@lostluck)

                [1]
                
https://docs.google.com/document/d/1nB5qCarN0jmo40zH1J0icZa6Wyb0v4u08AdG4WDTTEY/edit

                [2]
                
https://lists.apache.org/thread.html/8952f546b449ce8682db221e7688db546e25145c31cd835ed88ad172@%3Cdev.beam.apache.org%3E

                On Sat, 13 Apr 2019 at 11:30, Thomas Weise wrote:

                    How "experimental" is the Go SDK? What are the major
                    work items to reach MVP? How close are we to being
                    able to run, let's say, wordcount on the portable
                    Flink runner?

                    How current is the roadmap [1]? JIRA [2] could
                    suggest that there is a lot of work left to do?

                    Thanks,
                    Thomas

                    [1] https://beam.apache.org/roadmap/go-sdk/
                    [2]
                    
https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20and%20component%20%3D%20sdk-go%20and%20resolution%20%3D%20Unresolved%20



-- Nathan Fisher
          w: http://junctionbox.ca/
