Hi Robert,

Great summary on the current state of play. FYI, the referenced Google Doc
isn't visible to people outside the org by default.

Great to hear the Go SDK is still getting love. I last looked at it in
September-October of last year.

Cheers,
Nathan

On Wed, 17 Apr 2019 at 20:27, Lukasz Cwik <[email protected]> wrote:

> Thanks for the in-depth summary.
>
> On Mon, Apr 15, 2019 at 4:19 PM Robert Burke <[email protected]> wrote:
>
>> Hi Thomas! I'm so glad you asked!
>>
>> The status of the Go SDK is complicated, so this email can't be brief.
>> There are several dimensions to consider: as a Go Open Source Project,
>> User Libraries and Experience, and Beam Features.
>>
>> I'm going to be updating the roadmap later this month when I have a spare
>> moment.
>>
>> *tl;dr;*
>> I would *love* help in improving the Go SDK, especially around
>> interactions with Java/Python/Flink. Java and I do not have a good working
>> relationship for operational purposes, and the last time I used Python, I
>> had to re-image my machine. There's lots to do, but shouting out tasks to
>> the void is rarely as productive as it is cathartic. If there's an offer to
>> help, and a preference for/experience with something to work on, I'm
>> willing to find something useful to get started on for you.
>>
>> (Note: The following are simply my opinion as someone who works with the
>> project weekly as a Go programmer, and should not be treated as demands or
>> gospel. I just don't have anyone to talk about Go SDK issues with, and my
>> previous discussions have largely seemed to fall on uninterested ears.)
>>
>> *The SDK can be considered Alpha when all of the following are true:*
>> * The SDK is tested by the Beam project on a ULR and on Flink as well as
>> Dataflow.
>> * The IOs have received some love to ensure they can scale (either
>> through SDF or reshuffles), and be portable to different environments (eg.
>> using the Go Cloud Development Kit (CDK) libraries).
>>    * Cross-Language IO support would also be acceptable.
>> * The SDK is using Go Modules for dependency management, marking it as
>> version 0.Minor (where Minor should probably track the mainline Beam minor
>> version for now).
>>
>> *We can move to calling it Beta when all of the following are true:*
>> * All implemented Beam features are meaningfully tested on the
>> portable runners (e.g. a proper "Validates Runner" suite exists in Go).
>> * The SDK is properly documented on the Beam site, and in its Go Docs.
>>
>> After this, I'll be more comfortable recommending it as something folks
>> can use for production.
>> That said, there are happy paths that are usable today in batch
>> situations.
>>
>> *Intro*
>> The Go SDK is a purely Beam Portable SDK. If it runs on a distributed
>> system at all, it's being run portably. Currently it's regularly tested on
>> Google Cloud Dataflow (though Dataflow doesn't officially support the SDK
>> at this time), and on its own single bundle Direct Runner (intended for
>> unit testing purposes). In addition, it's being tested at scale within
>> Google, on an internal runner, where it presently satisfies our performance
>> benchmarks, and correctness tests.
>>
>> I've been working on cases to make the SDK suitable for data processing
>> within Google. This unfortunately makes my contributions more towards
>> general SDK usability, documentation, and performance, rather than "making
>> it usable outside Google". Note this also precludes necessary work to
>> resolve issues with running Go SDK pipelines on Google Cloud Dataflow. I
>> believe that the SDK must become a good member of both the Go ecosystem and
>> the Beam ecosystem.
>>
>> Improved Go Docs are on their way, and Daniel Oliviera has been helping
>> me make the "getting started" experience better by improving pipeline
>> construction time error messages.
>>
>> Finally, many of the following issues have JIRAs already; some don't. It
>> would take me time I don't have to audit and line everything up for this
>> email, so please look before you file JIRAs for things mentioned below,
>> should the urge strike you.
>>
>>
>> *As a Go Open Source Project*
>> As an open source project written in Go,
>> the SDK is lagging on adopting Go Modules for Dependency Management and
>> Versioning.
>>
>> Using Go Modules would ensure that what the Beam project
>> infrastructure is testing is what users are getting. I'm very happy to
>> elaborate on this, and have a bit I wrote on the topic two months
>> ago [1]. But I loathe sending out plans for things that I don't have time
>> to work on, so it's only coming to light now.
>>
>> The short points are:
>> * Go is opinionated about versioning since Go 1.11, when Modules were
>> introduced. They allow for reproducible builds with versioned deps,
>> supported by the Go language tools.
>> * Packages at v1 and greater are beholden to not make breaking changes.
>> We're not there with the SDK yet (certainly not a 2.11 product), so IMO the
>> SDK should be considered v0.X.
>> * I don't think it's reasonable to move SDK languages in lockstep with
>> the project. E.g. the Go language is considering adopting Generics, which
>> may necessitate a Major Version Change to the SDK user surface as it's
>> modified to support them. It's not reasonable to move all of Beam to a new
>> version due to a single language surface.
>>    * This isn't an issue since it reads: the Go SDK version X runs
>> against portable Beam runners at version Y.
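
A rough illustration of the versioning points above, offered as a sketch rather than anything taken from the Beam repository: under Go Modules, major versions 0 and 1 share the bare module path, while v2 and later carry a /vN suffix in the import path, which is why a major version change is visible to every user. The module path example.com/beam-go and both helper names are hypothetical, and special cases such as gopkg.in paths are not handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorVersion extracts the major number from a tag such as "v0.11.0".
// It returns an error for tags that do not start with "v<digits>".
func majorVersion(tag string) (int, error) {
	if !strings.HasPrefix(tag, "v") {
		return 0, fmt.Errorf("tag %q does not start with v", tag)
	}
	head := strings.SplitN(tag[1:], ".", 2)[0]
	return strconv.Atoi(head)
}

// importPath returns the path users would import for a given tag,
// following the Go Modules rule that v2 and later append a /vN suffix.
// modulePath is a hypothetical module path used only for illustration.
func importPath(modulePath, tag string) (string, error) {
	major, err := majorVersion(tag)
	if err != nil {
		return "", err
	}
	if major < 2 {
		return modulePath, nil
	}
	return fmt.Sprintf("%s/v%d", modulePath, major), nil
}

func main() {
	for _, tag := range []string{"v0.11.0", "v1.0.0", "v2.0.0"} {
		p, err := importPath("example.com/beam-go", tag)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", tag, p)
	}
}
```

Staying at v0.X keeps the import path stable while signalling that breaking changes are still possible.
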
>>
>> See a recent email discussion thread [2] for other factors relating to
>> Gradle.
>>
>> *User Libraries (IOs, Transforms)*
>> There's a lack of testing around the IOs and Transforms in the SDK. In
>> some cases, not even unit tests. Very little time has been spent by anyone
>> to bring these to production quality.
>>
>> *The best route to production IOs right now would be to work on Cross
>> Language IO support with the Go SDK. I imagine it would be similar to what
>> Python is doing.*
>>
>> The Bounded IOs that exist are largely "toys" not written for serious
>> production use. For Bounded cases, this is largely because they lack SDF,
>> don't use reshuffle judiciously, and don't leverage other known patterns to
>> scalably read data. You'll note they aren't meaningfully tested anywhere
>> either.
>>
>> For Unbounded IOs, there's only 1 presently, and that's the Google Cloud
>> PubSub IO. It's not portable. It can't be portable until we've implemented
>> State+Timers, or SDFs. At present, it only works on Dataflow, and does so
>> with runner substitution. As such, it uses the same pubsub connector that
>> Streaming Dataflow jobs use. Interestingly, this means it can scale
>> properly, and is technically the only one that can scale properly.
>> Unfortunately, it only works on Dataflow.
>>
>> My work on using the Beam Go SDK inside Google uses a variant of Cross
>> Language IO. This is one reason why I haven't spent any time on the IOs,
>> because they aren't necessary inside Google, and there's not been a use case
>> I could contrive to spend the time to fix them up so far.
>>
>> *General SDK Code Quality*
>> In my opinion the SDK is presently reasonable on general code quality.
>> Most critical aspects have tests, and from Google internal testing on
>> complex and large amounts of data, the SDK is performant, once a few bits
>> of code generation are done to avoid reflection on the hot path.
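
To illustrate the reflection point in general Go terms (this is not the SDK's actual generated code, and the function names below are hypothetical): a framework that accepts arbitrary user functions can always invoke them through the reflect package, but a generated type switch for known signatures lets the hot path make a direct call instead.

```go
package main

import (
	"fmt"
	"reflect"
)

// callReflect invokes a function value through the reflect package.
// It assumes fn takes one string and returns one int, and panics otherwise.
func callReflect(fn interface{}, arg string) int {
	out := reflect.ValueOf(fn).Call([]reflect.Value{reflect.ValueOf(arg)})
	return int(out[0].Int())
}

// callFast stands in for the kind of shim a code generator could emit:
// known signatures are called directly, anything else falls back to reflection.
func callFast(fn interface{}, arg string) int {
	switch f := fn.(type) {
	case func(string) int:
		return f(arg) // direct call, no reflection on this path
	default:
		return callReflect(fn, arg)
	}
}

func main() {
	wordLen := func(s string) int { return len(s) }
	fmt.Println(callReflect(wordLen, "gopher"), callFast(wordLen, "gopher"))
}
```

Both paths return the same result; the difference is only in per-call overhead, which matters when the function runs once per element.
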
>>
>> Various combinations of features should be vetted together better, e.g.
>> using composites that wrap various other Beam primitives. This was an issue
>> resolved recently for CoGBKs.
>>
>> *Beam Features*
>> The SDK is largely usable for Batch Pipelines. I know this since that's
>> what I'm ensuring is the case for a Google internal runner. I know the
>> following "classes of feature" work for the batch use cases, to varying
>> levels of documentation and testing.
>> * DoFns
>> * CombineFns
>>   * Combiner Lifting
>> * CoGroupByKey (Joins)
>> * Side Inputs
>> * User Defined Coders
>> * Global Windows
>> * User Metrics (though they need to move to the new beam Metrics protos)
>>
>> Streaming is another story. The following aren't implemented:
>> * State + Timers + Triggers
>>   * Necessary for portable pubsub IOs for example.
>> * SDFs aren't implemented yet
>>    * Necessary for
>> * Windows
>>    * Session Windows
>>    * Custom WindowFns
>>
>> I haven't run anything in streaming mode, so there are likely other
>> features and considerations I'm missing.
>>
>> The following are implemented but not meaningfully tested:
>> * Windows
>>    * Fixed Windowing
>>    * Sliding Windows
>>
>> Others, like Large Iterables Support or Schemas, are not yet implemented
>> either. There are likely more, but I'd need to list everything from the
>> compatibility matrix.
>>
>> *What I'm spending my time on*
>> Documenting, and debugging Google-internal user issues. The following
>> artifacts will be produced externally in the next few months:
>> * Improved user documentation/programming on the Go SDK (targeted to
>> folks who know Go, but not Beam, or any distributed programming).
>> * An SDK contribution guide to be put on the Wiki, focusing on the "Life of
>> a Pipeline" from the user controller to the worker perspective, and mapping
>> each of those parts to where the SDK deals with them.
>> This should enable others to contribute Beam features to the SDK.
>> * The Versioning Issue mentioned above; it's finicky.
>> * Large (State Backed) Iterable Support
>>
>> *What I'd love help with*
>> 1. Getting the existing suite of SDK integration tests running against a
>> ULR or Flink (there are JIRAs for these).
>> 2. Improving existing IOs, adding tests for existing features over adding
>> new ones.
>>    a) Migrate the existing IOs to use the Go CDK where possible (needs to
>> wait for the Versioning/GoModules/Gradle issue to be resolved though).
>>
>> Your friendly neighbourhood Distributed Gopher Wrangler,
>> Robert Burke (@lostluck)
>>
>> [1]
>> https://docs.google.com/document/d/1nB5qCarN0jmo40zH1J0icZa6Wyb0v4u08AdG4WDTTEY/edit
>>
>> [2]
>> https://lists.apache.org/thread.html/8952f546b449ce8682db221e7688db546e25145c31cd835ed88ad172@%3Cdev.beam.apache.org%3E
>>
>> On Sat, 13 Apr 2019 at 11:30, Thomas Weise <[email protected]> wrote:
>>
>>> How "experimental" is the Go SDK? What are the major work items to reach
>>> MVP? How close are we to being able to run, let's say, wordcount on the portable
>>> Flink runner?
>>>
>>> How current is the roadmap [1]? JIRA [2] could suggest that there is a
>>> lot of work left to do?
>>>
>>> Thanks,
>>> Thomas
>>>
>>> [1] https://beam.apache.org/roadmap/go-sdk/
>>> [2]
>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20and%20component%20%3D%20sdk-go%20and%20resolution%20%3D%20Unresolved%20
>>>
>>>

-- 
Nathan Fisher
 w: http://junctionbox.ca/
