This is an automated email from the ASF dual-hosted git repository. lostluck pushed a commit to branch prism in repository https://gitbox.apache.org/repos/asf/beam.git
commit a8a9e57e6bdd4f5478e78ee08aa0dc584eab9749 Author: lostluck <[email protected]> AuthorDate: Wed Feb 8 13:10:01 2023 -0800 [prism] Add initial README --- sdks/go/pkg/beam/runners/prism/README.md | 169 +++++++++++++++++++++++++++++++ 1 file changed, 169 insertions(+) diff --git a/sdks/go/pkg/beam/runners/prism/README.md b/sdks/go/pkg/beam/runners/prism/README.md new file mode 100644 index 00000000000..fbd73d124c2 --- /dev/null +++ b/sdks/go/pkg/beam/runners/prism/README.md @@ -0,0 +1,169 @@ +# Apache Beam Go Prism Runner + +Prism is a local portable Apache Beam runner authored in Go. + +* Local, for fast startup and ease of testing on a single machine. +* Portable, in that it uses the Beam FnAPI to communicate with Beam SDKs of any language. +* Go simple concurrency enables clear structures for testing batch through streaming jobs. + +It's intended to replace the current Go Direct runner, but also be for general +single machine use. + +For Go SDK users: + - Short term: set runner to "prism" to use it, or invoke directly. + - Medium term: switch the default from "direct" to "prism". + - Long term: alias "direct" to "prism", and delete legacy Go direct runner. + +Prisms allow breaking apart and separating a beam of light into +it's component wavelengths, as well as recombining them together. + +The Prism Runner leans on this metaphor with the goal of making it +easier for users and Beam SDK developers alike to test and validate +aspects of Beam that are presently under represented. + +## Configurability + +Prism is configurable using YAML, which is eagerly validated on startup. +The configuration contains a set of variants to specify execution behavior, +either to support specific testing goals, or to emulate different runners. + +Beam's implementation contains a number of details that are hidden from +users, and to date, no runner implements the same set of features. This +can make SDK or pipeline development difficult, since exactly what is +being tested will vary on the runner being used. + +At the top level the configuration contains "variants", and the variants +configure the behaviors of different "handlers" in Prism. + +Jobs will be able to provide a pipeline option to select which variant to +use. Multiple jobs on the same prism instance can use different variants. +Jobs which don't provide a variant will default to testing behavior. + +All variants should execute the Beam Model faithfully and correctly, +and with few exceptions it should not be possible for there to be an +invalid execution. The machine's the limit. + +It's not expected that all handler options are useful for pipeline authors, +These options should remain useful for SDK developers, +or more precise issue reproduction. + +For more detail on the motivation, see Robert Burke's (@lostluck) Beam Summit 2022 talk: +https://2022.beamsummit.org/sessions/portable-go-beam-runner/. + +Here's a non-exhaustive set of variants. + +### Variant Highlight: "default" + +The "default" variant is testing focused, intending to route out issues at development +time, rather than discovering them on production runners. Notably, this mode should +never use fusion, executing each Transform individually and independantly, one at a time. + +This variant should be able to execute arbitrary pipelines, correctly, with clarity and +precision when an error occurs. Other features supported by the SDK should be enabled by default to +ensure good coverage, such as caches, or RPC reductions like sending elements in +ProcessBundleRequest and Response, as they should not affect correctness. Composite +transforms like Splitable DoFns and Combines should be expanded to ensure coverage. + +Additional validations may be added as time goes on. + +Does not retry or provide other resilience features, which may mask errors. + +To ensure coverage, there may be sibling variants that use mutually exclusive alternative +executions. + +### Variant Highlight: "fast" + +Not Yet Implemented - Illustrative goal. + +The "fast" variant is performance focused, intended for local scale execution. +A psuedo production execution. Fusion optimizations should be performed. +Large PCollection should be offloaded to persistent disk. Bundles should be +dynamically split. Multiple Bundles should be executed simultaneously. And so on. + +Pipelines should execute as swiftly as possible within the bounds of correct +execution. + +### Variant Hightlight: "flink" "dataflow" "spark" AKA Emulations + +Not Yet Implemented - Illustrative goal. + +Emulation variants have the goal of replicating on the local scale, +the behaviors of other runners. Flink execution never "lifts" Combines, and +doesn't dynamically split. Dataflow has different characteristics for batch +and streaming execution with certain execution charateristics enabled or +disabled. + +As Prism is intended to implement all facets of Beam Model execution, the handlers +can have features selectively disabled to ensure + +## Current Limitations + +* Experimental and testing use only. +* Executing docker containers isn't yet implemented. + * This precludes running the Java and Python SDKs, or their transforms for Cross Language. +* In Memory Only + * Not yet suitable for larger jobs, which may have intermediate data that exceeds memory bounds. + * Doesn't yet support sufficient intermediate data garbage collection for indefinite stream processing. +* Doesn't yet execute all beam pipeline features. +* No UI for job status inspection. + +## Implemented so far. + +* DoFns + * Side Inputs + * Multiple Outputs +* Flattens +* GBKs + * Includes handling session windows. + * Global Window + * Interval Windowing + * Session Windows. +* Combines lifted and unlifted. +* Expands Splittable DoFns +* Limited support for Process Continuations + * Residuals are rescheduled for execution immeadiately. + * The transform must be finite (and eventually return a stop process continuation) +* Basic Metrics support + +## Next feature short list (unordered) + +See https://github.com/apache/beam/issues/24789 for current status. + +* Resolve watermark advancement for Process Continuations +* Test Stream +* Triggers & Complex Windowing Strategy execution. +* State +* Timers +* "PubSub" Transform +* Support SDK Containers via Testcontainers + * Cross Language Transforms +* FnAPI Optimizations + * Fusion + * Data with ProcessBundleRequest & Response +* Progess tracking + * Channel Splitting + * Dynamic Splitting +* Stand alone execution support +* UI reporting of in progress jobs + +This is not a comprehensive feature set, but a set of goals to best +support users of the Go SDK in testing their pipelines. + +## How to contribute + +Until additional structure is necessary, check the main issue +https://github.com/apache/beam/issues/24789 for the current +status, file an issue for the feature or bug to fix with `[prism]` +in the title, and refer to the main issue, before begining work +to avoid duplication of effort. + +If a feature will take a long time, please send a PR to +link to your issue from this README to help others discover it. + +Otherwise, ordinary [Beam contribution guidelines apply](https://beam.apache.org/contribute/). + +# Long Term Goals + +Once support for containers is implemented, Prism should become a target +for the Java Runner Validation tests, which are the current specification +for correct runner behavior. This will inform further feature developement.
