[ https://issues.apache.org/jira/browse/BEAM-11076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anonymous updated BEAM-11076: ----------------------------- Status: Triage Needed (was: In Progress) > Go Direct Runner Improvements > ----------------------------- > > Key: BEAM-11076 > URL: https://issues.apache.org/jira/browse/BEAM-11076 > Project: Beam > Issue Type: Improvement > Components: sdk-go > Reporter: Robert Burke > Priority: P3 > Time Spent: 0.5h > Remaining Estimate: 0h > > The Go SDK has a simple direct runner intended for basic batch framework > testing. That is, it's only suitable for the barest tests, and not that it > ensures that the basics work for arbitrary pipelines. > The runner has the following features: > * Operates on the direct pipeline graph without marshalling through the beam > protos. > ** This prevents it from validating that the pipeline is valid for portable > runners. > * Executes the whole pipeline as a single bundle, on a single worker thread. > "in process" > ** This renders it only suitable for very small data sets, that likely > operate in memory. > * Doesn't marshal elements. > ** While this avoids notionally unnecessary work, it's another reason why > users will run into errors after using the direct runner to "validate" their > pipeline before moving to Spark or Flink. > Further, the runner hasn't been validated for beam semantics, nor have more > complex features of the Beam Model been implemented or validated. This makes > it unsuitable for more than it's current use for demoing the SDK in basic > batch operation, and the light use it has testing the SDK itself. > However, implementing full beam semantics for a runner, even without the > distributed portion is a project in itself. It's part of the beam design that > implementing the semantics for a beam runner to be more complicated on the > runner side vs the SDK side. > But there's no reason why we can't improve the Go Direct Runner to match all > semantics required of beam for single machine contexts. > In particular the various improvements below could be made (and should > probably be sharded into separate sub task JIRAs as required): > * Convert the Go Direct Runner to a "Go Portable Runner" instead, which > means implementing the Job Management and FnApi protocols. This would > ensure that all runners are operating the Go SDK workers in the same way, via > the harness. > ** This doesn't preclude "go awareness" for operating everthing in a single > binary, or later re-optimizing to avoid serialization. > * Allow the runner to execute "headless" (as a job submission server). > * Allow the runner to execute more than a single bundle at once. > ** Enabling better use of CPU cores in single execution mode. > * Add loopback and docker execution mode support, in addition to the Go "in > process" support it has. > * Once the runner can execute portable pipelines done, it becomes possible > to run the Python and Java Runner Validation Tests against the runner to > validate all the features of the [Beam Programming > Model|https://beam.apache.org/documentation/programming-guide] > ** Each feature / TestSuite of which should be handled in separate JIRAs. > ** Adding jenkins runs of those passing tests to ensure ongoing validation > of the runner against the model. > > A good place to start is being able to run and execute pipelines on the > Python Portable runner, which implements all beam semantics correctly. > Instructions for doing so are on [Go Tips page in the Dev > Wiki|https://cwiki.apache.org/confluence/display/BEAM/Go+Tips]. > > Direct Runner Code: > [https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/runners/direct] > SDK Harness Code: > [https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/harness/harness.go] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)