[ 
https://issues.apache.org/jira/browse/BEAM-11076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anonymous updated BEAM-11076:
-----------------------------
    Status: Triage Needed  (was: In Progress)

> Go Direct Runner Improvements
> -----------------------------
>
>                 Key: BEAM-11076
>                 URL: https://issues.apache.org/jira/browse/BEAM-11076
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-go
>            Reporter: Robert Burke
>            Priority: P3
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The Go SDK has a simple direct runner intended for basic batch framework 
> testing. That is, it's only suitable for the barest tests, and not that it 
> ensures that the basics work for arbitrary pipelines.
> The runner has the following features:
>  * Operates on the direct pipeline graph without marshalling through the beam 
> protos.
>  ** This prevents it from validating that the pipeline is valid for portable 
> runners.
>  * Executes the whole pipeline as a single bundle, on a single worker thread. 
> "in process"
>  ** This renders it only suitable for very small data sets, that likely 
> operate in memory.
>  * Doesn't marshal elements.
>  ** While this avoids notionally unnecessary work, it's another reason why 
> users will run into errors after using the direct runner to "validate" their 
> pipeline before moving to Spark or Flink.
> Further, the runner hasn't been validated for beam semantics, nor have more 
> complex features of the Beam Model been implemented or validated. This makes 
> it unsuitable for more than it's current use for demoing the SDK in basic 
> batch operation, and the light use it has testing the SDK itself.
> However, implementing full beam semantics for a runner, even without the 
> distributed portion is a project in itself. It's part of the beam design that 
> implementing the semantics for a beam runner to be more complicated on the 
> runner side vs the SDK side. 
> But there's no reason why we can't improve the Go Direct Runner to match all 
> semantics required of beam for single machine contexts.
> In particular the various improvements below could be made (and should 
> probably be sharded into separate sub task JIRAs as required):
>  * Convert the Go Direct Runner to a "Go Portable Runner" instead, which 
> means implementing the Job Management  and  FnApi protocols. This would 
> ensure that all runners are operating the Go SDK workers in the same way, via 
> the harness. 
>  ** This doesn't preclude "go awareness" for operating everthing in a single 
> binary, or later re-optimizing to avoid serialization.
>  * Allow the runner to execute "headless" (as a job submission server).
>  * Allow the runner to execute more than a single bundle at once.
>  ** Enabling better use of CPU cores in single execution mode.
>  * Add loopback and docker execution mode support, in addition to the Go "in 
> process" support it has.
>  * Once the runner can execute portable pipelines done, it becomes possible 
> to run the Python and Java Runner Validation Tests against the runner to 
> validate all the features of the [Beam Programming 
> Model|https://beam.apache.org/documentation/programming-guide]
>  ** Each feature / TestSuite of which should be handled in separate JIRAs.
>  ** Adding jenkins runs of those passing tests to ensure ongoing validation 
> of the runner against the model.
>  
> A good place to start is being able to run and execute pipelines on the 
> Python Portable runner, which implements all beam semantics correctly. 
> Instructions for doing so are on [Go Tips page in the Dev 
> Wiki|https://cwiki.apache.org/confluence/display/BEAM/Go+Tips]. 
>  
> Direct Runner Code: 
> [https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/runners/direct]
> SDK Harness Code: 
> [https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/harness/harness.go]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to