+1 to translating from Beam pipeline protos. The Go SDK does that currently in dataflowlib/translate.go to handle the current Dataflow situation, so it's certainly doable.
On Tue, Mar 31, 2020, 5:48 PM Robert Bradshaw <[email protected]> wrote:

> On Tue, Mar 31, 2020 at 12:06 PM Sam Rohde <[email protected]> wrote:
>
>> Hi All,
>>
>> I am currently investigating making the Python DataflowRunner use a
>> portable pipeline representation so that we can eventually get rid of the
>> Pipeline(runner) weirdness.
>>
>> In that case, I have a lot of questions about the Python DataflowRunner:
>>
>> *PValueCache*
>>
>> - Why does this exist?
>
> This is historical baggage from the (long gone) first direct runner, when
> actual computed PCollections were cached, and the DataflowRunner inherited
> it.
>
>> *DataflowRunner*
>>
>> - I see that the DataflowRunner defines some PTransforms as
>> runner-specific primitives by returning a PCollection.from_(...) in apply_
>> methods. Then in the run_ methods, it references the PValueCache to add
>> steps.
>>   - How does this add steps?
>>   - Where does it cache the values to?
>>   - How does the runner harness pick up these cached values to
>>     create new steps?
>>   - How is this information communicated to the runner harness?
>> - Why do the following transforms need to be overridden: GroupByKey,
>> WriteToBigQuery, CombineValues, Read?
>
> Each of these four has a different implementation on Dataflow.
>
>> - Why doesn't the ParDo transform need to be overridden? I see that
>> it has a run_ method but no apply_ method.
>
> apply_ is called at pipeline construction time; all of these should be
> replaced by PTransformOverrides. run_ is called after pipeline construction
> to actually build up the Dataflow graph.
>
>> *Possible fixes*
>>
>> I was thinking of getting rid of the apply_ and run_ methods and
>> replacing them with a PTransformOverride and a simple PipelineVisitor,
>> respectively. Is this feasible? Am I missing any assumptions that don't
>> make this feasible?
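For readers following along, the proposed shape (a construction-time override plus a post-construction visitor that emits runner steps) can be sketched with a toy pipeline tree. The Node/Override/Visitor classes below are hypothetical stand-ins for Beam's AppliedPTransform tree, PTransformOverride, and PipelineVisitor, not the real apache_beam API:

```python
# Toy model of replacing apply_/run_ hooks with an override plus a
# visitor pass; the classes here are illustrative stand-ins, not the
# real apache_beam classes.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str                # e.g. "GroupByKey"
    children: list = field(default_factory=list)


class Override:
    """Matches a transform by name and swaps in a replacement at
    construction time (analogous to PTransformOverride.matches /
    get_replacement_transform)."""
    def __init__(self, match_name, replacement_name):
        self.match_name = match_name
        self.replacement_name = replacement_name

    def matches(self, node):
        return node.name == self.match_name

    def replace(self, node):
        node.name = self.replacement_name


class Visitor:
    """Post-construction walk that collects runner 'steps', analogous
    to a PipelineVisitor replacing the run_ methods."""
    def __init__(self):
        self.steps = []

    def visit(self, node):
        self.steps.append(node.name)
        for child in node.children:
            self.visit(child)


def build_steps(root, overrides):
    # Apply overrides first (construction time), then visit (run time).
    def apply_overrides(node):
        for o in overrides:
            if o.matches(node):
                o.replace(node)
        for child in node.children:
            apply_overrides(child)
    apply_overrides(root)
    v = Visitor()
    v.visit(root)
    return v.steps


pipeline = Node("Read", [Node("ParDo", [Node("GroupByKey")])])
steps = build_steps(pipeline, [Override("GroupByKey", "DataflowGroupByKey")])
print(steps)  # ['Read', 'ParDo', 'DataflowGroupByKey']
```

The key property is that the override runs before the graph is finalized, while the visitor only reads the finished graph, which keeps the two concerns separate.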
> If we're going to overhaul how the runner works, it would be best to make
> DataflowRunner a direct translator from Beam runner API protos to Cloud v1b3
> protos, rather than manipulate the intermediate Python representation
> (which no one wants to change for fear of messing up the DataflowRunner and
> causing headaches for cross-language).
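The suggested direction — translating the runner API representation straight into Dataflow v1b3-style steps — might look roughly like the following. This is a minimal sketch: the dict shapes stand in for the real beam_runner_api_pb2 and v1b3 message types, and KIND_MAP is a hypothetical URN-to-step-kind table loosely mirroring what a translator like the Go SDK's dataflowlib/translate.go does:

```python
# Hypothetical URN-to-step-kind mapping; real translators map many
# more URNs and fill in far richer step properties.
KIND_MAP = {
    "beam:transform:group_by_key:v1": "GroupByKey",
    "beam:transform:read:v1": "ParallelRead",
}


def translate(pipeline_proto):
    """Translate a simplified runner-api-like pipeline dict into a list
    of Dataflow v1b3-style step dicts, without any intermediate Python
    pipeline objects."""
    steps = []
    transforms = pipeline_proto["components"]["transforms"]
    for tid in pipeline_proto["root_transform_ids"]:
        t = transforms[tid]
        steps.append({
            # Unknown URNs fall back to ParallelDo in this sketch.
            "kind": KIND_MAP.get(t["spec_urn"], "ParallelDo"),
            "name": t["unique_name"],
            "properties": {"output_info": list(t["outputs"])},
        })
    return {"steps": steps}


proto = {
    "root_transform_ids": ["t1", "t2"],
    "components": {"transforms": {
        "t1": {"spec_urn": "beam:transform:read:v1",
               "unique_name": "Read", "outputs": ["pc1"]},
        "t2": {"spec_urn": "beam:transform:pardo:v1",
               "unique_name": "ParDo", "outputs": ["pc2"]},
    }},
}
print([s["kind"] for s in translate(proto)["steps"]])
```

Because the translation reads only the proto, cross-language transforms (which exist solely as protos) come through the same code path as Python-constructed ones, which is exactly the property the intermediate-representation approach lacks.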
