+1 to translating from Beam pipeline protos. The Go SDK does that currently in dataflowlib/translate.go to handle the current Dataflow situation, so it's certainly doable.
On Tue, Mar 31, 2020, 5:48 PM Robert Bradshaw <[email protected]> wrote:

> On Tue, Mar 31, 2020 at 12:06 PM Sam Rohde <[email protected]> wrote:
>
>> Hi All,
>>
>> I am currently investigating making the Python DataflowRunner use a
>> portable pipeline representation so that we can eventually get rid of the
>> Pipeline(runner) weirdness.
>>
>> In that case, I have a lot of questions about the Python DataflowRunner:
>>
>> *PValueCache*
>>
>> - Why does this exist?
>
> This is historical baggage from the (long gone) first direct runner, when
> actual computed PCollections were cached, and the DataflowRunner inherited
> it.
>
>> *DataflowRunner*
>>
>> - I see that the DataflowRunner defines some PTransforms as
>> runner-specific primitives by returning a PCollection.from_(...) in apply_
>> methods. Then in the run_ methods, it references the PValueCache to add
>> steps.
>>   - How does this add steps?
>>   - Where does it cache the values to?
>>   - How does the runner harness pick up these cached values to
>>     create new steps?
>>   - How is this information communicated to the runner harness?
>> - Why do the following transforms need to be overridden: GroupByKey,
>> WriteToBigQuery, CombineValues, Read?
>
> Each of these four has a different implementation on Dataflow.
>
>> - Why doesn't the ParDo transform need to be overridden? I see that
>> it has a run_ method but no apply_ method.
>
> apply_ is called at pipeline construction time; all of these should be
> replaced by PTransformOverrides. run_ is called after pipeline construction
> to actually build up the Dataflow graph.
>
>> *Possible fixes*
>>
>> I was thinking of getting rid of the apply_ and run_ methods and
>> replacing them with a PTransformOverride and a simple PipelineVisitor,
>> respectively. Is this feasible? Am I missing any assumptions that don't
>> make this feasible?
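For readers following along, the proposed shape (a construction-time override plus a post-construction visitor that emits runner steps) can be sketched with a toy pipeline tree. The Node/Override/Visitor classes below are hypothetical stand-ins for Beam's AppliedPTransform tree, PTransformOverride, and PipelineVisitor, not the real apache_beam API:

```python
# Toy model of replacing apply_/run_ hooks with an override plus a
# visitor pass; the classes here are illustrative stand-ins, not the
# real apache_beam classes.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str                # e.g. "GroupByKey"
    children: list = field(default_factory=list)


class Override:
    """Matches a transform by name and swaps in a replacement at
    construction time (analogous to PTransformOverride.matches /
    get_replacement_transform)."""
    def __init__(self, match_name, replacement_name):
        self.match_name = match_name
        self.replacement_name = replacement_name

    def matches(self, node):
        return node.name == self.match_name

    def replace(self, node):
        node.name = self.replacement_name


class Visitor:
    """Post-construction walk that collects runner 'steps', analogous
    to a PipelineVisitor replacing the run_ methods."""
    def __init__(self):
        self.steps = []

    def visit(self, node):
        self.steps.append(node.name)
        for child in node.children:
            self.visit(child)


def build_steps(root, overrides):
    # Apply overrides first (construction time), then visit (run time).
    def apply_overrides(node):
        for o in overrides:
            if o.matches(node):
                o.replace(node)
        for child in node.children:
            apply_overrides(child)
    apply_overrides(root)
    v = Visitor()
    v.visit(root)
    return v.steps


pipeline = Node("Read", [Node("ParDo", [Node("GroupByKey")])])
steps = build_steps(pipeline, [Override("GroupByKey", "DataflowGroupByKey")])
print(steps)  # ['Read', 'ParDo', 'DataflowGroupByKey']
```

The key property is that the override runs before the graph is finalized, while the visitor only reads the finished graph, which keeps the two concerns separate.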
> If we're going to overhaul how the runner works, it would be best to make
> DataflowRunner a direct translator from Beam runner API protos to Cloud v1b3
> protos, rather than manipulate the intermediate Python representation
> (which no one wants to change for fear of messing up the DataflowRunner and
> causing headaches for cross-language).
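The suggested direction — translating the runner API representation straight into Dataflow v1b3-style steps — might look roughly like the following. This is a minimal sketch: the dict shapes stand in for the real beam_runner_api_pb2 and v1b3 message types, and KIND_MAP is a hypothetical URN-to-step-kind table loosely mirroring what a translator like the Go SDK's dataflowlib/translate.go does:

```python
# Hypothetical URN-to-step-kind mapping; real translators map many
# more URNs and fill in far richer step properties.
KIND_MAP = {
    "beam:transform:group_by_key:v1": "GroupByKey",
    "beam:transform:read:v1": "ParallelRead",
}


def translate(pipeline_proto):
    """Translate a simplified runner-api-like pipeline dict into a list
    of Dataflow v1b3-style step dicts, without any intermediate Python
    pipeline objects."""
    steps = []
    transforms = pipeline_proto["components"]["transforms"]
    for tid in pipeline_proto["root_transform_ids"]:
        t = transforms[tid]
        steps.append({
            # Unknown URNs fall back to ParallelDo in this sketch.
            "kind": KIND_MAP.get(t["spec_urn"], "ParallelDo"),
            "name": t["unique_name"],
            "properties": {"output_info": list(t["outputs"])},
        })
    return {"steps": steps}


proto = {
    "root_transform_ids": ["t1", "t2"],
    "components": {"transforms": {
        "t1": {"spec_urn": "beam:transform:read:v1",
               "unique_name": "Read", "outputs": ["pc1"]},
        "t2": {"spec_urn": "beam:transform:pardo:v1",
               "unique_name": "ParDo", "outputs": ["pc2"]},
    }},
}
print([s["kind"] for s in translate(proto)["steps"]])
```

Because the translation reads only the proto, cross-language transforms (which exist solely as protos) come through the same code path as Python-constructed ones, which is exactly the property the intermediate-representation approach lacks.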
