Okay cool, so it sounds like the cleanup can be done in two phases: move the apply_ methods to transform replacements, then move Dataflow onto the Cloud v1b3 protos. AFAIU, phase one alone will make the Pipeline object portable? That is, if the InteractiveRunner were to construct a Pipeline object, it could then be handed to the DataflowRunner to run, correct?
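For concreteness, here's the round trip I'm imagining. This is only a sketch: to_runner_api / from_runner_api exist today, and the part I'm assuming is that after phase one the DataflowRunner could actually consume the rehydrated pipeline (running for real would of course also need the usual GCP options):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.runners.dataflow.dataflow_runner import DataflowRunner

    # Build a pipeline with whatever runner is convenient (e.g. interactively).
    options = PipelineOptions()
    p = beam.Pipeline(options=options)
    _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x)

    # Serialize to the portable runner API representation...
    proto = p.to_runner_api()

    # ...and rehydrate it against a different runner.
    p2 = beam.Pipeline.from_runner_api(proto, DataflowRunner(), options)
    result = p2.run()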
On Tue, Mar 31, 2020 at 6:01 PM Robert Burke <[email protected]> wrote:

> +1 to translation from Beam pipeline protos.
>
> The Go SDK does that currently in dataflowlib/translate.go to handle the
> current Dataflow situation, so it's certainly doable.
>
> On Tue, Mar 31, 2020, 5:48 PM Robert Bradshaw <[email protected]> wrote:
>
>> On Tue, Mar 31, 2020 at 12:06 PM Sam Rohde <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> I am currently investigating making the Python DataflowRunner use a
>>> portable pipeline representation so that we can eventually get rid of
>>> the Pipeline(runner) weirdness.
>>>
>>> In that case, I have a lot of questions about the Python DataflowRunner:
>>>
>>> *PValueCache*
>>>
>>> - Why does this exist?
>>>
>> This is historical baggage from the (long gone) first direct runner, when
>> actual computed PCollections were cached, and the DataflowRunner
>> inherited it.
>>
>>> *DataflowRunner*
>>>
>>> - I see that the DataflowRunner defines some PTransforms as
>>>   runner-specific primitives by returning a PCollection.from_(...) in
>>>   apply_ methods. Then in the run_ methods, it references the
>>>   PValueCache to add steps.
>>>   - How does this add steps?
>>>   - Where does it cache the values to?
>>>   - How does the runner harness pick up these cached values to create
>>>     new steps?
>>>   - How is this information communicated to the runner harness?
>>> - Why do the following transforms need to be overridden: GroupByKey,
>>>   WriteToBigQuery, CombineValues, Read?
>>>
>> Each of these four has a different implementation on Dataflow.
>>
>>> - Why doesn't the ParDo transform need to be overridden? I see that it
>>>   has a run_ method but no apply_ method.
>>>
>> apply_ is called at pipeline construction time; all of these should be
>> replaced by PTransformOverrides. run_ is called after pipeline
>> construction to actually build up the Dataflow graph.
>>
>>> *Possible fixes*
>>> I was thinking of getting rid of the apply_ and run_ methods and
>>> replacing those with a PTransformOverride and a simple PipelineVisitor,
>>> respectively. Is this feasible? Am I missing any assumptions that don't
>>> make this feasible?
>>
>> If we're going to overhaul how the runner works, it would be best to make
>> the DataflowRunner a direct translator from Beam runner API protos to
>> Cloud v1b3 protos, rather than manipulate the intermediate Python
>> representation (which no one wants to change for fear of messing up the
>> DataflowRunner and causing headaches for cross-language).
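Re the possible fixes: to make sure I understand the PTransformOverride plus PipelineVisitor idea, here is roughly what I'm picturing. _DataflowGroupByKey is a made-up name standing in for whatever Dataflow-specific primitive replaces the stock transform; the hooks themselves (matches, get_replacement_transform, visit_transform, replace_all, visit) are the existing SDK APIs:

    import apache_beam as beam
    from apache_beam.pipeline import PipelineVisitor, PTransformOverride

    class GroupByKeyOverride(PTransformOverride):
      """Construction-time replacement, taking over the apply_ role."""

      def matches(self, applied_ptransform):
        return isinstance(applied_ptransform.transform, beam.GroupByKey)

      def get_replacement_transform(self, ptransform):
        return _DataflowGroupByKey()  # hypothetical Dataflow primitive

    class StepBuildingVisitor(PipelineVisitor):
      """Post-construction walk, taking over the run_ role."""

      def visit_transform(self, transform_node):
        # Here the runner would emit one Dataflow job step per leaf
        # transform instead of dispatching to run_ methods.
        print('would add a step for', transform_node.full_label)

    # Usage inside the runner:
    #   pipeline.replace_all([GroupByKeyOverride()])
    #   pipeline.visit(StepBuildingVisitor())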
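And re translating straight from Beam runner API protos to Cloud v1b3 protos, a minimal skeleton in the spirit of the Go SDK's dataflowlib/translate.go might look like the following. The proto fields used here (components, root_transform_ids, subtransforms, spec.urn, unique_name) are the real runner API schema; the step dict is just a stand-in for the actual v1b3 messages:

    from apache_beam.portability.api import beam_runner_api_pb2

    def translate(pipeline_proto):
      """Walks a beam_runner_api_pb2.Pipeline and emits Dataflow steps."""
      steps = []
      for transform_id in pipeline_proto.root_transform_ids:
        _translate_transform(transform_id, pipeline_proto.components, steps)
      return steps

    def _translate_transform(transform_id, components, steps):
      transform = components.transforms[transform_id]
      if transform.subtransforms:
        # Composite transform: recurse into the children.
        for child_id in transform.subtransforms:
          _translate_transform(child_id, components, steps)
      else:
        # Leaf transform: emit one step, keyed by the transform's URN.
        # (A real translator would build v1b3 Step messages here.)
        steps.append({
            'kind': transform.spec.urn,
            'name': transform.unique_name,
        })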
