+dev@

Reviving this thread as it has hit me again on Dataflow.  I am trying to
upgrade an active streaming pipeline from 2.36.0 to 2.40.0.  Originally, I
received an error that the step "Flatten.pCollections" was missing from the
new job graph.  I knew from the code that this wasn't true, so I dumped the
job file via "--dataflowJobFile" for both the running pipeline and the new
version I'm attempting to update to.  Both job files showed identical data
for the Flatten.pCollections step, which raises the question of why it
would have been reported as missing.
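For concreteness, this is roughly how I compared the step lists of the two
dumped job files (a crude sketch: paths are illustrative, and it pulls out
every "name" field via regex rather than properly parsing the Dataflow job
JSON, so take it with a grain of salt):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: crude comparison of step names in two dumped Dataflow job files.
// Assumes each step carries a "name" field, as in the JSON written by
// --dataflowJobFile; a real JSON parser would be more robust than regex.
public class JobFileDiff {
  private static final Pattern NAME = Pattern.compile("\"name\"\\s*:\\s*\"([^\"]+)\"");

  static Set<String> stepNames(String json) {
    Set<String> names = new TreeSet<>();
    Matcher m = NAME.matcher(json);
    while (m.find()) {
      names.add(m.group(1));
    }
    return names;
  }

  public static void main(String[] args) throws IOException {
    // Usage: java JobFileDiff old-job.json new-job.json
    Set<String> oldSteps = stepNames(Files.readString(Path.of(args[0])));
    Set<String> newSteps = stepNames(Files.readString(Path.of(args[1])));
    Set<String> removed = new TreeSet<>(oldSteps);
    removed.removeAll(newSteps);
    Set<String> added = new TreeSet<>(newSteps);
    added.removeAll(oldSteps);
    System.out.println("Only in old job: " + removed);
    System.out.println("Only in new job: " + added);
  }
}
```

Both sets came back identical, including Flatten.pCollections.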

Out of curiosity I then tried explicitly mapping the step to the same name
via "--transformNameMapping", which changed the error to:  "The Coder or
type for step Flatten.pCollections/Unzipped-2/FlattenReplace has changed."
Again, the job files show identical coders for the Flatten step (though
"Unzipped-2/FlattenReplace" is not present in the job file, maybe an
internal Dataflow thing?), so I'm confident that the coder hasn't actually
changed.
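In case it matters, the mapping was passed as the --transformNameMapping
update option, whose value is a small JSON object from old step names to
new ones.  Building the flag looks roughly like this (step names are
illustrative, and this assumes the names need no JSON escaping):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: assembling the --transformNameMapping flag for a Dataflow update.
// The flag value is a JSON object mapping old transform names to new ones.
public class TransformNameMappingFlag {
  static String toFlag(Map<String, String> mapping) {
    return "--transformNameMapping="
        + mapping.entrySet().stream()
            .map(e -> "\"" + e.getKey() + "\":\"" + e.getValue() + "\"")
            .collect(Collectors.joining(",", "{", "}"));
  }

  public static void main(String[] args) {
    Map<String, String> mapping = new LinkedHashMap<>();
    // Mapping a step to its own name, as described above.
    mapping.put("Flatten.pCollections", "Flatten.pCollections");
    System.out.println(toFlag(mapping));
  }
}
```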

I'm not sure how to proceed in updating the running pipeline, and I'd
really prefer not to drain.  Any ideas?

Thanks,
Evan


On Fri, Oct 22, 2021 at 3:36 PM Evan Galpin <[email protected]> wrote:

> Thanks for the ideas, Luke.  I checked out the JSON graphs per your
> recommendation (thanks for that; I was previously unaware of the option),
> and the "output_info" was identical for both the running pipeline and the
> pipeline I was hoping to update it with.  I ended up opting to drain and
> submit the updated pipeline as a new job.  Thanks for the tips!
>
> Thanks,
> Evan
>
> On Thu, Oct 21, 2021 at 7:02 PM Luke Cwik <[email protected]> wrote:
>
>> I would suggest dumping the JSON representation (with the
>> --dataflowJobFile=/path/to/output.json) of the pipeline before and after
>> and looking to see what is being submitted to Dataflow. Dataflow's JSON
>> graph representation is a bipartite graph where there are transform nodes
>> with inputs and outputs and PCollection nodes with no inputs or outputs.
>> The PCollection nodes typically end with the suffix ".out". This could help
>> find steps that have been added/removed/renamed.
>>
>> The PipelineDotRenderer[1] might be of use as well.
>>
>> 1:
>> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/renderer/PipelineDotRenderer.java
>>
>> On Thu, Oct 21, 2021 at 11:54 AM Evan Galpin <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm looking for any help regarding updating streaming jobs which are
>>> already running on Dataflow.  Specifically, I'm seeking guidance for
>>> situations where fusion is involved, and trying to decipher which old
>>> steps should be mapped to which new steps.
>>>
>>> I have a case where I updated the steps which come after the step in
>>> question, but when I attempt to update, there is an error that "<old step>
>>> no longer produces data to the steps <downstream step>".  I believe that
>>> <old step> is only changed as a result of fusion, and in reality it does
>>> in fact produce data to <downstream step> (confirmed when deployed as a
>>> new job for testing purposes).
>>>
>>> Is there a guide for how to deal with updates and fusion?
>>>
>>> Thanks,
>>> Evan
>>>
>>
