Hi everyone,

The interactive beam example using the DirectRunner fails after execution
of the last cell. The recursion limit is exceeded during the calculation of
the cache label because of a circular reference in the PipelineInfo object.

The constructor
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py#L375>
for
the PipelineInfo class creates a mapping from each pcollection to the
transforms that produce and consume it. The issue arises when there exists
a transform that is both a producer and a consumer for the same
pcollection. This occurs when a transform's expand method returns the same
pcoll object that's passed into it. The specific transform causing the
failure of the example is MaybeReshuffle
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py#L2464>,
which
is used in the Create transform. Replacing "return pcoll" with "return
pcoll | Map(lambda x: x)" seems to fix the problem.

A workaround for this issue on the interactive beam side would be fairly
simple, but it seems to me that there should be more validation of
pipelines to prevent the use of transforms that return the same pcoll
that's passed in, or at least a mention of this in the transform style
guide. My understanding is that pcollections are produced by a single
transform (they even have a field called "producer" that references only
one transform). If that's the case then that property of pcollections
should be enforced.

I made ticket BEAM-8451 to track this issue.

I'm still new to beam so I apologize if I'm fundamentally misunderstanding
something. I'm not exactly sure what the next step should be and would
appreciate some recommendations. I can submit a PR to solve the immediate
problem of the failing example but the underlying problem should also be
addressed at some point. I also apologize if people are already aware of
this problem.

Thank You!
Igor Durovic

Reply via email to