That is correct. Side inputs give a view of the "whole" PCollection and
hence introduce a fusion-producing barrier. For example, suppose one has a
DoFn that produces two outputs, mainPColl and sidePColl, that are consumed
(as the main and side input respectively) of DoFnB.

                  -------- mainPColl ----- DoFnB
                /                            ^
inPColl -- DoFnA                             |
                \                            |
                  -------- sidePColl ------- /


Now DoFnB may iterate over the entity of sidePColl for every element of
mainPColl. This means that DoFnA and DoFnB cannot be fused, which
would require DoFnB to consume the elements as they are produced from
DoFnA, but we need DoFnA to run to completion before we know the contents
of sidePColl.

Similar constraints apply in larger graphs (e.g. there may be many
intermediate DoFns and PCollections), but they principally boil down to
shapes that look like this.

Though this does not introduce a global barrier in streaming, there is
still the analogous per window/watermark barrier that prevents fusion for
the same reasons.




On Thu, Dec 14, 2023 at 3:02 PM Joey Tran <joey.t...@schrodinger.com> wrote:

> Hey all,
>
> We have a pretty big pipeline and while I was inspecting the stages, I
> noticed there is less fusion than I expected. I suspect it has to do with
> the heavy use of side inputs in our workflow. In the python sdk, I see that
> side inputs are considered when determining whether two stages are fusible.
> I have a hard time getting a clear understanding of the logic though. Could
> someone clarify / summarize the rules around this?
>
> Thanks!
> Joey
>

Reply via email to