[ 
https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949438#comment-15949438
 ] 

Daniel Halperin commented on BEAM-696:
--------------------------------------

Status update:

* The community has agreed that the model definition is correct, and that 
runners should not apply performance optimizations such as 
"combine-before-GBK" where those changes could produce incorrect answers.

* At the time, documentation was needed to explain this issue -- it was present 
on the old Dataflow website but not the Beam website.

A lot of documentation has been moved to Beam. I found what I believe is the 
content [~amitsela] referenced:

{code}
If the main input element exists in more than one window, then processElement 
gets called multiple times, once for each window. Each call to processElement 
projects the “current” window for the main input element, and thus might 
provide a different view of the side input each time.

If the side input has multiple trigger firings, Beam uses the value from the 
latest trigger firing. This is particularly useful if you use a side input with 
a single global window and specify a trigger.
{code}
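
To make the first quoted paragraph concrete, here is a minimal sketch in plain 
Python. It is a simulation, not Beam API: the helper names (sliding_windows, 
side_input_for, process_element) are hypothetical, and the mapping rule (use 
the side-input window that contains the main window's last timestamp) is an 
assumption modeled loosely on the default behavior for non-merging windows.

```python
def sliding_windows(ts, size, period):
    """Return every [start, start + size) window that contains ts."""
    last_start = ts - ts % period
    starts = range(last_start, ts - size, -period)
    return sorted((s, s + size) for s in starts)


def side_input_for(main_window, side_values, side_size):
    """Map a main-input window to a side-input value (assumed rule:
    pick the fixed side-input window holding the main window's last
    timestamp)."""
    max_ts = main_window[1] - 1
    side_start = max_ts - max_ts % side_size
    return side_values.get(side_start)


def process_element(element, ts, side_values):
    """Simulate one call per window that the element belongs to."""
    calls = []
    for window in sliding_windows(ts, size=10, period=5):
        side = side_input_for(window, side_values, side_size=5)
        calls.append((window, element, side))
    return calls


# Side input in fixed windows of size 5, keyed by window start.
side_values = {0: "a", 5: "b", 10: "c"}
calls = process_element("x", ts=7, side_values=side_values)
for call in calls:
    print(call)
```

The element at timestamp 7 falls into two sliding windows, so the simulated 
processElement runs twice and sees a different side-input value each time.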

[~amitsela] do you think we can close this now?

> Document: Side-Inputs non-deterministic with merging main-input windows
> -----------------------------------------------------------------------
>
>                 Key: BEAM-696
>                 URL: https://issues.apache.org/jira/browse/BEAM-696
>             Project: Beam
>          Issue Type: Task
>          Components: beam-model
>            Reporter: Ben Chambers
>
> Side-Inputs are non-deterministic for several reasons:
> 1. Because they depend on triggering of the side-input (this is acceptable 
> because triggers are by their nature non-deterministic).
> 2. They depend on the current state of the main-input window in order to 
> look up the side-input. This means that with merging
> 3. Any runner optimizations that affect when the side-input is looked up may 
> cause problems with either or both of these.
> This issue focuses on #2 -- the non-determinism of side-inputs that execute 
> within a Merging WindowFn.
> A possible solution would be to defer running anything that looks up the 
> side-input until we need to extract an output, and to use the main-window at 
> that point. Specifically, if the main-window uses a MergingWindowFn, don't 
> execute any kind of pre-combine; instead, buffer all the inputs and combine 
> them later.
> This could still run into some non-determinism if there are triggers 
> controlling when we extract output.
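
The difference between pre-combining and deferring, as described in the issue 
above, can be sketched in plain Python. This is a simulation under stated 
assumptions, not Beam code: the session gap, the side-input values, and the 
helpers (side_lookup, merge_sessions, eager_sum, deferred_sum) are all 
hypothetical, and the side-input lookup uses the main window's last timestamp.

```python
GAP = 3                          # assumed session gap
SIDE_SIZE = 5                    # assumed fixed-window size of the side input
SIDE_VALUES = {0: 10, 5: 100}    # side-input value keyed by window start


def side_lookup(window):
    """Look up the side input via the window's last timestamp (assumed rule)."""
    max_ts = window[1] - 1
    return SIDE_VALUES[max_ts - max_ts % SIDE_SIZE]


def merge_sessions(windows):
    """Merge overlapping [start, end) proto-windows into sessions."""
    merged = []
    for start, end in sorted(windows):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def eager_sum(events):
    """Pre-combine: look up the side input in each unmerged proto-window."""
    return sum(value * side_lookup((ts, ts + GAP)) for ts, value in events)


def deferred_sum(events):
    """Buffer the inputs, merge windows, then look up the side input."""
    merged = merge_sessions((ts, ts + GAP) for ts, _ in events)
    total = 0
    for ts, value in events:
        window = next(w for w in merged if w[0] <= ts < w[1])
        total += value * side_lookup(window)
    return total


events = [(1, 1), (3, 1)]        # (timestamp, value) pairs
print(eager_sum(events), deferred_sum(events))
```

The two proto-windows [1, 4) and [3, 6) merge into [1, 6). The eager path 
reads the side input once through [1, 4) and once through [3, 6), while the 
deferred path reads it through the merged window for both elements, so the 
two sums differ.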



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
