[
https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565305#comment-15565305
]
Amit Sela commented on BEAM-696:
--------------------------------
Correct me if I'm wrong, but I think we all agree that reading a sideInput under
merging (main-input) windows is non-deterministic when using **combiners**.
If the runner defers processing/merging (like Dataflow/Flink do) and applies
the *actual* {{Combine.GroupedValues}} as late as possible (I'm assuming once
all inputs are grouped into the same bundle), then the only remaining problem
is with triggers, as [[email protected]] mentioned.
This means that merging windows can't use an actual (eager) combine, and as
[~bchambers] mentioned (I think it was in Slack #general), that has
implications for performance.
So setting aside the triggers issue, I think there are two ways to go here:
1. State that a sideInput for merging windows should be sort of "idempotent",
in the sense that it shouldn't matter at which stage of the window merging it
is read.
2. A runner should not apply "combining" for merging windows, and should
instead execute it as the underlying {{GroupByKey}} followed by
{{Combine.GroupedValues}} transformations, with the understanding that this
could degrade performance.
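To make the difference between eager combining and option 2 concrete, here is a minimal plain-Java sketch. It is not Beam API, just a toy model. The assumptions are mine and purely illustrative: a session gap of 5, a side input held in fixed windows of size 10, a side-input lookup keyed by the main window's max timestamp (end - 1), and every name in it ({{MergingSideInputSketch}}, {{sideValue}}, {{eagerCombine}}, {{deferredCombine}}) is hypothetical.

```java
class MergingSideInputSketch {
    static final long GAP = 5;         // hypothetical session gap
    static final long SIDE_SIZE = 10;  // hypothetical fixed side-input window size

    // Side-input value seen by a main-input window whose exclusive end is windowEnd.
    // Assumption: the lookup uses the main window's max timestamp (windowEnd - 1);
    // the side input for fixed window k holds the value k + 1.
    static long sideValue(long windowEnd) {
        return (windowEnd - 1) / SIDE_SIZE + 1;
    }

    // Eager pre-combine: each element (value 1) reads the side input through its
    // own proto-window [t, t + GAP) before any merging has happened.
    static long eagerCombine(long[] timestamps) {
        long sum = 0;
        for (long t : timestamps) {
            sum += sideValue(t + GAP);
        }
        return sum;
    }

    // Deferred combine (GroupByKey, then Combine.GroupedValues): buffer everything,
    // merge the windows, and read the side input once through the merged window.
    // Assumption: the timestamps all belong to a single session.
    static long deferredCombine(long[] timestamps) {
        long mergedEnd = Long.MIN_VALUE;
        for (long t : timestamps) {
            mergedEnd = Math.max(mergedEnd, t + GAP);
        }
        return sideValue(mergedEnd) * timestamps.length;
    }

    public static void main(String[] args) {
        long[] ts = {2, 6, 9};  // proto-windows [2,7), [6,11), [9,14) merge into [2,14)
        System.out.println("eager=" + eagerCombine(ts)
            + " deferred=" + deferredCombine(ts));
        // prints "eager=5 deferred=6"
    }
}
```

In this toy model the element at timestamp 2 reads the side input through [2,7) when combined eagerly, but through [2,14) when the combine is deferred, so the two strategies disagree. Under option 1 the side input would have to return the same value in both cases.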
WDYT?
> Side-Inputs non-deterministic with merging main-input windows
> -------------------------------------------------------------
>
> Key: BEAM-696
> URL: https://issues.apache.org/jira/browse/BEAM-696
> Project: Beam
> Issue Type: Bug
> Components: beam-model
> Reporter: Ben Chambers
> Assignee: Pei He
>
> Side-Inputs are non-deterministic for several reasons:
> 1. Because they depend on triggering of the side-input (this is acceptable
> because triggers are by their nature non-deterministic).
> 2. They depend on the current state of the main-input window in order to
> lookup the side-input. This means that with merging
> 3. Any runner optimizations that affect when the side-input is looked up may
> cause problems with either or both of these.
> This issue focuses on #2 -- the non-determinism of side-inputs that execute
> within a Merging WindowFn.
> A possible solution would be to defer running anything that looks up the
> side-input until we need to extract an output, and using the main-window at
> that point. Specifically, if the main-window is a MergingWindowFn, don't
> execute any kind of pre-combine, instead buffer all the inputs and combine
> later.
> This could still run into some non-determinism if there are triggers
> controlling when we extract output.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)