[
https://issues.apache.org/jira/browse/BEAM-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963195#comment-16963195
]
Kenneth Knowles commented on BEAM-3288:
---------------------------------------
I believe it is a fundamental design bug to have triggers able to drop data. I
can't find the discussion now, but I believe this has consensus. The trigger
should regulate the flow rate but not alter the semantic answer. When a trigger
claims to be "done", as a backstop we should still emit the final pane.
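
To make the backstop concrete, here is a minimal sketch in plain Python. It is
not Beam code: `fire_once_after_count` and its `backstop` flag are hypothetical
names invented for this illustration, and the model covers a single window
only. It assumes that whatever is still buffered when the window closes is
flushed as a final pane.

```python
from typing import Iterable, List


def fire_once_after_count(
    elements: Iterable[int], n: int, backstop: bool
) -> List[List[int]]:
    """Toy model of a terminating 'fire after n elements' trigger.

    Returns the panes emitted for one window. Hypothetical helper,
    not part of any Beam SDK.
    """
    panes: List[List[int]] = []
    buffer: List[int] = []
    finished = False
    for element in elements:
        if finished and not backstop:
            # The trigger claimed to be "done": the element is dropped silently.
            continue
        buffer.append(element)
        if not finished and len(buffer) >= n:
            panes.append(buffer)
            buffer = []
            finished = True
    # Window closes: flush anything still buffered as a final pane
    # (assumption of this toy model).
    if buffer:
        panes.append(buffer)
    return panes


data = list(range(10))
without_backstop = fire_once_after_count(data, n=4, backstop=False)
with_backstop = fire_once_after_count(data, n=4, backstop=True)
print(without_backstop)  # [[0, 1, 2, 3]]
print(with_backstop)     # [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]
```

Without the backstop, six of the ten elements never reach the output. With it,
the trigger still controls when the first pane appears, but the union of all
panes contains every element.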
> Guard against unsafe triggers at construction time
> ---------------------------------------------------
>
> Key: BEAM-3288
> URL: https://issues.apache.org/jira/browse/BEAM-3288
> Project: Beam
> Issue Type: Task
> Components: sdk-java-core, sdk-py-core
> Reporter: Eugene Kirpichov
> Assignee: Kenneth Knowles
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Current Beam trigger semantics are rather confusing and in some cases
> extremely unsafe, especially if the pipeline includes multiple chained GBKs.
> One example of that is https://issues.apache.org/jira/browse/BEAM-3169 .
> There are multiple issues:
> The API allows users to specify terminating top-level triggers (e.g. "trigger
> a pane after receiving 10000 elements in the window, and that's it"), but
> experience from user support shows that this is nearly always a mistake and
> the user did not intend to drop all further data.
> In general, triggers are the only place in Beam where data is being dropped
> without making a lot of very loud noise about it - a practice for which the
> PTransform style guide uses the language: "never, ever, ever do this".
> Continuation triggers are still worse. For context: a continuation trigger is
> the trigger that's set on the output of a GBK and controls further
> aggregation of the results of this aggregation by downstream GBKs. The output
> shouldn't just use the same trigger as the input, because e.g. if the input
> trigger said "wait for an hour before emitting a pane", that doesn't mean
> that we should wait for another hour before emitting a result of aggregating
> the result of the input trigger. Continuation triggers try to simulate the
> behavior "as if a pane of the input propagated through the entire pipeline",
> but the implementation of individual continuation triggers doesn't do that.
> E.g. the continuation of the "first N elements in pane" trigger is "first 1
> element in pane", and if the results of a first GBK are further grouped by a
> second GBK onto a coarser key (e.g. if everything is grouped onto the same
> key), that effectively means that, of the keys of the first GBK, only one
> survives and all others are dropped (which is what happened in the data loss
> bug).
> The ultimate fix to all of these things is
> https://s.apache.org/beam-sink-triggers . However, it is a huge model change,
> and meanwhile we have to do something. The options are, in order of
> increasing backward incompatibility (but incompatibility in a "rejecting
> something that previously was accepted but extremely dangerous" kind of way):
> - Make the continuation trigger of most triggers be the "always-fire"
> trigger. Seems that this should be the case for all triggers except the
> watermark trigger. This will definitely increase safety, but lead to more
> eager firing of downstream aggregations. It also will violate a user's
> expectation that a fire-once trigger fires everything downstream only once,
> but that expectation appears impossible to satisfy safely.
> - Make the continuation trigger of some triggers be the "invalid" trigger,
> i.e. require the user to set it explicitly: there's in general no good and
> safe way to infer what a trigger on a second GBK "truly" should be, based on
> the trigger of the PCollection input into a first GBK. This is especially
> true for terminating triggers.
> - Prohibit top-level terminating triggers entirely. This will ensure that the
> only data that ever gets dropped is "droppably late" data.
> CC: [~bchambers] [~kenn] [~tgroh]
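
The continuation-trigger data loss described in the quoted issue can be
reproduced in a toy model. The sketch below is plain Python, not Beam code:
`gbk_with_count_trigger`, the sample `events`, and the `"all"` key are all
hypothetical. It chains two simulated GBKs, first with a terminating "first 1
element" continuation, then with a non-terminating one that fires on every
element (roughly the "always-fire" option from the list above).

```python
from collections import defaultdict
from typing import Any, Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, Any]


def gbk_with_count_trigger(
    pairs: Iterable[Pair], n: int, terminating: bool
) -> List[Tuple[Hashable, List[Any]]]:
    """Toy GBK: per key, emit a pane each time n values are buffered.

    If terminating is True, a key that has fired once drops all later
    values. Partial buffers left at the end are ignored in this toy.
    """
    buffers = defaultdict(list)
    finished = set()
    panes = []
    for key, value in pairs:
        if key in finished:
            continue  # silently dropped
        buffers[key].append(value)
        if len(buffers[key]) >= n:
            panes.append((key, buffers.pop(key)))
            if terminating:
                finished.add(key)
    return panes


events = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("c", 6)]

# First GBK: "first 2 elements in pane", terminating.
first = gbk_with_count_trigger(events, n=2, terminating=True)

# Regroup every per-key sum onto one coarse key.
regrouped = [("all", (key, sum(values))) for key, values in first]

# Second GBK with the inferred continuation "first 1 element in pane".
lossy = gbk_with_count_trigger(regrouped, n=1, terminating=True)

# Second GBK with a non-terminating fire-on-every-element trigger.
safe = gbk_with_count_trigger(regrouped, n=1, terminating=False)

print(lossy)  # [('all', [('a', 4)])]
print(safe)   # [('all', [('a', 4)]), ('all', [('b', 7)]), ('all', [('c', 10)])]
```

In the lossy run only key "a" survives the second grouping; the results for
"b" and "c" are dropped without any error. The non-terminating variant keeps
all three, at the cost of emitting one downstream pane per upstream result.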