[ 
https://issues.apache.org/jira/browse/BEAM-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963195#comment-16963195
 ] 

Kenneth Knowles commented on BEAM-3288:
---------------------------------------

I believe it is a fundamental design but to have triggers able to drop data. I 
can't find the discussion now but I believe this has consensus. The trigger 
should regulate the flow rate but not alter the semantic answer. When a trigger 
claims to be "done", as a backstop we should still emit the final pane.

> Guard against unsafe triggers at construction time 
> ---------------------------------------------------
>
>                 Key: BEAM-3288
>                 URL: https://issues.apache.org/jira/browse/BEAM-3288
>             Project: Beam
>          Issue Type: Task
>          Components: sdk-java-core, sdk-py-core
>            Reporter: Eugene Kirpichov
>            Assignee: Kenneth Knowles
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Current Beam trigger semantics are rather confusing and in some cases 
> extremely unsafe, especially if the pipeline includes multiple chained GBKs. 
> One example of that is https://issues.apache.org/jira/browse/BEAM-3169 .
> There's multiple issues:
> The API allows users to specify terminating top-level triggers (e.g. "trigger 
> a pane after receiving 10000 elements in the window, and that's it"), but 
> experience from user support shows that this is nearly always a mistake and 
> the user did not intend to drop all further data.
> In general, triggers are the only place in Beam where data is being dropped 
> without making a lot of very loud noise about it - a practice for which the 
> PTransform style guide uses the language: "never, ever, ever do this".
> Continuation triggers are still worse. For context: continuation trigger is 
> the trigger that's set on the output of a GBK and controls further 
> aggregation of the results of this aggregation by downstream GBKs. The output 
> shouldn't just use the same trigger as the input, because e.g. if the input 
> trigger said "wait for an hour before emitting a pane", that doesn't mean 
> that we should wait for another hour before emitting a result of aggregating 
> the result of the input trigger. Continuation triggers try to simulate the 
> behavior "as if a pane of the input propagated through the entire pipeline", 
> but the implementation of individual continuation triggers doesn't do that. 
> E.g. the continuation of "first N elements in pane" trigger is "first 1 
> element in pane", and if the results of a first GBK are further grouped by a 
> second GBK onto more coarse key (e.g. if everything is grouped onto the same 
> key), that effectively means that, of the keys of the first GBK, only one 
> survives and all others are dropped (what happened in the data loss bug).
> The ultimate fix to all of these things is 
> https://s.apache.org/beam-sink-triggers . However, it is a huge model change, 
> and meanwhile we have to do something. The options are, in order of 
> increasing backward incompatibility (but incompatibility in a "rejecting 
> something that previously was accepted but extremely dangerous" kind of way):
> - Make the continuation trigger of most triggers be the "always-fire" 
> trigger. Seems that this should be the case for all triggers except the 
> watermark trigger. This will definitely increase safety, but lead to more 
> eager firing of downstream aggregations. It also will violate a user's 
> expectation that a fire-once trigger fires everything downstream only once, 
> but that expectation appears impossible to satisfy safely.
> - Make the continuation trigger of some triggers be the "invalid" trigger, 
> i.e. require the user to set it explicitly: there's in general no good and 
> safe way to infer what a trigger on a second GBK "truly" should be, based on 
> the trigger of the PCollection input into a first GBK. This is especially 
> true for terminating triggers.
> - Prohibit top-level terminating triggers entirely. This will ensure that the 
> only data that ever gets dropped is "droppably late" data.
> CC: [~bchambers] [~kenn] [~tgroh]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to