ASF GitHub Bot logged work on BEAM-3288:

                Author: ASF GitHub Bot
            Created on: 08/Nov/19 21:49
            Start Date: 08/Nov/19 21:49
    Worklog Time Spent: 10m 
      Work Description: kennknowles commented on issue #9960: [BEAM-3288] Guard 
against unsafe triggers at construction time
URL: https://github.com/apache/beam/pull/9960#issuecomment-552002401
   Fixed commit history and I will merge once Java suite passes.
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

Issue Time Tracking

    Worklog Id:     (was: 340761)
    Time Spent: 4h 10m  (was: 4h)

> Guard against unsafe triggers at construction time 
> ---------------------------------------------------
>                 Key: BEAM-3288
>                 URL: https://issues.apache.org/jira/browse/BEAM-3288
>             Project: Beam
>          Issue Type: Task
>          Components: sdk-java-core, sdk-py-core
>            Reporter: Eugene Kirpichov
>            Assignee: Kenneth Knowles
>            Priority: Major
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
> Current Beam trigger semantics are rather confusing and in some cases 
> extremely unsafe, especially if the pipeline includes multiple chained GBKs. 
> One example of that is https://issues.apache.org/jira/browse/BEAM-3169 .
> There's multiple issues:
> The API allows users to specify terminating top-level triggers (e.g. "trigger 
> a pane after receiving 10000 elements in the window, and that's it"), but 
> experience from user support shows that this is nearly always a mistake and 
> the user did not intend to drop all further data.
> In general, triggers are the only place in Beam where data is being dropped 
> without making a lot of very loud noise about it - a practice for which the 
> PTransform style guide uses the language: "never, ever, ever do this".
> Continuation triggers are still worse. For context: continuation trigger is 
> the trigger that's set on the output of a GBK and controls further 
> aggregation of the results of this aggregation by downstream GBKs. The output 
> shouldn't just use the same trigger as the input, because e.g. if the input 
> trigger said "wait for an hour before emitting a pane", that doesn't mean 
> that we should wait for another hour before emitting a result of aggregating 
> the result of the input trigger. Continuation triggers try to simulate the 
> behavior "as if a pane of the input propagated through the entire pipeline", 
> but the implementation of individual continuation triggers doesn't do that. 
> E.g. the continuation of "first N elements in pane" trigger is "first 1 
> element in pane", and if the results of a first GBK are further grouped by a 
> second GBK onto more coarse key (e.g. if everything is grouped onto the same 
> key), that effectively means that, of the keys of the first GBK, only one 
> survives and all others are dropped (what happened in the data loss bug).
> The ultimate fix to all of these things is 
> https://s.apache.org/beam-sink-triggers . However, it is a huge model change, 
> and meanwhile we have to do something. The options are, in order of 
> increasing backward incompatibility (but incompatibility in a "rejecting 
> something that previously was accepted but extremely dangerous" kind of way):
> - Make the continuation trigger of most triggers be the "always-fire" 
> trigger. Seems that this should be the case for all triggers except the 
> watermark trigger. This will definitely increase safety, but lead to more 
> eager firing of downstream aggregations. It also will violate a user's 
> expectation that a fire-once trigger fires everything downstream only once, 
> but that expectation appears impossible to satisfy safely.
> - Make the continuation trigger of some triggers be the "invalid" trigger, 
> i.e. require the user to set it explicitly: there's in general no good and 
> safe way to infer what a trigger on a second GBK "truly" should be, based on 
> the trigger of the PCollection input into a first GBK. This is especially 
> true for terminating triggers.
> - Prohibit top-level terminating triggers entirely. This will ensure that the 
> only data that ever gets dropped is "droppably late" data.
> CC: [~bchambers] [~kenn] [~tgroh]

This message was sent by Atlassian Jira

Reply via email to