I'm in favor of option 3 in all cases, and in favor of option 2 if it's
considered to be "more backwards-compatible" than option 1.

For option 1, I'm in favor of at minimum making the continuation trigger for
all triggers be nonterminating. I'm not super bothered by how we accomplish
that, so long as we don't drop more data due to a continuation trigger.
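
To make "nonterminating" concrete, here is a tiny sketch (my own
illustration, not code from BEAM-3169, and leaning on the continuation-trigger
behavior described further down in this thread via Trigger#getContinuationTrigger):

import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Trigger;

class ContinuationSketch {
  public static void main(String[] args) {
    Trigger upstream = AfterPane.elementCountAtLeast(10000);

    // What a downstream GBK inherits today: per the description below, this is
    // equivalent to AfterPane.elementCountAtLeast(1), which still terminates
    // and can therefore drop later panes.
    Trigger today = upstream.getContinuationTrigger();
    System.out.println(today);

    // Roughly what I'm asking for: a continuation that never terminates, so
    // downstream aggregations may fire more eagerly but never drop data.
    Trigger proposed = Repeatedly.forever(AfterPane.elementCountAtLeast(1));
    System.out.println(proposed);
  }
}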

On Mon, Dec 4, 2017 at 4:02 PM, Raghu Angadi <rang...@google.com> wrote:

> I have been thinking about this since last week's discussions about
> buffering in sinks and was reading https://s.apache.org/beam-sink-triggers. It
> says BEAM-3169 is an example of a bug caused by misunderstanding of trigger
> semantics.
>
>   - I would like to know which part of the (documented) trigger semantics
> it misunderstood. I personally didn't think about how triggers are
> propagated/enforced downstream. I always thought of them in the context of
> the current step. The JavaDoc for triggers does not mention it explicitly.
>
>   - In addition, the fix for BEAM-3169 seems to essentially be using
> Reshuffle, which places elements into its own window. KafkaIO's exactly-once
> sink does something similar
> <https://github.com/apache/beam/blob/dedc5e8f25a560ad9d620cab468f417def1747fb/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1930>.
> Is it a violation of triggers set upstream (as mentioned in a recent dev
> thread
> <https://lists.apache.org/thread.html/ebcb316edb85c6bb2c1024a8ac52f0647dab069310b217802b735962@%3Cdev.beam.apache.org%3E>
> )?
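>
> Roughly the pattern I have in mind, as a sketch (my paraphrase, not the
> actual Reshuffle or KafkaIO code): re-window into the global window with an
> always-firing trigger right before the internal GBK, so the upstream trigger
> no longer controls when, or whether, those elements are emitted.
>
> import org.apache.beam.sdk.transforms.GroupByKey;
> import org.apache.beam.sdk.transforms.windowing.AfterPane;
> import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
> import org.apache.beam.sdk.transforms.windowing.Repeatedly;
> import org.apache.beam.sdk.transforms.windowing.Window;
> import org.apache.beam.sdk.values.KV;
> import org.apache.beam.sdk.values.PCollection;
>
> class RewindowSketch {
>   // Hypothetical helper: discard the incoming windowing/trigger and group
>   // under an always-firing trigger in the global window.
>   static <K, V> PCollection<KV<K, Iterable<V>>> regroup(PCollection<KV<K, V>> input) {
>     return input
>         .apply(
>             Window.<KV<K, V>>into(new GlobalWindows())
>                 .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
>                 .discardingFiredPanes())
>         .apply(GroupByKey.create());
>   }
> }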
>
> I guess I am hoping for expanded JavaDoc that describes trigger semantics in
> much more detail (preferably with examples), so that users and developers
> can understand them better and not suffer from many subtle bugs. Best
> practices are useful, of course, and having users actually understand the
> right semantics is also very useful.
>
>
> On Mon, Dec 4, 2017 at 3:19 PM, Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> Hi,
>>
>> After a recent investigation of a data loss bug caused by unintuitive
>> behavior of some kinds of triggers, we had a discussion about how we can
>> protect against future issues like this, and I summarized it in
>> https://issues.apache.org/jira/browse/BEAM-3288 . Copying here:
>>
>> Current Beam trigger semantics are rather confusing and in some cases
>> extremely unsafe, especially if the pipeline includes multiple chained
>> GBKs. One example of that is
>> https://issues.apache.org/jira/browse/BEAM-3169 .
>>
>> There are multiple issues:
>>
>> The API allows users to specify terminating top-level triggers (e.g.
>> "trigger a pane after receiving 10000 elements in the window, and that's
>> it"), but experience from user support shows that this is nearly always a
>> mistake and the user did not intend to drop all further data.
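>>
>> For concreteness, a sketch of such a spec (a hypothetical pipeline, not
>> taken from a real bug report):
>>
>> import org.apache.beam.sdk.transforms.GroupByKey;
>> import org.apache.beam.sdk.transforms.windowing.AfterPane;
>> import org.apache.beam.sdk.transforms.windowing.FixedWindows;
>> import org.apache.beam.sdk.transforms.windowing.Window;
>> import org.apache.beam.sdk.values.KV;
>> import org.apache.beam.sdk.values.PCollection;
>> import org.joda.time.Duration;
>>
>> class TerminatingTriggerSketch {
>>   static <K, V> PCollection<KV<K, Iterable<V>>> group(PCollection<KV<K, V>> input) {
>>     return input
>>         .apply(
>>             Window.<KV<K, V>>into(FixedWindows.of(Duration.standardMinutes(10)))
>>                 // Terminating: fires one pane after 10000 elements, then
>>                 // silently drops every further element in the window.
>>                 // What the user almost always meant is
>>                 // Repeatedly.forever(AfterPane.elementCountAtLeast(10000)).
>>                 .triggering(AfterPane.elementCountAtLeast(10000))
>>                 .withAllowedLateness(Duration.standardHours(1))
>>                 .discardingFiredPanes())
>>         .apply(GroupByKey.create());
>>   }
>> }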
>>
>> In general, triggers are the only place in Beam where data is being
>> dropped without making a lot of very loud noise about it - a practice for
>> which the PTransform style guide uses the language: "never, ever, ever do
>> this".
>>
>> Continuation triggers are even worse. For context: a continuation trigger
>> is the trigger that's set on the output of a GBK and controls further
>> aggregation of the results of this aggregation by downstream GBKs. The
>> output shouldn't just use the same trigger as the input, because e.g. if
>> the input trigger said "wait for an hour before emitting a pane", that
>> doesn't mean that we should wait for another hour before emitting a result
>> of aggregating the result of the input trigger. Continuation triggers try
>> to simulate the behavior "as if a pane of the input propagated through the
>> entire pipeline", but the implementation of individual continuation
>> triggers doesn't do that. E.g. the continuation of "first N elements in
>> pane" trigger is "first 1 element in pane", and if the results of a first
>> GBK are further grouped by a second GBK onto a coarser key (e.g. if
>> everything is grouped onto the same key), that effectively means that, of
>> the keys of the first GBK, only one survives and all others are dropped
>> (what happened in the data loss bug).
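>>
>> A sketch of the pipeline shape I'm describing (hypothetical types and keys,
>> not the actual BEAM-3169 pipeline):
>>
>> import org.apache.beam.sdk.transforms.GroupByKey;
>> import org.apache.beam.sdk.transforms.WithKeys;
>> import org.apache.beam.sdk.transforms.windowing.AfterPane;
>> import org.apache.beam.sdk.transforms.windowing.FixedWindows;
>> import org.apache.beam.sdk.transforms.windowing.Window;
>> import org.apache.beam.sdk.values.KV;
>> import org.apache.beam.sdk.values.PCollection;
>> import org.joda.time.Duration;
>>
>> class ChainedGbkSketch {
>>   static void build(PCollection<KV<String, Integer>> input) {
>>     // First GBK: the user sets a terminating element-count trigger.
>>     PCollection<KV<String, Iterable<Integer>>> first =
>>         input
>>             .apply(
>>                 Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(1)))
>>                     .triggering(AfterPane.elementCountAtLeast(10000))
>>                     .withAllowedLateness(Duration.ZERO)
>>                     .discardingFiredPanes())
>>             .apply(GroupByKey.create());
>>
>>     // Second GBK: regroups everything onto one coarse key. Its input
>>     // inherits the continuation trigger, "first 1 element in pane", which
>>     // is also terminating: it fires on the first upstream pane it sees for
>>     // that key/window and drops the panes from all other upstream keys.
>>     first
>>         .apply(WithKeys.of("all"))
>>         .apply(GroupByKey.create());
>>   }
>> }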
>>
>> The ultimate fix to all of these things is
>> https://s.apache.org/beam-sink-triggers . However, it is a huge model
>> change, and meanwhile we have
>> to do something. The options are, in order of increasing backward
>> incompatibility (but incompatibility in a "rejecting something that
>> previously was accepted but extremely dangerous" kind of way):
>>
>>    - *Make the continuation trigger of most triggers be the
>>    "always-fire" trigger.* Seems that this should be the case for all
>>    triggers except the watermark trigger. This will definitely increase
>>    safety, but lead to more eager firing of downstream aggregations. It also
>>    will violate a user's expectation that a fire-once trigger fires
>>    everything downstream only once, but that expectation appears impossible
>>    to satisfy safely.
>>    - *Make the continuation trigger of some triggers be the "invalid"
>>    trigger, *i.e. require the user to set it explicitly: there's in
>>    general no good and safe way to infer what a trigger on a second GBK
>>    "truly" should be, based on the trigger of the PCollection input into a
>>    first GBK. This is especially true for terminating triggers.
>>    - *Prohibit top-level terminating triggers entirely. *This will
>>    ensure that the only data that ever gets dropped is "droppably late" data.
>>
>>
>> Do people think that these options are sensible?
>> +Kenn Knowles <k...@google.com> +Thomas Groh <tg...@google.com> +Ben
>> Chambers <bchamb...@google.com> is this a fair summary of our discussion?
>>
>> Thanks!
>>
>
>
