Triggers still finish and drop all data

Kenneth Knowles Wed, 30 Oct 2019 15:27:37 -0700

Problem: a trigger can "finish" which causes a window to "close" and drop
all remaining data arriving for that window.


This has been discussed many times, and I thought it had been fixed, but it
seems it has not. I cannot find a Jira or thread dedicated to it.
But here are some pointers:

 - data loss bug:
https://lists.apache.org/thread.html/ce413231d0b7d52019668765186ef27a7ffb69b151fdb34f4bf80b0f@%3Cdev.beam.apache.org%3E
 - user hitting the bug:
https://lists.apache.org/thread.html/28879bc80cd5c7ef1a3e38cb1d2c063165d40c13c02894bbccd66aca@%3Cuser.beam.apache.org%3E
 - user confusion:
https://lists.apache.org/thread.html/2707aa449c8c6de1c6e3e8229db396323122304c14931c44d0081449@%3Cuser.beam.apache.org%3E
 - thread from 2016 on the topic:
https://lists.apache.org/thread.html/5f44b62fdaf34094ccff8da2a626b7cd344d29a8a0fff6eac8e148ea@%3Cdev.beam.apache.org%3E

In theory, trigger finishing was intended for users who can get their
answers from a smaller amount of data and then drop the rest. In practice,
triggers aren't really expressive enough for this. Stateful DoFn is the
solution for these cases.

I've opened https://github.com/apache/beam/pull/9942 which makes the
following changes:

 - when a trigger says it is finished, it never fires again but data is
still kept
 - at GC time (when the window expires) the final output will be emitted
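
To make the difference concrete, here is a toy model in plain Python. It is
not Beam code and not what the PR implements; the names ToyWindow, simulate,
fire_after and keep_data_after_finish are made up for illustration, and it
assumes a count-based trigger that fires once and then finishes, with
discarding accumulation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyWindow:
    """Hypothetical model of one window whose trigger finishes after one firing."""

    fire_after: int                # trigger fires once this many elements are buffered
    keep_data_after_finish: bool   # False: old behavior, True: proposed behavior
    buffered: List[str] = field(default_factory=list)
    finished: bool = False
    outputs: List[List[str]] = field(default_factory=list)
    dropped: List[str] = field(default_factory=list)

    def add(self, element: str) -> None:
        # Old behavior: a finished trigger "closes" the window and data is dropped.
        if self.finished and not self.keep_data_after_finish:
            self.dropped.append(element)
            return
        self.buffered.append(element)
        # The trigger fires at most once, then reports itself finished.
        if not self.finished and len(self.buffered) >= self.fire_after:
            self.outputs.append(self.buffered)
            self.buffered = []
            self.finished = True

    def garbage_collect(self) -> None:
        # Proposed behavior: whatever is still buffered is emitted as final output.
        if self.buffered:
            self.outputs.append(self.buffered)
            self.buffered = []


def simulate(keep_data_after_finish: bool) -> ToyWindow:
    window = ToyWindow(fire_after=2, keep_data_after_finish=keep_data_after_finish)
    for element in ["a", "b", "c", "d"]:
        window.add(element)
    window.garbage_collect()
    return window


old = simulate(keep_data_after_finish=False)
new = simulate(keep_data_after_finish=True)
print("old:", old.outputs, "dropped:", old.dropped)
print("new:", new.outputs, "dropped:", new.dropped)
```

In this model the old behavior loses "c" and "d", while the proposed behavior
holds them and emits them in a final pane at GC time.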

As with all bugfixes, this is backwards-incompatible (if your pipeline
relies on buggy behavior, it will stop working). So this is a major change
that I wanted to discuss on dev@.

Kenn
