Hi Kenn,

Is there still any use for a trigger finishing? If we no longer drop data, would it be of any use to users? If not, would it be better to remove the functionality completely, so that users who rely on it (and whose pipelines may break) become aware of the change at compile time?

Jan

On 10/30/19 11:26 PM, Kenneth Knowles wrote:
Problem: a trigger can "finish" which causes a window to "close" and drop all remaining data arriving for that window.

This has been discussed many times and I thought it was fixed, but it seems it is not. I cannot find a dedicated Jira or thread for it. But here are some pointers:

 - data loss bug: https://lists.apache.org/thread.html/ce413231d0b7d52019668765186ef27a7ffb69b151fdb34f4bf80b0f@%3Cdev.beam.apache.org%3E
 - user hitting the bug: https://lists.apache.org/thread.html/28879bc80cd5c7ef1a3e38cb1d2c063165d40c13c02894bbccd66aca@%3Cuser.beam.apache.org%3E
 - user confusion: https://lists.apache.org/thread.html/2707aa449c8c6de1c6e3e8229db396323122304c14931c44d0081449@%3Cuser.beam.apache.org%3E
 - thread from 2016 on the topic: https://lists.apache.org/thread.html/5f44b62fdaf34094ccff8da2a626b7cd344d29a8a0fff6eac8e148ea@%3Cdev.beam.apache.org%3E

In theory, trigger finishing was intended for users who can get their answers from a smaller amount of data and then drop the rest. In practice, triggers aren't really expressive enough for this. Stateful DoFn is the solution for these cases.

I've opened https://github.com/apache/beam/pull/9942 which makes the following changes:

 - when a trigger says it is finished, it never fires again but data is still kept
 - at GC time the final output will be emitted
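
To make the difference concrete, here is a toy Python model of a single window. It is not Beam code and does not reflect how the PR is implemented: the class `ToyWindow`, its fields, and the `run` helper are hypothetical names used only for illustration. It assumes a count-based trigger that fires once and then finishes, and that each firing emits and clears the buffered elements.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyWindow:
    """Toy model of one window (hypothetical, not a Beam API)."""
    fire_after: int               # trigger fires once after this many elements, then finishes
    keep_data_after_finish: bool  # False: current behavior, True: behavior described above
    buffer: List[int] = field(default_factory=list)
    finished: bool = False
    outputs: List[List[int]] = field(default_factory=list)
    dropped: List[int] = field(default_factory=list)

    def add(self, element: int) -> None:
        if self.finished:
            if self.keep_data_after_finish:
                # Trigger never fires again, but the data is still kept.
                self.buffer.append(element)
            else:
                # Window is treated as closed: the element is lost.
                self.dropped.append(element)
            return
        self.buffer.append(element)
        if len(self.buffer) >= self.fire_after:
            self.outputs.append(list(self.buffer))
            self.buffer.clear()
            self.finished = True

    def garbage_collect(self) -> None:
        # At GC time, emit whatever is still buffered as the final output.
        if self.buffer:
            self.outputs.append(list(self.buffer))
            self.buffer.clear()


def run(elements: List[int], keep_data_after_finish: bool) -> ToyWindow:
    window = ToyWindow(fire_after=2, keep_data_after_finish=keep_data_after_finish)
    for element in elements:
        window.add(element)
    window.garbage_collect()
    return window


old = run([1, 2, 3, 4, 5], keep_data_after_finish=False)
new = run([1, 2, 3, 4, 5], keep_data_after_finish=True)
print(old.outputs, old.dropped)
print(new.outputs, new.dropped)
```

In this model the first run loses every element that arrives after the trigger finishes, while the second run holds those elements and emits them once when the window is garbage collected.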

As with all bugfixes, this is backwards-incompatible (if your pipeline relies on buggy behavior, it will stop working). So this is a major change that I wanted to discuss on dev@.

Kenn
