Hi Kenn,
Is there still any use for trigger finishing? If we no longer drop
data, would it still be of any use to users? If not, would it be better
to remove the functionality completely, so that users who rely on it
(and whose pipelines might break) find out at compile time?
Jan
On 10/30/19 11:26 PM, Kenneth Knowles wrote:
Problem: a trigger can "finish" which causes a window to "close" and
drop all remaining data arriving for that window.
This has been discussed many times and I thought it was fixed, but it
seems it is not. I cannot find a dedicated Jira issue or thread for
it, but here are some pointers:
- data loss bug:
https://lists.apache.org/thread.html/ce413231d0b7d52019668765186ef27a7ffb69b151fdb34f4bf80b0f@%3Cdev.beam.apache.org%3E
- user hitting the bug:
https://lists.apache.org/thread.html/28879bc80cd5c7ef1a3e38cb1d2c063165d40c13c02894bbccd66aca@%3Cuser.beam.apache.org%3E
- user confusion:
https://lists.apache.org/thread.html/2707aa449c8c6de1c6e3e8229db396323122304c14931c44d0081449@%3Cuser.beam.apache.org%3E
- thread from 2016 on the topic:
https://lists.apache.org/thread.html/5f44b62fdaf34094ccff8da2a626b7cd344d29a8a0fff6eac8e148ea@%3Cdev.beam.apache.org%3E
In theory, trigger finishing was intended for users who can get their
answers from a smaller amount of data and then drop the rest. In
practice, triggers aren't really expressive enough for this. Stateful
DoFn is the solution for these cases.
I've opened https://github.com/apache/beam/pull/9942 which makes the
following changes:
- when a trigger says it is finished, it never fires again but data
is still kept
- at GC time the final output will be emitted
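To make the difference concrete, here is a toy model in plain Python.
It is only a sketch: it does not use the Beam API and is not what the
PR implements. The `ToyWindow` class, its fields, and the "fire once
after N elements, then finish" trigger are all hypothetical, and the
model assumes discarding mode (the buffer is cleared on each firing).

```python
from dataclasses import dataclass, field


@dataclass
class ToyWindow:
    """Toy model of one window's state. Hypothetical; not a Beam class."""
    fire_after: int          # assumed trigger: fire once after N elements, then finish
    keep_after_finish: bool  # False = old behavior, True = behavior described above
    buffer: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    finished: bool = False
    dropped: int = 0

    def add(self, element):
        # Old behavior: once the trigger has finished, the window is "closed"
        # and every later element is silently dropped.
        if self.finished and not self.keep_after_finish:
            self.dropped += 1
            return
        self.buffer.append(element)
        # The trigger fires at most once; after finishing it never fires again.
        if not self.finished and len(self.buffer) >= self.fire_after:
            self.outputs.append(list(self.buffer))
            self.buffer.clear()  # discarding mode (an assumption of this sketch)
            self.finished = True

    def garbage_collect(self):
        # At GC time, whatever is still buffered is emitted as a final output.
        if self.buffer:
            self.outputs.append(list(self.buffer))
            self.buffer.clear()


def run(elements, keep_after_finish, fire_after=2):
    """Feed elements into one toy window, then garbage-collect it."""
    window = ToyWindow(fire_after=fire_after, keep_after_finish=keep_after_finish)
    for element in elements:
        window.add(element)
    window.garbage_collect()
    return window.outputs, window.dropped
```

Calling `run([1, 2, 3, 4], keep_after_finish=False)` models the old
behavior, where elements 3 and 4 are lost; with
`keep_after_finish=True` they are held back and emitted as one final
pane at GC time.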
As with all bugfixes, this is backwards-incompatible (if your pipeline
relies on buggy behavior, it will stop working). So this is a major
change that I wanted to discuss on dev@.
Kenn