[ https://issues.apache.org/jira/browse/BEAM-9487?focusedWorklogId=657826&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-657826 ]
ASF GitHub Bot logged work on BEAM-9487:
----------------------------------------
Author: ASF GitHub Bot
Created on: 30/Sep/21 00:28
Start Date: 30/Sep/21 00:28
Worklog Time Spent: 10m
Work Description: kennknowles commented on a change in pull request
#15603:
URL: https://github.com/apache/beam/pull/15603#discussion_r718808531
##########
File path: sdks/python/apache_beam/transforms/trigger.py
##########
@@ -161,12 +159,42 @@ def with_prefix(self, prefix):
class DataLossReason(Flag):
- """Enum defining potential reasons that a trigger may cause data loss."""
+ """Enum defining potential reasons that a trigger may cause data loss.
+
+ These flags should only cover when the trigger is the cause, though windowing
+ can be taken into account. For instance, AfterWatermark may not flag itself
+ as finishing if the windowing doesn't allow lateness.
+ """
+
+ # Trigger will never be the source of data loss.
NO_POTENTIAL_LOSS = 0
+
+ # Trigger may finish. In this case, data that comes in after the trigger may
+ # be lost. Example: AfterCount(1) will stop firing after the first element.
MAY_FINISH = auto()
+
+ # Trigger has a condition that is not guaranteed to ever be met. In this
+ # case, data that comes in may be lost. Example: AfterCount(42) will lose
+ # 20 records if only 20 come in, since the condition to fire was never met.
CONDITION_NOT_GUARANTEED = auto()
Review comment:
Yea, that comment is outdated. In the current implementation, all
buffered elements must be emitted at GC time; any runner that does not produce
those 32 elements as output is broken.
Incidentally, the data loss that unsafe triggers are concerned with is not this
- it is the case where the trigger fires and then further data is simply
ignored because the trigger "finished".
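Since DataLossReason in the diff above is a Flag enum, reasons compose bitwise, so a trigger can report more than one at once. A minimal stdlib sketch of that behavior (the class here mirrors the diff for illustration; it is not Beam's actual implementation):

```python
from enum import Flag, auto

class DataLossReason(Flag):
    """Illustrative mirror of the flags in the diff above (not Beam's code)."""
    NO_POTENTIAL_LOSS = 0
    MAY_FINISH = auto()
    CONDITION_NOT_GUARANTEED = auto()

# Flags compose with bitwise OR, so a trigger can report several reasons.
reason = DataLossReason.MAY_FINISH | DataLossReason.CONDITION_NOT_GUARANTEED

# Membership tests pick individual reasons back out of the combined value.
assert DataLossReason.MAY_FINISH in reason
assert reason != DataLossReason.NO_POTENTIAL_LOSS
```

A zero-valued member like NO_POTENTIAL_LOSS is the conventional way to spell "no flags set" in a Flag enum, which is why the safe case is 0 rather than auto().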
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 657826)
Time Spent: 32h (was: 31h 50m)
> GBKs on unbounded pcolls with global windows and no triggers should fail
> ------------------------------------------------------------------------
>
> Key: BEAM-9487
> URL: https://issues.apache.org/jira/browse/BEAM-9487
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Udi Meiri
> Assignee: Zachary Houfek
> Priority: P1
> Labels: EaseOfUse, starter
> Fix For: 2.31.0
>
> Time Spent: 32h
> Remaining Estimate: 0h
>
> This, according to "4.2.2.1 GroupByKey and unbounded PCollections" in
> https://beam.apache.org/documentation/programming-guide/.
> bq. If you do apply GroupByKey or CoGroupByKey to a group of unbounded
> PCollections without setting either a non-global windowing strategy, a
> trigger strategy, or both for each collection, Beam generates an
> IllegalStateException error at pipeline construction time.
> Example where this doesn't happen in Python SDK:
> https://stackoverflow.com/questions/60623246/merge-pcollection-with-apache-beam
> I also believe that this unit test should fail, since test_stream is
> unbounded, uses the global window, and has no triggers.
> {code}
> def test_global_window_gbk_fail(self):
>   with TestPipeline() as p:
>     test_stream = TestStream()
>     _ = p | test_stream | GroupByKey()
> {code}
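The programming-guide rule quoted above amounts to a simple construction-time predicate: an unbounded PCollection with global windowing and no trigger should make GroupByKey fail. A minimal sketch of that check (the helper and its parameter names are hypothetical, not Beam's actual API):

```python
# Hypothetical sketch of the construction-time safety check described above.
# The function and parameter names (check_gbk_safety, is_bounded,
# is_global_window, has_trigger) are illustrative, not Beam's API.

def check_gbk_safety(is_bounded: bool, is_global_window: bool,
                     has_trigger: bool) -> None:
    """Raise if a GroupByKey on this input could buffer data forever."""
    if not is_bounded and is_global_window and not has_trigger:
        raise ValueError(
            "GroupByKey cannot be applied to an unbounded PCollection with "
            "global windowing and no trigger")

# Bounded input, or unbounded input with a trigger, passes the check.
check_gbk_safety(is_bounded=True, is_global_window=True, has_trigger=False)
check_gbk_safety(is_bounded=False, is_global_window=True, has_trigger=True)
```

The unit test in the issue description would correspond to the failing case: TestStream is unbounded, in the global window, with no trigger, so construction should raise rather than succeed.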
--
This message was sent by Atlassian Jira
(v8.3.4#803005)