[
https://issues.apache.org/jira/browse/BEAM-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kenneth Knowles updated BEAM-7825:
----------------------------------
This Jira ticket has a pull request attached to it, but is still open. Did the
pull request resolve the issue? If so, could you please mark it resolved? This
will help the project have a clear view of its open issues.
> Python's DirectRunner emits multiple panes per window and does not discard
> late data
> ------------------------------------------------------------------------------------
>
> Key: BEAM-7825
> URL: https://issues.apache.org/jira/browse/BEAM-7825
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Affects Versions: 2.13.0
> Environment: OS: Debian rodete.
> Beam versions: 2.15.0.dev.
> Python versions: Python 2.7, Python 3.7
> Reporter: Alexey Strokach
> Priority: P3
> Time Spent: 5h 10m
> Remaining Estimate: 0h
>
> The documentation for Beam's Windowing and Triggers functionality [states
> that|https://beam.apache.org/documentation/programming-guide/#triggers] _"if
> you use Beam’s default windowing configuration and default trigger, Beam
> outputs the aggregated result when it estimates all data has arrived, and
> discards all subsequent data for that window"_. However, it seems that the
> current behavior of Python's DirectRunner is inconsistent with both of those
> points. As the {{StreamingWordGroupIT.test_discard_late_data}} test shows,
> DirectRunner appears to process every data point that it reads from the input
> stream, irrespective of whether or not the timestamp of that data point is
> older than the timestamps of the windows that have already been processed.
> Furthermore, as the {{StreamingWordGroupIT.test_single_output_per_window}}
> test shows, DirectRunner generates multiple "panes" for the same window,
> apparently disregarding the watermark.
> The Dataflow runner passes both of those end-to-end tests.
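
For reference, the behavior the quoted documentation describes can be modeled in a few lines of plain Python. This is only a sketch under stated assumptions, not Beam's implementation and not the code of the tests named above: the event tuples, the 10-second fixed windows, zero allowed lateness, and the names window_start and run_default_trigger are all hypothetical, chosen for illustration. Each window emits exactly one pane, once the watermark passes its end, and anything arriving for that window afterwards is dropped.

```python
from collections import defaultdict

WINDOW_SIZE = 10  # seconds; hypothetical fixed-window size for this sketch


def window_start(timestamp, size=WINDOW_SIZE):
    """Return the start of the fixed window [start, start + size) holding timestamp."""
    return timestamp - (timestamp % size)


def run_default_trigger(events, size=WINDOW_SIZE):
    """Toy model of the documented default-trigger semantics (not Beam code).

    events is a list, in arrival order, of either
      ("data", event_timestamp, value) or ("watermark", timestamp).
    Returns (panes, dropped): one pane per window, plus the late elements.
    """
    buffered = defaultdict(list)  # window start -> values waiting to be emitted
    panes = []                    # (window start, values), one entry per window
    dropped = []                  # (timestamp, value) pairs discarded as late
    watermark = float("-inf")

    for event in events:
        if event[0] == "watermark":
            watermark = max(watermark, event[1])
            # Fire each buffered window whose end the watermark has reached.
            for start in sorted(buffered):
                if start + size <= watermark:
                    panes.append((start, buffered.pop(start)))
            continue

        _, timestamp, value = event
        start = window_start(timestamp, size)
        if start + size <= watermark:
            # The window has already closed: discard instead of re-firing.
            dropped.append((timestamp, value))
        else:
            buffered[start].append(value)

    return panes, dropped


events = [
    ("data", 1, "a"),
    ("data", 4, "b"),
    ("watermark", 10),    # closes window [0, 10)
    ("data", 3, "late"),  # belongs to the closed window [0, 10)
    ("data", 12, "c"),
    ("watermark", 20),    # closes window [10, 20)
]
panes, dropped = run_default_trigger(events)
print(panes)    # [(0, ['a', 'b']), (10, ['c'])]
print(dropped)  # [(3, 'late')]
```

Under this model, the behavior reported for DirectRunner would correspond to a second pane for window [0, 10) that contains the late element, rather than an entry in dropped.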