[
https://issues.apache.org/jira/browse/BEAM-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ismaël Mejía updated BEAM-7825:
-------------------------------
Status: Open (was: Triage Needed)
> Python's DirectRunner emits multiple panes per window and does not discard
> late data
> ------------------------------------------------------------------------------------
>
> Key: BEAM-7825
> URL: https://issues.apache.org/jira/browse/BEAM-7825
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Affects Versions: 2.13.0
> Environment: OS: Debian rodete.
> Beam versions: 2.15.0.dev.
> Python versions: Python 2.7, Python 3.7
> Reporter: Alexey Strokach
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> The documentation for Beam's Windowing and Triggers functionality [states
> that|https://beam.apache.org/documentation/programming-guide/#triggers] _"if
> you use Beam’s default windowing configuration and default trigger, Beam
> outputs the aggregated result when it estimates all data has arrived, and
> discards all subsequent data for that window"_. However, it seems that the
> current behavior of Python's DirectRunner is inconsistent with both of those
> points. In my experience, DirectRunner will process every data point that it
> reads from the input stream, irrespective of whether or not the timestamp of
> that data point is older than the timestamps of the windows that have already
> been processed. Furthermore, it regularly generates multiple "panes" for the
> same window, apparently disregarding the notion of a watermark?
> An integration test demonstrating the inconsistencies between DirectRunner
> and Dataflow is provided in the linked PR.
> Until the limitations of DirectRunner are addressed, maybe they should be
> listed on the [DirectRunner documentation
> page](https://beam.apache.org/documentation/runners/direct/)?
> As far as I understand (I have not seen this explicitly documented anywhere),
> in the case of Cloud Dataflow, the pipeline will first process all elements
> that have accumulated in a PubSub subscription before the start of the
> pipeline, and will then process all new elements which have a timestamp
> within a certain narrow range of the current time (UTC). Would this be the
> behavior that DirectRunner should be trying to emulate? While Dataflow does
> emit a single pane per window (by default) and discards late data, it may be
> too eager in what data it calls "late"?
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)