[ 
https://issues.apache.org/jira/browse/BEAM-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-7825:
-------------------------------
    Status: Open  (was: Triage Needed)

> Python's DirectRunner emits multiple panes per window and does not discard 
> late data
> ------------------------------------------------------------------------------------
>
>                 Key: BEAM-7825
>                 URL: https://issues.apache.org/jira/browse/BEAM-7825
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>    Affects Versions: 2.13.0
>         Environment: OS: Debian rodete.
> Beam versions: 2.15.0.dev.
> Python versions: Python 2.7, Python 3.7
>            Reporter: Alexey Strokach
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The documentation for Beam's Windowing and Triggers functionality [states 
> that|https://beam.apache.org/documentation/programming-guide/#triggers] _"if 
> you use Beam’s default windowing configuration and default trigger, Beam 
> outputs the aggregated result when it estimates all data has arrived, and 
> discards all subsequent data for that window"_. However, it seems that the 
> current behavior of Python's DirectRunner is inconsistent with both of those 
> points. In my experience, DirectRunner will process every data point that it 
> reads from the input stream, irrespective of whether or not the timestamp of 
> that data point is older than the timestamps of the windows that have already 
> been processed. Furthermore, it regularly generates multiple "panes" for the 
> same window, apparently disregarding the notion of a watermark?
> An integration test demonstrating the inconsistencies between DirectRunner 
> and Dataflow is provided in the linked PR.
> Until the limitations of DirectRunner are addressed, maybe they should be 
> listed on the [DirectRunner documentation 
> page](https://beam.apache.org/documentation/runners/direct/)?
> As far as I understand (I have not seen this explicitly documented anywhere), 
> in the case of Cloud Dataflow, the pipeline will first process all elements 
> that have accumulated in a PubSub subscription before the start of the 
> pipeline, and will then process all new elements which have a timestamp 
> within a certain narrow range of the current time (UTC). Would this be the 
> behavior that DirectRunner should be trying to emulate? While Dataflow does 
> emit a single pane per window (by default) and discards late data, it may be 
> too eager in what data it calls "late"?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Reply via email to