This is very much runner dependent, but generally the gradual output will
not occur. The timestamps of elements within the pipeline should be used to
describe when the event the element represents occurs, and a watermark
describes when all of the data that occurred before a timestamp have
arrived. As a result, the watermark advances based on the data you have
read, not the movement of "wall clock time", which means that as processing
speeds up, you can have elements emitted significantly faster than you
would if the processing occurred instantaneously in real time - once you
have all of the data, you know you cannot receive more data - so as soon as
you've processed everything you have, you're done (and can emit all of your
output).

For the latter - the runner can read the entirety of the bounded source,
which permits it to advance the watermark of downstream transforms to the
timestamp of the earliest unprocessed element. That in turn permits it to
emit output for a window when it knows it has processed the entirety of the
data for that window (because all of the elements have been read, the
source's watermark is at positive infinity, and does not hold back any
downstream computation). It is not, however, required to do this - the
runner can buffer as much as it thinks is "reasonable" - which permits some
optimizations when processing only bounded data.

On Fri, Apr 28, 2017 at 10:41 AM, Shen Li <[email protected]> wrote:

> Hi Thomas,
>
> Thanks for the explanation. Does it mean I cannot reproduce the real-time
> behavior of the replayed trace?  Say the watermarks are perfect and
> FixedWindows groups elements into 1-minute windows, will the watermarks
> trigger the FixedWindows to fire roughly every minute?
>
> I am a little confused about the "when available" behavior of the runner.
> Since the watermarks emitted by the BoundedSource will always be
> BoundedWindow.TIMESTAMP_MIN_VALUE except for the last watermark, how could
> the runner know when to trigger the computation on a window?
>
> Thanks,
>
> Shen
>
> On Fri, Apr 28, 2017 at 1:13 PM, Thomas Groh <[email protected]>
> wrote:
>
> > You can't directly control the watermark that a BoundedSource emits.
> > Windowing into FixedWindows will still work as you expect, however: your
> > elements will be assigned to their windows based on the time the event
> > occurred. Depending on the runner, triggers may be run either "when
> > available" or after all the work is completed, but your output data will
> be
> > as if you had a perfect watermark.
> >
> > On Fri, Apr 28, 2017 at 10:09 AM, Shen Li <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Say I want to replay a data trace of last week using fixed windows. The
> > > data trace is read from a file using TextIO. In order to trigger
> windows
> > at
> > > right times, how can I control the watermark emitted by the
> > BoundedSource?
> > >
> > > Thanks,
> > >
> > > Shen
> > >
> >
>

Reply via email to