[jira] [Comment Edited] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner

2017-06-27 Thread Eugene Kirpichov (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065429#comment-16065429
 ] 

Eugene Kirpichov edited comment on BEAM-2140 at 6/27/17 8:38 PM:
-

Okay, I see that I misunderstood what the watermark hold does, and now I'm not 
sure how anything works at all (i.e. why timers set by SDF are not constantly 
dropped) - in direct and dataflow runner :-|

For SDF specifically, I think it would make sense to *advance the watermark of
the input the same way as if the DoFn were not splittable* - i.e. consider the
input element "consumed" only when the ProcessElement call terminates with no
residual restriction. In other words, I guess, set an "input watermark hold"
(in addition to the output watermark hold)? Is such a thing possible? Does it
make equal sense for non-splittable DoFns that use timers?
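
To make the "input watermark hold" idea concrete, here is a minimal toy model in plain Java. It is not Beam code and every name in it is hypothetical. It assumes the effective input watermark is the minimum of the upstream watermark and the timestamps of all elements whose last ProcessElement call left a residual restriction.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of an "input watermark hold"; not Beam API, all names are hypothetical. */
final class InputHoldSketch {
    // Timestamps (millis) of input elements that still have a residual restriction.
    private final Map<String, Long> pending = new HashMap<>();
    private long upstreamWatermark = Long.MIN_VALUE;

    /** Upstream watermarks only move forward. */
    void advanceUpstream(long watermark) {
        upstreamWatermark = Math.max(upstreamWatermark, watermark);
    }

    /** Called after a ProcessElement call; hasResidual == false means the element is consumed. */
    void onProcessElementDone(String elementId, long timestamp, boolean hasResidual) {
        if (hasResidual) {
            pending.put(elementId, timestamp);
        } else {
            pending.remove(elementId);
        }
    }

    /** Effective input watermark: held back by the oldest element that is not yet consumed. */
    long effectiveInputWatermark() {
        long hold = Long.MAX_VALUE;
        for (long ts : pending.values()) {
            hold = Math.min(hold, ts);
        }
        return Math.min(upstreamWatermark, hold);
    }

    public static void main(String[] args) {
        InputHoldSketch sketch = new InputHoldSketch();
        sketch.advanceUpstream(100);
        sketch.onProcessElementDone("a", 10, true);   // residual remains, hold at 10
        System.out.println(sketch.effectiveInputWatermark());
        sketch.onProcessElementDone("a", 10, false);  // no residual, hold released
        System.out.println(sketch.effectiveInputWatermark());
    }
}
```

Under this assumption, a timer set while a residual is outstanding would never see an input watermark past the element's own timestamp, which is the property the question above is asking for.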


> Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
> ---
>
> Key: BEAM-2140
> URL: https://issues.apache.org/jira/browse/BEAM-2140
> Project: Beam
> Issue Type: Bug
> Components: runner-flink
> Reporter: Aljoscha Krettek
> Assignee: Aljoscha Krettek
>
> As discovered as part of BEAM-1763, there is a failing SDF test. We disabled 
> the tests to unblock the open PR for BEAM-1763.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Comment Edited] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner

2017-06-27 Thread Kenneth Knowles (JIRA)

[
https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065235#comment-16065235
]

Kenneth Knowles edited comment on BEAM-2140 at 6/27/17 6:14 PM:

I considered for a long time what should happen with processing time triggers
as far as window expiry. We spent quite some time coming up with the semantics
at https://s.apache.org/beam-lateness#heading=h.hot1g47sz45s, long before Beam.
I don't claim it is perfect (it is way too complex, for one) but it represents
a lot of thought by lots of people. I think actually it does give some choices.

* an input being droppable does not necessarily mean you are required to drop
it (some transforms may falter on droppable inputs, but that is specific to the
transform)
* input timestamp and output timestamp are decoupled, so you can reason about
whether to ignore input based on whether the resulting output would be droppable

Some possibilities that I've considered (some break the model):

*Treat processing time timers as inputs with some timestamp at EOW or some such*

The theme that timers are inputs is basically valid. We gain clarity by not
conflating them in APIs and discussions. But how they interact with watermarks,
etc, should be basically compatible. Currently processing time timers are
treated as inputs with a timestamp equal to the input watermark at the moment
of their arrival. So this change would cause an input hold because there is a
known upcoming element that just hasn't arrived.
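
A small sketch may help compare the two timestamping choices. It is plain Java with hypothetical names, not Beam API, and it assumes a simplified expiry rule: an input is droppable once its timestamp is past the end of the window plus the allowed lateness.

```java
/** Toy comparison of two ways to timestamp a processing-time timer; hypothetical names. */
final class TimerTimestampSketch {
    /** Current behavior as described above: the timer takes the input watermark at arrival. */
    static long currentTimestamp(long inputWatermarkAtArrival) {
        return inputWatermarkAtArrival;
    }

    /** Alternative under discussion: the timer is stamped at the end of its window. */
    static long proposedTimestamp(long windowEnd) {
        return windowEnd;
    }

    /** Assumed, simplified expiry rule; the real model has more cases. */
    static boolean isDroppable(long timestamp, long windowEnd, long allowedLateness) {
        return timestamp > windowEnd + allowedLateness;
    }

    public static void main(String[] args) {
        long windowEnd = 100;
        long allowedLateness = 0;
        long watermarkAtArrival = 150; // the watermark has already passed the window
        System.out.println(isDroppable(currentTimestamp(watermarkAtArrival), windowEnd, allowedLateness));
        System.out.println(isDroppable(proposedTimestamp(windowEnd), windowEnd, allowedLateness));
    }
}
```

With the current rule the late-arriving timer lands after expiry and is droppable. With the end-of-window timestamp it is never droppable, but only because the pending timer keeps the input watermark from passing the window end.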

In streaming: this holds things up too much. It also makes repeatedly firing
after processing time cause an infinite loop, versus what happens today where
it naturally goes through window expiry and GC.

In batch: this breaks the unified model for processing historical data in a
batch mode. With the semantics as they exist today, the way that batch "runs"
triggers and processing time timers (by ignoring them) is completely compatible
with the semantics. So any user who writes a correct transform has good
assurances that it will work in both modes. If processing time timers held
watermarks like this they would need to be processed in batch mode, yet that
contradicts the whole point of batch mode.

We can omit unbounded SDFs from this unification issue, probably, but a
bounded-per-element SDF should certainly work on streamed unbounded input as
well as bounded input.

*Decide whether to drop a processing time timer not based on the input
watermark but based on whether its output would be droppable*

This lets the input watermark advance, but still does not allow infinitely
repeating processing time timers to terminate with window expiry automatically,
and it still breaks the unified model. We could alleviate both issues by
refusing to set new timers that would already be expired. I think this is just
a rabbit hole of unnatural corner cases so we should avoid it.

*In addition to the processing time timers that ProcessFn sets, also set a GC
timer*

This seems like a simple, straightforward, and good idea. These timers are also
still run in batch mode for historical reprocessing.

Can you clarify how it does not work? Is it because you need to create a "loop"
that continues to fire until the residual is gone? Currently, there is simply
no way to make a perpetual loop with timers because of the commentary below.

*Treat event time timers as inputs with their given timestamp*

This would combine the GC timer idea and let you make a looping structure. This
currently cannot work because timers fire only when the input watermark is
strictly greater than their timestamp. The semantics of "on time" and "final
GC" panes depend on this, so we'd have a lot of work to do. But I think there
might be a consistent world where event time timers are treated as elements,
and fire when the watermark arrives at their timestamp. {{@OnWindowExpiration}}
is then absolutely required and cannot be simulated by a timer.
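
One reading of why the strict comparison blocks this idea can be sketched in plain Java. The names are hypothetical and this is not Beam API. The sketch assumes that a pending timer treated as an input holds the input watermark at its own timestamp.

```java
/** Toy model of the two firing rules for event-time timers; hypothetical names. */
final class TimerFiringSketch {
    /** Current rule as described above: fire only when the watermark is strictly past the timestamp. */
    static boolean firesStrict(long inputWatermark, long timerTimestamp) {
        return inputWatermark > timerTimestamp;
    }

    /** Alternative: treated like an element, the timer fires when the watermark arrives at its timestamp. */
    static boolean firesOnArrival(long inputWatermark, long timerTimestamp) {
        return inputWatermark >= timerTimestamp;
    }

    /** Assumption: a pending timer that counts as an input holds the watermark at its timestamp. */
    static long heldWatermark(long upstreamWatermark, long timerTimestamp) {
        return Math.min(upstreamWatermark, timerTimestamp);
    }

    public static void main(String[] args) {
        long timerTimestamp = 100;
        long watermark = heldWatermark(500, timerTimestamp); // cannot pass the pending timer
        System.out.println(firesStrict(watermark, timerTimestamp));
        System.out.println(firesOnArrival(watermark, timerTimestamp));
    }
}
```

Under that assumption the strict rule never fires, because the held watermark can reach the timer's timestamp but never exceed it, while the "fires on arrival" rule does fire.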

was (Author: kenn):
I considered for a long time what should happen with processing time triggers
as far as window expiry. We spent quite some time coming up with the semantics
at https://s.apache.org/beam-lateness#heading=h.hot1g47sz45s, long before Beam.
I don't claim it is perfect (it is way too complex, for one) but it represents
a lot of thought by lots of people. I think actually it does give some choices.

Some possibilities that I think don't break the model:

*Treat processing time timers as inputs with some timestamp at EOW or some such*

The theme that timers are inputs is

[jira] [Comment Edited] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner

2017-06-27 Thread Kenneth Knowles (JIRA)

[
https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065235#comment-16065235
]

Kenneth Knowles edited comment on BEAM-2140 at 6/27/17 6:14 PM:

Some possibilities for SDF: