+1 Windowing is not operational, and in fact doesn't rely on the notions of watermarks at all. Watermarks are an operational tool used to compute windowing for streaming pipelines, and failure to respect the perfect watermark results in a failure to correctly compute windowing. We then thrown the operational notion of triggering to give pipeline authors the ability to control and try to deal with these failures.
On Thu, Dec 7, 2017 at 8:53 AM, Reuven Lax <[email protected]> wrote: > My point is that windowing in the abstract is not fundamentally > operational, as proven by the fact that in batch pipelines windowing is > perfectly determined by the data. In streaming pipelines things turn a bit > more operational, but that's not fundamental to the data model. In fact, you > can model the watermark as a perfect magical oracle (you can just define it > as a function from a prefix of the stream to a timestamp, not involving > processing time at all); this isn't true in real life of course, but it does > allow the watermark to be modeled in a non-operational way. > > On Thu, Dec 7, 2017 at 1:02 AM, Kenneth Knowles <[email protected]> wrote: >> >> On Wed, Dec 6, 2017 at 10:44 PM, Robert Bradshaw <[email protected]> >> wrote: >>> >>> So, to answer the original question, I don't think windowing is >>> operational, as given a particular windowing there is one and only one >>> right output for a given input. >> >> >> I just want to highlight that this is an operational/semantic blend. The >> watermark determines what is in or is not in a PCollection, and the >> watermark is fundamentally operational, since it is a relation between event >> time and processing time. I am in complete agreement with your statement >> _for a particular watermark curve_. >> >> Or, more aggressively, if we say that dropped elements never existed - by >> definition, and according to the watermark curve - then every window is >> bounded (or the pipeline is stuck) and there is one and only one right >> output and it will match a batch job over the same. >> >> This touches on what I mean by the simplicity of dropping data based on >> (watermark - timestamp) instead of (watermark - window expiry). When >> dropping is based on window expiry, the relationship between real-time >> results and archival results is not as clear. >> >> Kenn >> >> >>> >>> This doesn't mean that it couldn't be >>> interesting to explore expressing the desired windowing on the output >>> of a transformation and letting it flow up rather than expressing it >>> on the input and letting it flow down. (There are technical >>> implications here--an output consumed in multiple ways may need to be >>> executed multiple times despite it occurring once in the graph which >>> could be counter-intuitive, though even just changing the triggering >>> could require this, and runners could probably often be intelligent >>> about sharing work between similar-enough windowing. It also means the >>> choice of windowing couldn't be inspected during construction (though >>> again we may have crossed that bridge if we can't inspect the choice >>> of triggering).) >>> >>> Triggering, on the other hand, is very operational, and intention more >>> easily defined at a sink (and propagated upwards) rather than the way >>> we do now, so a huge +1 to this proposal. >>> >>> - Robert >> >> >
