On Wed, Dec 6, 2017 at 9:45 PM, Reuven Lax <[email protected]> wrote:

>>>> Ignoring merging, one perspective is that the window is just a key
>>>> with a deadline.
>>>
>>> That is only true when performing an aggregation. Records can be
>>> associated with a window, and do not require keys at that point. The
>>> "deadline" only applies when something like a GBK is assigned.
>>
>> Yea, that situation -- windows assigned but no aggregation yet -- is
>> analogous to data being a KV prior to the GBK. The main function that
>> windows actually serve in the life of data processing is to allow
>> aggregations over unbounded data with bounded resources. Only
>> aggregation really needs them - if you just have a pass-through
>> sequence of ParDos, windows don't really do anything.
>
> I disagree. There are multiple instances where windowing is used
> without an aggregation after. Fundamentally, windowing is a function
> on elements. This function is used during aggregations to bound them,
> but it makes sense on its own. Thinking of windowing as a "timeout"
> makes for an intuitive model, but I don't think it's really the right
> model. For one thing, that intuitive model makes less sense in batch.
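To make the two readings concrete, a minimal Beam Java sketch (Window,
FixedWindows, and GroupByKey are the standard SDK API; `events` is a
hypothetical input PCollection<KV<String, Long>>):

    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Windows are assigned here, but nothing expires yet: a
    // window-oblivious, pass-through ParDo downstream would see
    // exactly the same elements either way.
    PCollection<KV<String, Long>> windowed = events.apply(
        Window.<KV<String, Long>>into(
            FixedWindows.of(Duration.standardMinutes(1))));

    // Only at the GBK does the window act like part of the key:
    // grouping is effectively by (key, window), and the window's end
    // plus allowed lateness supplies the "deadline" for the group.
    PCollection<KV<String, Iterable<Long>>> grouped =
        windowed.apply(GroupByKey.<String, Long>create());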
What are the instances where windowing is used without an aggregation?

Kenn

>> Kenn
>>
>>>> From this perspective, the distinction between key and window is
>>>> not important; you could just say that GBK requires the composite
>>>> key for a group to eventually expire (in SQL terms, you just need
>>>> one of the GROUP BY arguments to provide the deadline, and they are
>>>> otherwise all on equal footing). And so the window is just as much
>>>> a part of the data as the key. Without merging, once it is assigned
>>>> you don't need to keep around the WindowFn or any such. Of course,
>>>> our way of automatically propagating windows from inputs to
>>>> outputs, akin to making MapValues the default mode of computation,
>>>> requires the window to be a distinguished secondary key.
>>>>
>>>> Another way I think about it is that the windowing + watermark +
>>>> allowed lateness defines which elements are a part of a PCollection
>>>> and which are not. Dropped data semantically never existed in the
>>>> first place. This was actually independent of windowing before the
>>>> "window expiration" model of dropping data. I still think window
>>>> expiration + GC + dropping go together nicely, and drop less data
>>>> needlessly, but just dropping data behind the watermark + allowed
>>>> lateness has some appeal for isolating the operational aspect here.
>>>>
>>>> Operationally, you might take the view that the act of expiration
>>>> and dropping all remaining data is a configuration on the GBK. Then
>>>> the WindowingStrategy, like windows and KV, is a plumbing device to
>>>> reach a GBK that may be deep in a composite (which is certainly
>>>> true anyhow). I don't really like this, because I would like the
>>>> output of a GBK to be a straightforward function of its input - in
>>>> the unbounded case I would like it to be specified as having to
>>>> agree with the bounded spec for any finite prefix. I'm not sure if
>>>> an operational view is amenable to this. If they both work, then
>>>> being able to switch perspectives back and forth would be cool.
>>>>
>>>> I think there are some inconsistencies in the above intuitions, and
>>>> then there's merging...
>>>>
>>>> Kenn
>>>>
>>>>> Also, I think anyone reading this document really ought to at
>>>>> least skim the (linked from there)
>>>>> http://s.apache.org/beam-streams-tables and internalize the idea
>>>>> of "PCollections as changelogs, aggregations as tables on which
>>>>> the changelog acts". It probably would be good to rewrite our
>>>>> documentation with this in mind: even with my experience on the
>>>>> Beam team, this simple idea made it much easier for me to think
>>>>> clearly about all the concepts.
>>>>>
>>>>> I'm very excited about both of these ideas; I think they rival in
>>>>> importance the idea of batch/streaming unification and will end up
>>>>> being a fundamental part of the future of the Beam model.
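The streams-and-tables reading is easy to see in code. A minimal
sketch, reusing `windowed` from the sketch upthread (Combine.perKey and
Sum.ofLongs are the standard SDK API):

    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // The input PCollection is a changelog; the aggregation is a
    // logical (key -> sum) table that the changelog acts on. Each pane
    // the aggregation emits is an update to one row of that table:
    // repeatedly with early firings, once per window with the default
    // trigger.
    PCollection<KV<String, Long>> sums =
        windowed.apply(Combine.perKey(Sum.ofLongs()));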
>>>>> On Thu, Nov 30, 2017 at 8:52 PM Jean-Baptiste Onofré
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Kenn,
>>>>>>
>>>>>> very interesting idea. It sounds more usable and "logical".
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On 11/30/2017 09:06 PM, Kenneth Knowles wrote:
>>>>>> > Hi all,
>>>>>> >
>>>>>> > Triggers are one of the more novel aspects of Beam's support
>>>>>> > for unbounded data. They are also one of the most challenging
>>>>>> > aspects of the model.
>>>>>> >
>>>>>> > Ben & I have been working on a major new idea for how triggers
>>>>>> > could work in the Beam model. We think it will make triggers
>>>>>> > much more usable, create new opportunities for no-knobs
>>>>>> > execution/optimization, and improve compatibility with DSLs
>>>>>> > like SQL. (It also eliminates a whole class of bugs.)
>>>>>> >
>>>>>> > Triggering is for sinks!
>>>>>> >
>>>>>> > https://s.apache.org/beam-sink-triggers
>>>>>> >
>>>>>> > Please take a look at this "1"-pager and give feedback.
>>>>>> >
>>>>>> > Kenn
>>>>>>
>>>>>> --
>>>>>> Jean-Baptiste Onofré
>>>>>> [email protected]
>>>>>> http://blog.nanthrax.net
>>>>>> Talend - http://www.talend.com
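For readers without Beam trigger background, a sketch of the status quo
the doc is rethinking: today the trigger is specified on the Window
transform, as part of the WindowingStrategy, far from the sink that
actually cares about output latency (all standard Beam Java; `events`
is the hypothetical input from the sketches above):

    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Current model: the trigger travels with the windowing, set where
    // windows are assigned. Sink triggers would move this knob to the
    // sink instead.
    PCollection<KV<String, Long>> triggered = events.apply(
        Window.<KV<String, Long>>into(
                FixedWindows.of(Duration.standardMinutes(1)))
            // Speculative pane 30s after the first element in a pane,
            // then a final pane when the watermark passes the window
            // end.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30))))
            .withAllowedLateness(Duration.standardMinutes(10))
            .accumulatingFiredPanes());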
