On Tue, Dec 5, 2017 at 2:16 PM, Kenneth Knowles <[email protected]> wrote:
>
> On Tue, Dec 5, 2017 at 1:50 PM, Reuven Lax <[email protected]> wrote:
>>
>> On Tue, Dec 5, 2017 at 12:00 PM, Kenneth Knowles <[email protected]>
>> wrote:
>>>
>>> On Tue, Dec 5, 2017 at 11:32 AM, Eugene Kirpichov <[email protected]>
>>> wrote:
>>>>
>>>> Btw, do you think windowing should also be separate from a PCollection?
>>>> I'm not sure whether it should be considered operational or not.
>>>
>>> Interesting question. I think that windowing is more coupled to a
>>> PCollection, but I still mull this over once in a while. Here are some
>>> fairly unedited thoughts:
>>>
>>> Ignoring merging, one perspective is that the window is just a key with
>>> a deadline.
>>
>> That is only true when performing an aggregation. Records can be
>> associated with a window and do not require keys at that point. The
>> "deadline" only applies when something like a GBK is applied.
>
> Yea, that situation -- windows assigned but no aggregation yet -- is
> analogous to data being a KV prior to the GBK. The main function that
> windows actually serve in the life of data processing is to allow
> aggregations over unbounded data with bounded resources. Only aggregation
> really needs them - if you just have a pass-through sequence of ParDos,
> windows don't really do anything.

I disagree. There are multiple instances where windowing is used without an
aggregation afterwards. Fundamentally, windowing is a function on elements.
This function is used during aggregations to bound them, but it makes sense
on its own.

Thinking of windowing as a "timeout" makes for an intuitive model, but I
don't think it's really the right model. For one thing, that intuitive
model makes less sense in batch.
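To make that concrete, here is a minimal sketch of the kind of pipeline I
mean, using the Beam Java SDK (class name and example data are illustrative
only): windows are assigned up front, and a plain DoFn then observes the
window on each element, with no GroupByKey anywhere downstream.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class WindowingWithoutAggregation {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.timestamped(
            TimestampedValue.of("a", new Instant(0L)),
            TimestampedValue.of("b", new Instant(30_000L)),
            TimestampedValue.of("c", new Instant(90_000L))))
        // Window assignment is a pure function from (element, timestamp)
        // to window(s). No key and no GroupByKey are involved.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        // A plain DoFn can observe the assigned window directly, e.g. to
        // route each element to a per-window destination. Still no
        // aggregation anywhere in the pipeline.
        .apply(ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c, BoundedWindow window) {
            c.output(window + ": " + c.element());
          }
        }));

    p.run().waitUntilFinish();
  }
}

Until something like a GBK comes along and uses it as part of the grouping
key, the window is just more data riding on the element.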
> Kenn
>
>>> From this perspective, the distinction between key and window is not
>>> important; you could just say that GBK requires the composite key for a
>>> group to eventually expire (in SQL terms, you just need one of the GROUP
>>> BY arguments to provide the deadline, and they are otherwise all on equal
>>> footing). And so the window is just as much a part of the data as the
>>> key. Without merging, once it is assigned you don't need to keep around
>>> the WindowFn or any such. Of course, our way of automatically propagating
>>> windows from inputs to outputs, akin to making MapValues the default mode
>>> of computation, requires the window to be a distinguished secondary key.
>>>
>>> Another way I think about it is that the windowing + watermark + allowed
>>> lateness defines which elements are a part of a PCollection and which are
>>> not. Dropped data semantically never existed in the first place. This was
>>> actually independent of windowing before the "window expiration" model of
>>> dropping data. I still think window expiration + GC + dropping go
>>> together nicely, and drop less data needlessly, but just dropping data
>>> behind the watermark + allowed lateness has some appeal for isolating the
>>> operational aspect here.
>>>
>>> Operationally, you might take the view that the act of expiration and
>>> dropping all remaining data is a configuration on the GBK. Then the
>>> WindowingStrategy, like windows and KV, is a plumbing device to reach a
>>> GBK that may be deep in a composite (which is certainly true anyhow). I
>>> don't really like this, because I would like the output of a GBK to be a
>>> straightforward function of its input - in the unbounded case I would
>>> like it to be specified as having to agree with the bounded spec for any
>>> finite prefix. I'm not sure if an operational view is amenable to this.
>>> If they both work, then being able to switch perspectives back and forth
>>> would be cool.
>>>
>>> I think there are some inconsistencies in the above intuitions, and then
>>> there's merging...
>>>
>>> Kenn
>>>
>>>> Also, I think anyone reading this document really ought to at least
>>>> skim the (linked from there) http://s.apache.org/beam-streams-tables
>>>> and internalize the idea of "PCollections as changelogs, aggregations
>>>> as tables on which the changelog acts". It probably would be good to
>>>> rewrite our documentation with this in mind: even with my experience on
>>>> the Beam team, this simple idea made it much easier for me to think
>>>> clearly about all the concepts.
>>>>
>>>> I'm very excited about both of these ideas; I think they rival in
>>>> importance the idea of batch/streaming unification and will end up
>>>> being a fundamental part of the future of the Beam model.
>>>>
>>>> On Thu, Nov 30, 2017 at 8:52 PM Jean-Baptiste Onofré <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Kenn,
>>>>>
>>>>> Very interesting idea. It sounds more usable and "logic".
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 11/30/2017 09:06 PM, Kenneth Knowles wrote:
>>>>> > Hi all,
>>>>> >
>>>>> > Triggers are one of the more novel aspects of Beam's support for
>>>>> > unbounded data. They are also one of the most challenging aspects
>>>>> > of the model.
>>>>> >
>>>>> > Ben & I have been working on a major new idea for how triggers
>>>>> > could work in the Beam model. We think it will make triggers much
>>>>> > more usable, create new opportunities for no-knobs
>>>>> > execution/optimization, and improve compatibility with DSLs like
>>>>> > SQL (and eliminate a whole class of bugs).
>>>>> >
>>>>> > Triggering is for sinks!
>>>>> >
>>>>> > https://s.apache.org/beam-sink-triggers
>>>>> >
>>>>> > Please take a look at this "1"-pager and give feedback.
>>>>> >
>>>>> > Kenn
>>>>>
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> [email protected]
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
