On Tue, Dec 5, 2017 at 11:32 AM, Eugene Kirpichov <[email protected]> wrote:
> Btw do you think windowing also should be separate from a PCollection? I'm > not sure whether it should be considered operational or not. > Interesting question. I think that windowing is more coupled to a PCollection, but I still mull this over once in a while. Here are some fairly unedited thoughts: Ignoring merging, one perspective is that the window is just a key with a deadline. From this perspective, the distinction between key and window is not important; you could just say that GBK requires the composite key for a group to eventually expire (in SQL terms, you just need one of the GROUP BY arguments to provide the deadline, and they are otherwise all on equal footing). And so the window is just as much a part of the data as the key. Without merging, once it is assigned you don't need to keep around the WindowFn or any such. Of course, our way of automatically propagating windows from inputs to outputs, akin to making MapValues the default mode of computation, requires the window to be a distinguished secondary key. Another way I think about it is that the windowing + watermark + allowed lateness defines which elements are a part of a PCollection and which are not. Dropped data semantically never existed in the first place. This was actually independent of windowing before the "window expiration" model of dropping data. I still think window expiration + GC + dropping go together nicely, and drop less data needlessly, but just dropping data behind the watermark + allowed lateness has some appeal for isolating the operational aspect here. Operationally, you might take the view that the act of expiration and dropping all remaining data is a configuration on the GBK. Then the WindowingStrategy, like windows and KV, are plumbing devices to reach a GBK that may be deep in a composite (which is certainly true anyhow). I don't really like this, because I would like the output of a GBK to be a straightforward function of its input - in the unbounded case I would like to be specified as having to agree with the bounded spec for any finite prefix. I'm not sure if an operational view is amenable to this. If they both work, then being able to switch perspectives back and forth would be cool. I think there are some inconsistencies in the above intuitions, and then there's merging... Kenn Also, I think anyone reading this document really ought to at least skim > the (linked from there) http://s.apache.org/beam-streams-tables and > internalize the idea of "PCollections as changelogs, aggregations as tables > on which the changelog acts". It probably would be good to rewrite our > documentation with this in mind: even with my experience on the Beam team, > this simple idea made it much easier for me to think clearly about all the > concepts. > > I'm very excited about both of these ideas, I think they rival in > importance the idea of batch/streaming unification and will end up being a > fundamental part of the future of Beam model. > > On Thu, Nov 30, 2017 at 8:52 PM Jean-Baptiste Onofré <[email protected]> > wrote: > >> Hi Kenn, >> >> very interesting idea. It sounds more usable and "logic". >> >> Regards >> JB >> >> On 11/30/2017 09:06 PM, Kenneth Knowles wrote: >> > Hi all, >> > >> > Triggers are one of the more novel aspects of Beam's support for >> unbounded data. >> > They are also one of the most challenging aspects of the model. >> > >> > Ben & I have been working on a major new idea for how triggers could >> work in the >> > Beam model. We think it will make triggers much more usable, create new >> > opportunities for no-knobs execution/optimization, and improve >> compatibility >> > with DSLs like SQL. (also eliminate a whole class of bugs) >> > >> > Triggering is for sinks! >> > >> > https://s.apache.org/beam-sink-triggers >> > >> > Please take a look at this "1"-pager and give feedback. >> > >> > Kenn >> >> -- >> Jean-Baptiste Onofré >> [email protected] >> http://blog.nanthrax.net >> Talend - http://www.talend.com >> >
