On Tue, Dec 5, 2017 at 12:00 PM, Kenneth Knowles <[email protected]> wrote:
> On Tue, Dec 5, 2017 at 11:32 AM, Eugene Kirpichov <[email protected]> wrote:
>
>> Btw, do you think windowing should also be separate from a PCollection? I'm not sure whether it should be considered operational or not.
>
> Interesting question. I think that windowing is more coupled to a PCollection, but I still mull this over once in a while. Here are some fairly unedited thoughts:
>
> Ignoring merging, one perspective is that the window is just a key with a deadline.

That is only true when performing an aggregation. Records can be associated with a window, and do not require keys at that point. The "deadline" only comes into play once something like a GBK is applied.
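
To make that concrete, here is a rough sketch against the Java SDK. The class name, input values, timestamps, and the constant key are made up purely for illustration:

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.transforms.GroupByKey;
  import org.apache.beam.sdk.transforms.WithKeys;
  import org.apache.beam.sdk.transforms.windowing.FixedWindows;
  import org.apache.beam.sdk.transforms.windowing.Window;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.TimestampedValue;
  import org.joda.time.Duration;
  import org.joda.time.Instant;

  public class WindowVsKeySketch {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

      // An unkeyed collection with explicit event timestamps.
      PCollection<Double> readings =
          p.apply(Create.timestamped(
              TimestampedValue.of(1.0, new Instant(0L)),
              TimestampedValue.of(2.0, new Instant(30000L)),
              TimestampedValue.of(3.0, new Instant(90000L))));

      // Windowing needs no keys: each element simply gets a window attached.
      // Nothing expires at this point.
      PCollection<Double> windowed =
          readings.apply(Window.<Double>into(FixedWindows.of(Duration.standardMinutes(1))));

      // The "deadline" only shows up once an aggregation does: at the GBK the
      // window acts like part of a composite (key, window) grouping key, and
      // end-of-window plus allowed lateness is when that group can expire.
      PCollection<KV<String, Iterable<Double>>> grouped =
          windowed
              .apply(WithKeys.<String, Double>of("all")) // constant key, purely illustrative
              .apply(GroupByKey.<String, Double>create());

      p.run().waitUntilFinish();
    }
  }

Up to the GBK the window is just metadata riding along on each element; the "key with a deadline" reading only kicks in once the grouping (and its expiration) appears.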

> From this perspective, the distinction between key and window is not important; you could just say that GBK requires the composite key for a group to eventually expire (in SQL terms, you just need one of the GROUP BY arguments to provide the deadline, and they are otherwise all on equal footing). And so the window is just as much a part of the data as the key. Without merging, once it is assigned, you don't need to keep around the WindowFn or any such. Of course, our way of automatically propagating windows from inputs to outputs, akin to making MapValues the default mode of computation, requires the window to be a distinguished secondary key.
>
> Another way I think about it is that the windowing + watermark + allowed lateness defines which elements are a part of a PCollection and which are not. Dropped data semantically never existed in the first place. This was actually independent of windowing before the "window expiration" model of dropping data. I still think window expiration + GC + dropping go together nicely, and drop less data needlessly, but just dropping data behind the watermark + allowed lateness has some appeal for isolating the operational aspect here.
>
> Operationally, you might take the view that the act of expiration and dropping all remaining data is a configuration on the GBK. Then the WindowingStrategy, like windows and KV, is a plumbing device to reach a GBK that may be deep in a composite (which is certainly true anyhow). I don't really like this, because I would like the output of a GBK to be a straightforward function of its input - in the unbounded case I would like it to be specified as having to agree with the bounded spec for any finite prefix. I'm not sure if an operational view is amenable to this. If they both work, then being able to switch perspectives back and forth would be cool.
>
> I think there are some inconsistencies in the above intuitions, and then there's merging...
>
> Kenn
>
>> Also, I think anyone reading this document really ought to at least skim the (linked from there) http://s.apache.org/beam-streams-tables and internalize the idea of "PCollections as changelogs, aggregations as tables on which the changelog acts". It probably would be good to rewrite our documentation with this in mind: even with my experience on the Beam team, this simple idea made it much easier for me to think clearly about all the concepts.
>>
>> I'm very excited about both of these ideas; I think they rival in importance the idea of batch/streaming unification and will end up being a fundamental part of the future of the Beam model.
>>
>> On Thu, Nov 30, 2017 at 8:52 PM Jean-Baptiste Onofré <[email protected]> wrote:
>>
>>> Hi Kenn,
>>>
>>> Very interesting idea. It sounds more usable and "logic".
>>>
>>> Regards
>>> JB
>>>
>>> On 11/30/2017 09:06 PM, Kenneth Knowles wrote:
>>> > Hi all,
>>> >
>>> > Triggers are one of the more novel aspects of Beam's support for unbounded data. They are also one of the most challenging aspects of the model.
>>> >
>>> > Ben & I have been working on a major new idea for how triggers could work in the Beam model. We think it will make triggers much more usable, create new opportunities for no-knobs execution/optimization, and improve compatibility with DSLs like SQL (and also eliminate a whole class of bugs).
>>> >
>>> > Triggering is for sinks!
>>> >
>>> > https://s.apache.org/beam-sink-triggers
>>> >
>>> > Please take a look at this "1"-pager and give feedback.
>>> >
>>> > Kenn
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> [email protected]
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
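
Coming back to Kenn's "windowing + watermark + allowed lateness" point above: those knobs all currently travel together on the Window transform, which is part of why it is tempting to read them as operational configuration rather than as part of the data. A minimal sketch of that configuration, with an arbitrary window size, lateness, and firing pattern (readings is the same kind of unkeyed collection as in the sketch further up):

  // Beyond the earlier imports, this also uses:
  //   org.apache.beam.sdk.transforms.windowing.AfterPane
  //   org.apache.beam.sdk.transforms.windowing.AfterWatermark

  // The WindowFn, the allowed lateness past the watermark (after which data
  // for the window is dropped / the window expires), and the firing and
  // accumulation behavior are all configured in one place today.
  PCollection<Double> configured =
      readings.apply(
          Window.<Double>into(FixedWindows.of(Duration.standardMinutes(1)))
              .triggering(
                  AfterWatermark.pastEndOfWindow()
                      .withLateFirings(AfterPane.elementCountAtLeast(1)))
              .withAllowedLateness(Duration.standardMinutes(10))
              .accumulatingFiredPanes());

Where that bundle ultimately belongs - on the collection, on the GBK, or at the sink - seems to be the crux of this thread.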
