That should have read our dynamic sinks.
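As a hedged illustration of the dynamic-sink pattern Reuven describes below
(mapping each element to a destination by simply examining its window, with
no grouping), assuming the Beam Java SDK -- the class name and directory
scheme here are hypothetical, not the actual sink code:

// Sketch only: route each element to a destination derived from its
// window, without ever grouping by that window.
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.values.KV;

class AssignDestinationFn extends DoFn<String, KV<String, String>> {
  @ProcessElement
  public void processElement(ProcessContext c, BoundedWindow window) {
    // The window travels with the element; we only read it here.
    String destination =
        (window instanceof IntervalWindow)
            ? "output/" + ((IntervalWindow) window).start()  // per-window directory
            : "output/global";
    c.output(KV.of(destination, c.element()));
  }
}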
On Wed, Dec 6, 2017 at 10:05 PM, Reuven Lax <[email protected]> wrote:

> On Wed, Dec 6, 2017 at 9:53 PM, Kenneth Knowles <[email protected]> wrote:
>
>> On Wed, Dec 6, 2017 at 9:45 PM, Reuven Lax <[email protected]> wrote:
>>
>>>>>> Ignoring merging, one perspective is that the window is just a key
>>>>>> with a deadline.
>>>>>
>>>>> That is only true when performing an aggregation. Records can be
>>>>> associated with a window and do not require keys at that point. The
>>>>> "deadline" only applies when something like a GBK is assigned.
>>>>
>>>> Yeah, that situation -- windows assigned but no aggregation yet -- is
>>>> analogous to data being a KV prior to the GBK. The main function that
>>>> windows actually serve in the life of data processing is to allow
>>>> aggregations over unbounded data with bounded resources. Only
>>>> aggregation really needs them: if you just have a pass-through
>>>> sequence of ParDos, windows don't really do anything.
>>>
>>> I disagree. There are multiple instances where windowing is used
>>> without an aggregation after it. Fundamentally, windowing is a
>>> function on elements. This function is used during aggregations to
>>> bound them, but it makes sense on its own. Thinking of windowing as a
>>> "timeout" makes for an intuitive model, but I don't think it's really
>>> the right model. For one thing, that intuitive model makes less sense
>>> in batch.
>>
>> What are the instances where windowing is used without an aggregation?
>
> One example is in our destination sinks. Mapping to a destination is
> usually done by simply examining the window on an element, and these
> sinks generally do not group by that window.
>
>> Kenn
>>
>>>> Kenn
>>>>
>>>>>> From this perspective, the distinction between key and window is
>>>>>> not important; you could just say that GBK requires the composite
>>>>>> key for a group to eventually expire (in SQL terms, you just need
>>>>>> one of the GROUP BY arguments to provide the deadline, and they are
>>>>>> otherwise all on equal footing). And so the window is just as much
>>>>>> a part of the data as the key. Without merging, once it is assigned
>>>>>> you don't need to keep around the WindowFn or any such. Of course,
>>>>>> our way of automatically propagating windows from inputs to
>>>>>> outputs, akin to making MapValues the default mode of computation,
>>>>>> requires the window to be a distinguished secondary key.
>>>>>>
>>>>>> Another way I think about it is that the windowing + watermark +
>>>>>> allowed lateness defines which elements are part of a PCollection
>>>>>> and which are not. Dropped data semantically never existed in the
>>>>>> first place. This was actually independent of windowing before the
>>>>>> "window expiration" model of dropping data. I still think window
>>>>>> expiration + GC + dropping go together nicely, and drop less data
>>>>>> needlessly, but just dropping data behind the watermark + allowed
>>>>>> lateness has some appeal for isolating the operational aspect here.
>>>>>>
>>>>>> Operationally, you might take the view that the act of expiration
>>>>>> and dropping all remaining data is a configuration on the GBK. Then
>>>>>> the WindowingStrategy, like windows and KV, is a plumbing device to
>>>>>> reach a GBK that may be deep in a composite (which is certainly
>>>>>> true anyhow). I don't really like this, because I would like the
>>>>>> output of a GBK to be a straightforward function of its input -- in
>>>>>> the unbounded case I would like it to be specified as having to
>>>>>> agree with the bounded spec for any finite prefix. I'm not sure an
>>>>>> operational view is amenable to this. If they both work, then being
>>>>>> able to switch perspectives back and forth would be cool.
>>>>>>
>>>>>> I think there are some inconsistencies in the above intuitions, and
>>>>>> then there's merging...
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>>> Also, I think anyone reading this document really ought to at
>>>>>>> least skim http://s.apache.org/beam-streams-tables (linked from
>>>>>>> there) and internalize the idea of "PCollections as changelogs,
>>>>>>> aggregations as tables on which the changelog acts". It probably
>>>>>>> would be good to rewrite our documentation with this in mind: even
>>>>>>> with my experience on the Beam team, this simple idea made it much
>>>>>>> easier for me to think clearly about all the concepts.
>>>>>>>
>>>>>>> I'm very excited about both of these ideas; I think they rival in
>>>>>>> importance the idea of batch/streaming unification and will end up
>>>>>>> being a fundamental part of the future of the Beam model.
>>>>>>>
>>>>>>> On Thu, Nov 30, 2017 at 8:52 PM Jean-Baptiste Onofré
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Kenn,
>>>>>>>>
>>>>>>>> very interesting idea. It sounds more usable and "logical".
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> JB
>>>>>>>>
>>>>>>>> On 11/30/2017 09:06 PM, Kenneth Knowles wrote:
>>>>>>>> > Hi all,
>>>>>>>> >
>>>>>>>> > Triggers are one of the more novel aspects of Beam's support
>>>>>>>> > for unbounded data. They are also one of the most challenging
>>>>>>>> > aspects of the model.
>>>>>>>> >
>>>>>>>> > Ben & I have been working on a major new idea for how triggers
>>>>>>>> > could work in the Beam model. We think it will make triggers
>>>>>>>> > much more usable, create new opportunities for no-knobs
>>>>>>>> > execution/optimization, and improve compatibility with DSLs
>>>>>>>> > like SQL (and also eliminate a whole class of bugs).
>>>>>>>> >
>>>>>>>> > Triggering is for sinks!
>>>>>>>> >
>>>>>>>> > https://s.apache.org/beam-sink-triggers
>>>>>>>> >
>>>>>>>> > Please take a look at this "1"-pager and give feedback.
>>>>>>>> >
>>>>>>>> > Kenn
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jean-Baptiste Onofré
>>>>>>>> [email protected]
>>>>>>>> http://blog.nanthrax.net
>>>>>>>> Talend - http://www.talend.com
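Kenn's "window is just a key with a deadline" framing can be made concrete
with a small sketch, assuming the Beam Java SDK (the method name, key/value
types, and durations are made up for illustration): after Window.into, each
element carries its window the way a KV carries its key, and GroupByKey
effectively groups by the composite (key, window), with the window's end
plus allowed lateness acting as the deadline after which the group's state
can be reclaimed.

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

static PCollection<KV<String, Iterable<Long>>> groupPerKeyAndWindow(
    PCollection<KV<String, Long>> input) {
  // Assign a window to each element; it now rides along like a second key.
  PCollection<KV<String, Long>> windowed = input.apply(
      Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))
          .withAllowedLateness(Duration.standardMinutes(5)));

  // Conceptually a group-by on the composite key (user key, window). The
  // window end + allowed lateness is the deadline after which the group is
  // final and its state can be garbage-collected.
  return windowed.apply(GroupByKey.<String, Long>create());
}

Nothing here forces the window to be special: it behaves like a second
GROUP BY column whose value happens to come with an expiry time, which is
the SQL reading Kenn gives above.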

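For readers coming to the sink-triggers one-pager cold, a hedged sketch of
the status quo it reacts against, in today's Beam Java SDK: the trigger is
a property of the windowing strategy, declared far from any sink, and every
concrete choice below (types, durations, firings) is an arbitrary example
rather than a recommendation.

import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

static PCollection<KV<String, Long>> applyKnobHeavyTriggering(
    PCollection<KV<String, Long>> input) {
  // Knob-heavy triggering on the windowing strategy: early firings on
  // processing time, late firings per element, plus lateness and
  // accumulation-mode knobs -- all configured away from the sink that
  // actually cares about output freshness.
  return input.apply(
      Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))
          .triggering(
              AfterWatermark.pastEndOfWindow()
                  .withEarlyFirings(
                      AfterProcessingTime.pastFirstElementInPane()
                          .plusDelayOf(Duration.standardSeconds(30)))
                  .withLateFirings(AfterPane.elementCountAtLeast(1)))
          .withAllowedLateness(Duration.standardMinutes(5))
          .accumulatingFiredPanes());
}

The one-pager proposes moving this intent to the sink instead ("Triggering
is for sinks!"), which is what would open up the no-knobs execution and
optimization opportunities Kenn mentions.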