On Wed, Dec 6, 2017 at 9:45 PM, Reuven Lax <[email protected]> wrote:

>>>> Ignoring merging, one perspective is that the window is just a key
>>>> with a deadline.
>>>
>>> That is only true when performing an aggregation. Records can be
>>> associated with a window, and do not require keys at that point. The
>>> "deadline" only applies when something like a GBK is assigned.
>>
>> Yea, that situation -- windows assigned but no aggregation yet -- is
>> analogous to data being a KV prior to the GBK. The main function that
>> windows actually serve in the life of data processing is to allow
>> aggregations over unbounded data with bounded resources. Only
>> aggregation really needs them - if you just have a pass-through
>> sequence of ParDos, windows don't really do anything.
>
> I disagree. There are multiple instances where windowing is used
> without an aggregation after. Fundamentally, windowing is a function
> on elements. This function is used during aggregations to bound them,
> but it makes sense on its own. Thinking of windowing as a "timeout"
> makes for an intuitive model, but I don't think it's really the right
> model. For one thing, that intuitive model makes less sense in batch.
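To make the two readings concrete, a minimal Beam Java sketch (Window,
FixedWindows, and GroupByKey are the standard SDK API; `events` is a
hypothetical input PCollection<KV<String, Long>>):

    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Windows are assigned here, but nothing expires yet: a
    // window-oblivious, pass-through ParDo downstream would see
    // exactly the same elements either way.
    PCollection<KV<String, Long>> windowed = events.apply(
        Window.<KV<String, Long>>into(
            FixedWindows.of(Duration.standardMinutes(1))));

    // Only at the GBK does the window act like part of the key:
    // grouping is effectively by (key, window), and the window's end
    // plus allowed lateness supplies the "deadline" for the group.
    PCollection<KV<String, Iterable<Long>>> grouped =
        windowed.apply(GroupByKey.<String, Long>create());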
What are the instances where windowing is used without an aggregation?

Kenn

>> Kenn
>>
>>>> From this perspective, the distinction between key and window is
>>>> not important; you could just say that GBK requires the composite
>>>> key for a group to eventually expire (in SQL terms, you just need
>>>> one of the GROUP BY arguments to provide the deadline, and they are
>>>> otherwise all on equal footing). And so the window is just as much
>>>> a part of the data as the key. Without merging, once it is assigned
>>>> you don't need to keep around the WindowFn or any such. Of course,
>>>> our way of automatically propagating windows from inputs to
>>>> outputs, akin to making MapValues the default mode of computation,
>>>> requires the window to be a distinguished secondary key.
>>>>
>>>> Another way I think about it is that the windowing + watermark +
>>>> allowed lateness defines which elements are a part of a PCollection
>>>> and which are not. Dropped data semantically never existed in the
>>>> first place. This was actually independent of windowing before the
>>>> "window expiration" model of dropping data. I still think window
>>>> expiration + GC + dropping go together nicely, and drop less data
>>>> needlessly, but just dropping data behind the watermark + allowed
>>>> lateness has some appeal for isolating the operational aspect here.
>>>>
>>>> Operationally, you might take the view that the act of expiration
>>>> and dropping all remaining data is a configuration on the GBK. Then
>>>> the WindowingStrategy, like windows and KV, is a plumbing device to
>>>> reach a GBK that may be deep in a composite (which is certainly
>>>> true anyhow). I don't really like this, because I would like the
>>>> output of a GBK to be a straightforward function of its input - in
>>>> the unbounded case I would like it to be specified as having to
>>>> agree with the bounded spec for any finite prefix. I'm not sure if
>>>> an operational view is amenable to this. If they both work, then
>>>> being able to switch perspectives back and forth would be cool.
>>>>
>>>> I think there are some inconsistencies in the above intuitions, and
>>>> then there's merging...
>>>>
>>>> Kenn
>>>>
>>>>> Also, I think anyone reading this document really ought to at
>>>>> least skim the (linked from there)
>>>>> http://s.apache.org/beam-streams-tables and internalize the idea
>>>>> of "PCollections as changelogs, aggregations as tables on which
>>>>> the changelog acts". It probably would be good to rewrite our
>>>>> documentation with this in mind: even with my experience on the
>>>>> Beam team, this simple idea made it much easier for me to think
>>>>> clearly about all the concepts.
>>>>>
>>>>> I'm very excited about both of these ideas; I think they rival in
>>>>> importance the idea of batch/streaming unification and will end up
>>>>> being a fundamental part of the future of the Beam model.
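The streams-and-tables reading is easy to see in code. A minimal
sketch, reusing `windowed` from the sketch upthread (Combine.perKey and
Sum.ofLongs are the standard SDK API):

    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // The input PCollection is a changelog; the aggregation is a
    // logical (key -> sum) table that the changelog acts on. Each pane
    // the aggregation emits is an update to one row of that table:
    // repeatedly with early firings, once per window with the default
    // trigger.
    PCollection<KV<String, Long>> sums =
        windowed.apply(Combine.perKey(Sum.ofLongs()));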
>>>>> On Thu, Nov 30, 2017 at 8:52 PM Jean-Baptiste Onofré
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Kenn,
>>>>>>
>>>>>> very interesting idea. It sounds more usable and "logical".
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On 11/30/2017 09:06 PM, Kenneth Knowles wrote:
>>>>>> > Hi all,
>>>>>> >
>>>>>> > Triggers are one of the more novel aspects of Beam's support
>>>>>> > for unbounded data. They are also one of the most challenging
>>>>>> > aspects of the model.
>>>>>> >
>>>>>> > Ben & I have been working on a major new idea for how triggers
>>>>>> > could work in the Beam model. We think it will make triggers
>>>>>> > much more usable, create new opportunities for no-knobs
>>>>>> > execution/optimization, and improve compatibility with DSLs
>>>>>> > like SQL. (It also eliminates a whole class of bugs.)
>>>>>> >
>>>>>> > Triggering is for sinks!
>>>>>> >
>>>>>> > https://s.apache.org/beam-sink-triggers
>>>>>> >
>>>>>> > Please take a look at this "1"-pager and give feedback.
>>>>>> >
>>>>>> > Kenn
>>>>>>
>>>>>> --
>>>>>> Jean-Baptiste Onofré
>>>>>> [email protected]
>>>>>> http://blog.nanthrax.net
>>>>>> Talend - http://www.talend.com
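For readers without Beam trigger background, a sketch of the status quo
the doc is rethinking: today the trigger is specified on the Window
transform, as part of the WindowingStrategy, far from the sink that
actually cares about output latency (all standard Beam Java; `events`
is the hypothetical input from the sketches above):

    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Current model: the trigger travels with the windowing, set where
    // windows are assigned. Sink triggers would move this knob to the
    // sink instead.
    PCollection<KV<String, Long>> triggered = events.apply(
        Window.<KV<String, Long>>into(
                FixedWindows.of(Duration.standardMinutes(1)))
            // Speculative pane 30s after the first element in a pane,
            // then a final pane when the watermark passes the window
            // end.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30))))
            .withAllowedLateness(Duration.standardMinutes(10))
            .accumulatingFiredPanes());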
