On Tue, Dec 5, 2017 at 1:50 PM, Reuven Lax <[email protected]> wrote:

> On Tue, Dec 5, 2017 at 12:00 PM, Kenneth Knowles <[email protected]> wrote:
>
>> On Tue, Dec 5, 2017 at 11:32 AM, Eugene Kirpichov <[email protected]>
>> wrote:
>>
>>> Btw, do you think windowing should also be separate from a PCollection?
>>> I'm not sure whether it should be considered operational or not.
>>>
>>
>> Interesting question. I think that windowing is more closely coupled to a
>> PCollection, but I still mull this over once in a while. Here are some
>> fairly unedited thoughts:
>>
>> Ignoring merging, one perspective is that the window is just a key with a
>> deadline.
>>
>
> That is only true when performing an aggregation. Records can be
> associated with a window, and do not require keys at that point. The
> "deadline" only applies when something like a GBK is applied.
>
Yea, that situation -- windows assigned but no aggregation yet -- is
analogous to data being a KV prior to the GBK. The main function that
windows serve in data processing is to allow aggregation over unbounded
data with bounded resources. Only aggregation really needs them: if you
just have a pass-through sequence of ParDos, windows don't really do
anything.

Kenn

>> From this perspective, the distinction between key and window is not
>> important; you could just say that GBK requires the composite key for a
>> group to eventually expire (in SQL terms, you just need one of the
>> GROUP BY arguments to provide the deadline, and they are otherwise all
>> on equal footing). And so the window is just as much a part of the data
>> as the key. Without merging, once it is assigned you don't need to keep
>> around the WindowFn or anything like it. Of course, our way of
>> automatically propagating windows from inputs to outputs, akin to making
>> MapValues the default mode of computation, requires the window to be a
>> distinguished secondary key.
>>
>> Another way I think about it is that the windowing + watermark + allowed
>> lateness defines which elements are part of a PCollection and which are
>> not. Dropped data semantically never existed in the first place. This
>> was actually independent of windowing before the "window expiration"
>> model of dropping data. I still think window expiration + GC + dropping
>> go together nicely, and drop less data needlessly, but just dropping
>> data behind the watermark + allowed lateness has some appeal for
>> isolating the operational aspect here.
>>
>> Operationally, you might take the view that the act of expiration and
>> dropping all remaining data is a configuration on the GBK. Then the
>> WindowingStrategy, like windows and KV, is a plumbing device to reach a
>> GBK that may be deep in a composite (which is certainly true anyhow).
>> I don't really like this, because I would like the output of a GBK to be
>> a straightforward function of its input -- in the unbounded case I would
>> like it to be specified as having to agree with the bounded spec for any
>> finite prefix. I'm not sure if an operational view is amenable to this.
>> If they both work, then being able to switch perspectives back and forth
>> would be cool.
>>
>> I think there are some inconsistencies in the above intuitions, and then
>> there's merging...
>>
>> Kenn
>>
>>> Also, I think anyone reading this document really ought to at least
>>> skim the (linked from there) http://s.apache.org/beam-streams-tables
>>> and internalize the idea of "PCollections as changelogs, aggregations
>>> as tables on which the changelog acts". It would probably be good to
>>> rewrite our documentation with this in mind: even with my experience on
>>> the Beam team, this simple idea made it much easier for me to think
>>> clearly about all the concepts.
>>>
>>> I'm very excited about both of these ideas; I think they rival in
>>> importance the idea of batch/streaming unification and will end up
>>> being a fundamental part of the future of the Beam model.
>>>
>>> On Thu, Nov 30, 2017 at 8:52 PM Jean-Baptiste Onofré <[email protected]>
>>> wrote:
>>>
>>>> Hi Kenn,
>>>>
>>>> very interesting idea. It sounds more usable and "logical".
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 11/30/2017 09:06 PM, Kenneth Knowles wrote:
>>>> > Hi all,
>>>> >
>>>> > Triggers are one of the more novel aspects of Beam's support for
>>>> > unbounded data. They are also one of the most challenging aspects
>>>> > of the model.
>>>> >
>>>> > Ben & I have been working on a major new idea for how triggers
>>>> > could work in the Beam model. We think it will make triggers much
>>>> > more usable, create new opportunities for no-knobs
>>>> > execution/optimization, and improve compatibility with DSLs like
>>>> > SQL.
>>>> > (also eliminate a whole class of bugs)
>>>> >
>>>> > Triggering is for sinks!
>>>> >
>>>> > https://s.apache.org/beam-sink-triggers
>>>> >
>>>> > Please take a look at this "1"-pager and give feedback.
>>>> >
>>>> > Kenn
>>>>
>>>> --
>>>> Jean-Baptiste Onofré
>>>> [email protected]
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
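[Editor's note] Kenn's "window is just a key with a deadline" view in the
thread above can be illustrated with a toy simulation. This is plain
Python, not Beam's API; the function name, the record layout, and the
lateness bound are all illustrative assumptions. It groups records by the
composite (key, window) and drops any record whose window's deadline has
passed the watermark plus allowed lateness, as if it never existed:

```python
# Toy model (NOT Beam code): GBK over the composite (key, window_end),
# with elements behind watermark + allowed lateness dropped entirely.
from collections import defaultdict

ALLOWED_LATENESS = 5  # hypothetical lateness bound, in event-time units


def group_by_key_and_window(elements, watermark,
                            allowed_lateness=ALLOWED_LATENESS):
    """Group (key, window_end, value) records by (key, window_end).

    A record is dropped when the watermark has advanced past
    window_end + allowed_lateness -- semantically, it never existed.
    """
    groups = defaultdict(list)
    for key, window_end, value in elements:
        if watermark > window_end + allowed_lateness:
            continue  # expired window: drop the record
        groups[(key, window_end)].append(value)
    return dict(groups)


elements = [
    ("a", 10, 1),   # window ending at t=10
    ("a", 10, 2),
    ("b", 10, 3),
    ("a", 20, 4),   # window ending at t=20
    ("a", 2, 99),   # expired: 2 + 5 < 12, dropped below
]
print(group_by_key_and_window(elements, watermark=12))
# {('a', 10): [1, 2], ('b', 10): [3], ('a', 20): [4]}
```

Note how the window behaves as just a secondary grouping key, except that
it also carries the expiry deadline, which matches the point that without
merging you no longer need the WindowFn once windows are assigned.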

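[Editor's note] The "PCollections as changelogs, aggregations as tables"
idea linked earlier in the thread (http://s.apache.org/beam-streams-tables)
can also be sketched in a few lines. Again this is plain Python, not the
Beam API, and the names are illustrative: a stream of (key, delta) updates
is replayed onto a table, so the aggregation is just the table the
changelog acts on.

```python
# Toy model (NOT Beam code) of the streams-and-tables duality:
# a changelog stream folded into a table of per-key aggregates.
from collections import defaultdict


def apply_changelog(changelog):
    """Fold a changelog of (key, delta) updates into a table (dict)."""
    table = defaultdict(int)
    for key, delta in changelog:
        table[key] += delta  # each stream element updates one table row
    return dict(table)


stream = [("a", 1), ("b", 2), ("a", 3)]
print(apply_changelog(stream))
# {'a': 4, 'b': 2}
```

Reading the table at any point gives the aggregation so far; emitting the
table's changes back out would turn it into a stream again, which is the
round trip the streams-and-tables write-up describes.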