On Tue, Dec 5, 2017 at 2:16 PM, Kenneth Knowles <[email protected]> wrote:
>
> On Tue, Dec 5, 2017 at 1:50 PM, Reuven Lax <[email protected]> wrote:
>>
>> On Tue, Dec 5, 2017 at 12:00 PM, Kenneth Knowles <[email protected]>
>> wrote:
>>>
>>> On Tue, Dec 5, 2017 at 11:32 AM, Eugene Kirpichov <[email protected]>
>>> wrote:
>>>>
>>>> Btw, do you think windowing should also be separate from a PCollection?
>>>> I'm not sure whether it should be considered operational or not.
>>>
>>> Interesting question. I think that windowing is more coupled to a
>>> PCollection, but I still mull this over once in a while. Here are some
>>> fairly unedited thoughts:
>>>
>>> Ignoring merging, one perspective is that the window is just a key with
>>> a deadline.
>>
>> That is only true when performing an aggregation. Records can be
>> associated with a window and do not require keys at that point. The
>> "deadline" only applies when something like a GBK is applied.
>
> Yea, that situation -- windows assigned but no aggregation yet -- is
> analogous to data being a KV prior to the GBK. The main function that
> windows actually serve in the life of data processing is to allow
> aggregations over unbounded data with bounded resources. Only aggregation
> really needs them - if you just have a pass-through sequence of ParDos,
> windows don't really do anything.

I disagree. There are multiple instances where windowing is used without an
aggregation afterwards. Fundamentally, windowing is a function on elements.
This function is used during aggregations to bound them, but it makes sense
on its own.

Thinking of windowing as a "timeout" makes for an intuitive model, but I
don't think it's really the right model. For one thing, that intuitive
model makes less sense in batch.
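To make that concrete, here is a minimal sketch of the kind of pipeline I
mean, using the Beam Java SDK (class name and example data are illustrative
only): windows are assigned up front, and a plain DoFn then observes the
window on each element, with no GroupByKey anywhere downstream.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class WindowingWithoutAggregation {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.timestamped(
            TimestampedValue.of("a", new Instant(0L)),
            TimestampedValue.of("b", new Instant(30_000L)),
            TimestampedValue.of("c", new Instant(90_000L))))
        // Window assignment is a pure function from (element, timestamp)
        // to window(s). No key and no GroupByKey are involved.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        // A plain DoFn can observe the assigned window directly, e.g. to
        // route each element to a per-window destination. Still no
        // aggregation anywhere in the pipeline.
        .apply(ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c, BoundedWindow window) {
            c.output(window + ": " + c.element());
          }
        }));

    p.run().waitUntilFinish();
  }
}

Until something like a GBK comes along and uses it as part of the grouping
key, the window is just more data riding on the element.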
> Kenn
>
>>> From this perspective, the distinction between key and window is not
>>> important; you could just say that GBK requires the composite key for a
>>> group to eventually expire (in SQL terms, you just need one of the GROUP
>>> BY arguments to provide the deadline, and they are otherwise all on equal
>>> footing). And so the window is just as much a part of the data as the
>>> key. Without merging, once it is assigned you don't need to keep around
>>> the WindowFn or any such. Of course, our way of automatically propagating
>>> windows from inputs to outputs, akin to making MapValues the default mode
>>> of computation, requires the window to be a distinguished secondary key.
>>>
>>> Another way I think about it is that the windowing + watermark + allowed
>>> lateness defines which elements are a part of a PCollection and which are
>>> not. Dropped data semantically never existed in the first place. This was
>>> actually independent of windowing before the "window expiration" model of
>>> dropping data. I still think window expiration + GC + dropping go
>>> together nicely, and drop less data needlessly, but just dropping data
>>> behind the watermark + allowed lateness has some appeal for isolating the
>>> operational aspect here.
>>>
>>> Operationally, you might take the view that the act of expiration and
>>> dropping all remaining data is a configuration on the GBK. Then the
>>> WindowingStrategy, like windows and KV, is a plumbing device to reach a
>>> GBK that may be deep in a composite (which is certainly true anyhow). I
>>> don't really like this, because I would like the output of a GBK to be a
>>> straightforward function of its input - in the unbounded case I would
>>> like it to be specified as having to agree with the bounded spec for any
>>> finite prefix. I'm not sure if an operational view is amenable to this.
>>> If they both work, then being able to switch perspectives back and forth
>>> would be cool.
>>>
>>> I think there are some inconsistencies in the above intuitions, and then
>>> there's merging...
>>>
>>> Kenn
>>>
>>>> Also, I think anyone reading this document really ought to at least
>>>> skim the (linked from there) http://s.apache.org/beam-streams-tables
>>>> and internalize the idea of "PCollections as changelogs, aggregations
>>>> as tables on which the changelog acts". It probably would be good to
>>>> rewrite our documentation with this in mind: even with my experience on
>>>> the Beam team, this simple idea made it much easier for me to think
>>>> clearly about all the concepts.
>>>>
>>>> I'm very excited about both of these ideas; I think they rival in
>>>> importance the idea of batch/streaming unification and will end up
>>>> being a fundamental part of the future of the Beam model.
>>>>
>>>> On Thu, Nov 30, 2017 at 8:52 PM Jean-Baptiste Onofré <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Kenn,
>>>>>
>>>>> Very interesting idea. It sounds more usable and "logic".
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 11/30/2017 09:06 PM, Kenneth Knowles wrote:
>>>>> > Hi all,
>>>>> >
>>>>> > Triggers are one of the more novel aspects of Beam's support for
>>>>> > unbounded data. They are also one of the most challenging aspects
>>>>> > of the model.
>>>>> >
>>>>> > Ben & I have been working on a major new idea for how triggers
>>>>> > could work in the Beam model. We think it will make triggers much
>>>>> > more usable, create new opportunities for no-knobs
>>>>> > execution/optimization, and improve compatibility with DSLs like
>>>>> > SQL (and eliminate a whole class of bugs).
>>>>> >
>>>>> > Triggering is for sinks!
>>>>> >
>>>>> > https://s.apache.org/beam-sink-triggers
>>>>> >
>>>>> > Please take a look at this "1"-pager and give feedback.
>>>>> >
>>>>> > Kenn
>>>>>
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> [email protected]
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
