On Tue, Dec 5, 2017 at 12:00 PM, Kenneth Knowles <[email protected]> wrote:

> On Tue, Dec 5, 2017 at 11:32 AM, Eugene Kirpichov <[email protected]>
> wrote:
>
>> Btw do you think windowing also should be separate from a PCollection?
>> I'm not sure whether it should be considered operational or not.
>>
>
> Interesting question. I think that windowing is more coupled to a
> PCollection, but I still mull this over once in a while. Here are some
> fairly unedited thoughts:
>
> Ignoring merging, one perspective is that the window is just a key with a
> deadline.
>

That is only true when performing an aggregation. Records can be associated
with a window, and do not require keys at that point. The "deadline" only
comes into play when something like a GBK is applied.
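
A minimal sketch of what I mean, using the standard Beam Java SDK (the
element values and the key are just placeholders):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class WindowThenGroup {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Windowing an unkeyed PCollection: each element is merely associated
        // with a window; nothing is grouped and nothing expires yet.
        PCollection<String> windowed =
            p.apply(Create.of("a", "b", "c"))
             .apply(Window.<String>into(
                 FixedWindows.of(Duration.standardMinutes(1))));

        // Only once a key is attached and a GroupByKey runs does "key + window"
        // become a group whose watermark-driven deadline lets it expire.
        PCollection<KV<String, Iterable<String>>> grouped =
            windowed
                .apply(WithKeys.<String, String>of("some-key"))
                .apply(GroupByKey.<String, String>create());

        p.run().waitUntilFinish();
      }
    }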

> From this perspective, the distinction between key and window is not
> important; you could just say that GBK requires the composite key for a
> group to eventually expire (in SQL terms, you just need one of the GROUP BY
> arguments to provide the deadline, and they are otherwise all on equal
> footing). And so the window is just as much a part of the data as the key.
> Without merging, once it is assigned you don't need to keep around the
> WindowFn or any such. Of course, our way of automatically propagating
> windows from inputs to outputs, akin to making MapValues the default mode
> of computation, requires the window to be a distinguished secondary key.
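
To make the "MapValues as the default mode of computation" point concrete,
here is a small fragment in the same vein as the sketch above (it reuses
`windowed` from there; MapElements and TypeDescriptors are the standard Beam
Java classes):

    // The element-wise transform never mentions windows, yet the output
    // PCollection carries the same one-minute FixedWindows strategy: the
    // window rides along with each element as an implicit secondary key.
    PCollection<Integer> lengths =
        windowed.apply(
            MapElements.into(TypeDescriptors.integers())
                .via((String s) -> s.length()));
    // lengths.getWindowingStrategy() reports the same windowing as `windowed`.
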
>
> Another way I think about it is that the windowing + watermark + allowed
> lateness defines which elements are a part of a PCollection and which are
> not. Dropped data semantically never existed in the first place. This was
> actually independent of windowing before the "window expiration" model of
> dropping data. I still think window expiration + GC + dropping go together
> nicely, and drop less data needlessly, but just dropping data behind the
> watermark + allowed lateness has some appeal for isolating the operational
> aspect here.
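
For reference, the knob surface in question as it exists today in the Beam
Java SDK (the trigger choice and the five-minute lateness are just
placeholders; `input` stands in for any PCollection<String>):

    // Allowed lateness is declared alongside the windowing itself, so
    // "window + watermark + allowed lateness" travel together.
    PCollection<String> windowedWithLateness =
        input.apply(
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                .triggering(AfterWatermark.pastEndOfWindow())
                .withAllowedLateness(Duration.standardMinutes(5))
                .discardingFiredPanes());
    // An element arriving after the watermark has passed the end of its
    // window by more than the allowed lateness is dropped: semantically it
    // was never part of the collection.
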
>
> Operationally, you might take the view that the act of expiration and
> dropping all remaining data is a configuration on the GBK. Then the
> WindowingStrategy, like windows and KV, is a plumbing device to reach a GBK
> that may be deep in a composite (which is certainly true anyhow). I don't
> really like this, because I would like the output of a GBK to be a
> straightforward function of its input - in the unbounded case I would like
> it to be specified as having to agree with the bounded spec for any finite
> prefix. I'm not sure if an operational view is amenable to this. If they
> both work, then being able to switch perspectives back and forth would be
> cool.
>
> I think there are some inconsistencies in the above intuitions, and then
> there's merging...
>
> Kenn
>
>
> Also, I think anyone reading this document really ought to at least skim
>> the (linked from there) http://s.apache.org/beam-streams-tables and
>> internalize the idea of "PCollections as changelogs, aggregations as tables
>> on which the changelog acts". It probably would be good to rewrite our
>> documentation with this in mind: even with my experience on the Beam team,
>> this simple idea made it much easier for me to think clearly about all the
>> concepts.
>>
>> I'm very excited about both of these ideas, I think they rival in
>> importance the idea of batch/streaming unification and will end up being a
>> fundamental part of the future of the Beam model.
>>
>> On Thu, Nov 30, 2017 at 8:52 PM Jean-Baptiste Onofré <[email protected]>
>> wrote:
>>
>>> Hi Kenn,
>>>
>>> very interesting idea. It sounds more usable and "logical".
>>>
>>> Regards
>>> JB
>>>
>>> On 11/30/2017 09:06 PM, Kenneth Knowles wrote:
>>> > Hi all,
>>> >
>>> > Triggers are one of the more novel aspects of Beam's support for
>>> unbounded data.
>>> > They are also one of the most challenging aspects of the model.
>>> >
>>> > Ben & I have been working on a major new idea for how triggers could
>>> work in the
>>> > Beam model. We think it will make triggers much more usable, create new
>>> > opportunities for no-knobs execution/optimization, and improve
>>> compatibility
>>> > with DSLs like SQL. (also eliminate a whole class of bugs)
>>> >
>>> > Triggering is for sinks!
>>> >
>>> > https://s.apache.org/beam-sink-triggers
>>> >
>>> > Please take a look at this "1"-pager and give feedback.
>>> >
>>> > Kenn
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> [email protected]
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>
