Re: [DISCUSS] [DOC] Triggering is for sinks!

Kenneth Knowles Tue, 05 Dec 2017 12:01:01 -0800

On Tue, Dec 5, 2017 at 11:32 AM, Eugene Kirpichov <[email protected]>
wrote:

> Btw do you think windowing also should be separate from a PCollection? I'm
> not sure whether it should be considered operational or not.
>

Interesting question. I think that windowing is more coupled to a
PCollection, but I still mull this over once in a while. Here are some
fairly unedited thoughts:

Ignoring merging, one perspective is that the window is just a key with a
deadline. From this perspective, the distinction between key and window is
not important; you could just say that GBK requires the composite key for a
group to eventually expire (in SQL terms, you just need one of the GROUP BY
arguments to provide the deadline, and they are otherwise all on equal
footing). And so the window is just as much a part of the data as the key.
Without merging, once it is assigned you don't need to keep around the
WindowFn or any such. Of course, our way of automatically propagating
windows from inputs to outputs, akin to making MapValues the default mode
of computation, requires the window to be a distinguished secondary key.

Another way I think about it is that the windowing + watermark + allowed
lateness defines which elements are a part of a PCollection and which are
not. Dropped data semantically never existed in the first place. This was
actually independent of windowing before the "window expiration" model of
dropping data. I still think window expiration + GC + dropping go together
nicely, and drop less data needlessly, but just dropping data behind the
watermark + allowed lateness has some appeal for isolating the operational
aspect here.

Operationally, you might take the view that the act of expiration and
dropping all remaining data is a configuration on the GBK. Then the
WindowingStrategy, like windows and KV, are plumbing devices to reach a GBK
that may be deep in a composite (which is certainly true anyhow). I don't
really like this, because I would like the output of a GBK to be a
straightforward function of its input - in the unbounded case I would like
to be specified as having to agree with the bounded spec for any finite
prefix. I'm not sure if an operational view is amenable to this. If they
both work, then being able to switch perspectives back and forth would be
cool.

I think there are some inconsistencies in the above intuitions, and then
there's merging...

Kenn

Also, I think anyone reading this document really ought to at least skim
> the (linked from there) http://s.apache.org/beam-streams-tables and
> internalize the idea of "PCollections as changelogs, aggregations as tables
> on which the changelog acts". It probably would be good to rewrite our
> documentation with this in mind: even with my experience on the Beam team,
> this simple idea made it much easier for me to think clearly about all the
> concepts.
>
> I'm very excited about both of these ideas, I think they rival in
> importance the idea of batch/streaming unification and will end up being a
> fundamental part of the future of Beam model.
>
> On Thu, Nov 30, 2017 at 8:52 PM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
>> Hi Kenn,
>>
>> very interesting idea. It sounds more usable and "logic".
>>
>> Regards
>> JB
>>
>> On 11/30/2017 09:06 PM, Kenneth Knowles wrote:
>> > Hi all,
>> >
>> > Triggers are one of the more novel aspects of Beam's support for
>> unbounded data.
>> > They are also one of the most challenging aspects of the model.
>> >
>> > Ben & I have been working on a major new idea for how triggers could
>> work in the
>> > Beam model. We think it will make triggers much more usable, create new
>> > opportunities for no-knobs execution/optimization, and improve
>> compatibility
>> > with DSLs like SQL. (also eliminate a whole class of bugs)
>> >
>> > Triggering is for sinks!
>> >
>> > https://s.apache.org/beam-sink-triggers
>> >
>> > Please take a look at this "1"-pager and give feedback.
>> >
>> > Kenn
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>

Re: [DISCUSS] [DOC] Triggering is for sinks!

Reply via email to