That should have read "our dynamic sinks."

On Wed, Dec 6, 2017 at 10:05 PM, Reuven Lax <[email protected]> wrote:

>
>
> On Wed, Dec 6, 2017 at 9:53 PM, Kenneth Knowles <[email protected]> wrote:
>
>>
>>
>> On Wed, Dec 6, 2017 at 9:45 PM, Reuven Lax <[email protected]> wrote:
>>>
>>>>>> Ignoring merging, one perspective is that the window is just a key with
>>>>>> a deadline.
>>>>>>
>>>>>
>>>>> That is only true when performing an aggregation. Records can be
>>>>> associated with a window, and do not require keys at that point. The
>>>>> "deadline" only applies when something  like a GBK is assigned.
>>>>>
>>>>
>>>> Yea, that situation -- windows assigned but no aggregation yet -- is
>>>> analogous to data being a KV prior to the GBK. The main function that
>>>> windows actually serve in the life of data processing is to allow
>>>> aggregations over unbounded data with bounded resources. Only aggregation
>>>> really needs them; if you just have a pass-through sequence of ParDos,
>>>> windows don't really do anything.
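>>>>
>>>> As a sketch of what I mean by "pass-through" (a toy pipeline, invented
>>>> here just for illustration): the window rides along with each element
>>>> exactly like the V of a KV that has not yet reached a GBK, and the
>>>> ParDo below neither reads nor changes it.
>>>>
>>>>   import org.apache.beam.sdk.Pipeline;
>>>>   import org.apache.beam.sdk.transforms.Create;
>>>>   import org.apache.beam.sdk.transforms.MapElements;
>>>>   import org.apache.beam.sdk.transforms.windowing.FixedWindows;
>>>>   import org.apache.beam.sdk.transforms.windowing.Window;
>>>>   import org.apache.beam.sdk.values.PCollection;
>>>>   import org.apache.beam.sdk.values.TypeDescriptors;
>>>>   import org.joda.time.Duration;
>>>>
>>>>   public class PassThrough {
>>>>     public static void main(String[] args) {
>>>>       Pipeline p = Pipeline.create();
>>>>       PCollection<String> words = p.apply(Create.of("hello", "world"));
>>>>       // Assign fixed windows; each element now carries a window.
>>>>       PCollection<String> windowed = words.apply(
>>>>           Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));
>>>>       // Pass-through: the window is untouched metadata, like the V of
>>>>       // a KV that has not yet reached a GroupByKey.
>>>>       windowed.apply(MapElements.into(TypeDescriptors.strings())
>>>>           .via((String s) -> s.toUpperCase()));
>>>>       p.run();
>>>>     }
>>>>   }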
>>>>
>>>
>>> I disagree. There are multiple instances where windowing is used without
>>> an aggregation afterward. Fundamentally, windowing is a function on
>>> elements. This function is used during aggregations to bound them, but it
>>> makes sense on its own. Thinking of windowing as a "timeout" makes for an
>>> intuitive model, but I don't think it's really the right model. For one
>>> thing, that intuitive model makes less sense in batch.
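>>>
>>> To sketch "a function on elements" concretely (the DoFn below is
>>> invented for illustration): once assigned, the window is ordinary
>>> per-element data that any DoFn can read, no GBK or deadline in sight.
>>>
>>>   import org.apache.beam.sdk.transforms.DoFn;
>>>   import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
>>>
>>>   // Reads the element's window with no aggregation anywhere downstream.
>>>   class ReadWindowFn extends DoFn<String, String> {
>>>     @ProcessElement
>>>     public void process(ProcessContext c, BoundedWindow window) {
>>>       c.output(c.element() + " @ " + window.maxTimestamp());
>>>     }
>>>   }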
>>>
>>
>> What are the instances where windowing is used without an aggregation?
>>
>
> One example is in our destination sinks. Mapping to a destination is
> usually done by simply examining the window on an element, and these sinks
> generally do not group by that window.
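>
> Roughly this shape, as a sketch (not the actual sink code, and the
> destination naming is made up):
>
>   import org.apache.beam.sdk.transforms.DoFn;
>   import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
>   import org.apache.beam.sdk.values.KV;
>
>   // Derive a destination from the element's window; nothing downstream
>   // groups by that window.
>   class RouteByWindowFn extends DoFn<String, KV<String, String>> {
>     @ProcessElement
>     public void process(ProcessContext c, BoundedWindow window) {
>       String destination = "shard-ending-at-" + window.maxTimestamp();
>       c.output(KV.of(destination, c.element()));
>     }
>   }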
>
>
>> Kenn
>>
>>
>>
>>
>>>
>>>
>>>
>>>> Kenn
>>>>
>>>>
>>>>>> From this perspective, the distinction between key and window is not
>>>>>> important; you could just say that GBK requires the composite key for a
>>>>>> group to eventually expire (in SQL terms, you just need one of the
>>>>>> GROUP BY arguments to provide the deadline, and they are otherwise all
>>>>>> on equal footing). And so the window is just as much a part of the data
>>>>>> as the key. Without merging, once it is assigned you don't need to keep
>>>>>> around the WindowFn or any such. Of course, our way of automatically
>>>>>> propagating windows from inputs to outputs, akin to making MapValues the
>>>>>> default mode of computation, requires the window to be a distinguished
>>>>>> secondary key.
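>>>>>>
>>>>>> A sketch of that equivalence (invented for illustration; coder
>>>>>> plumbing elided, and IntervalWindow assumed so the key is encodable):
>>>>>> a GBK over the output below, in the global window, groups exactly the
>>>>>> way a windowed GBK would, minus the deadline that lets groups expire.
>>>>>>
>>>>>>   import org.apache.beam.sdk.transforms.DoFn;
>>>>>>   import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
>>>>>>   import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
>>>>>>   import org.apache.beam.sdk.values.KV;
>>>>>>
>>>>>>   // Make the window an explicit secondary key.
>>>>>>   class MoveWindowIntoKey
>>>>>>       extends DoFn<KV<String, Long>, KV<KV<String, IntervalWindow>, Long>> {
>>>>>>     @ProcessElement
>>>>>>     public void process(ProcessContext c, BoundedWindow window) {
>>>>>>       IntervalWindow w = (IntervalWindow) window;  // assumes fixed/sliding
>>>>>>       c.output(KV.of(KV.of(c.element().getKey(), w), c.element().getValue()));
>>>>>>     }
>>>>>>   }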
>>>>>>
>>>>>> Another way I think about it is that the windowing + watermark +
>>>>>> allowed lateness defines which elements are a part of a PCollection and
>>>>>> which are not. Dropped data semantically never existed in the first
>>>>>> place. This was actually independent of windowing before the "window
>>>>>> expiration" model of dropping data. I still think window expiration +
>>>>>> GC + dropping go together nicely, and drop less data needlessly, but
>>>>>> just dropping data behind the watermark + allowed lateness has some
>>>>>> appeal for isolating the operational aspect here.
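>>>>>>
>>>>>> Concretely, this is where that boundary is declared today (a sketch;
>>>>>> the durations are arbitrary): anything arriving after the watermark
>>>>>> plus ten minutes is, semantically, not part of this PCollection.
>>>>>>
>>>>>>   import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
>>>>>>   import org.apache.beam.sdk.transforms.windowing.FixedWindows;
>>>>>>   import org.apache.beam.sdk.transforms.windowing.Window;
>>>>>>   import org.apache.beam.sdk.values.PCollection;
>>>>>>   import org.joda.time.Duration;
>>>>>>
>>>>>>   class Membership {
>>>>>>     // Late data beyond watermark + 10 min "never existed".
>>>>>>     static PCollection<String> bound(PCollection<String> input) {
>>>>>>       return input.apply(
>>>>>>           Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
>>>>>>               .triggering(AfterWatermark.pastEndOfWindow())
>>>>>>               .withAllowedLateness(Duration.standardMinutes(10))
>>>>>>               .discardingFiredPanes());
>>>>>>     }
>>>>>>   }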
>>>>>>
>>>>>> Operationally, you might take the view that the act of expiration and
>>>>>> dropping all remaining data is a configuration on the GBK. Then the
>>>>>> WindowingStrategy, like windows and KV, is a plumbing device to reach a
>>>>>> GBK that may be deep in a composite (which is certainly true anyhow). I
>>>>>> don't really like this, because I would like the output of a GBK to be a
>>>>>> straightforward function of its input; in the unbounded case I would
>>>>>> like it to be specified as having to agree with the bounded spec for any
>>>>>> finite prefix. I'm not sure if an operational view is amenable to this.
>>>>>> If they both work, then being able to switch perspectives back and forth
>>>>>> would be cool.
>>>>>>
>>>>>> I think there are some inconsistencies in the above intuitions, and
>>>>>> then there's merging...
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>>
>>>>>>> Also, I think anyone reading this document really ought to at least
>>>>>>> skim the (linked from there) http://s.apache.org/beam-streams-tables
>>>>>>> and internalize the idea of "PCollections as changelogs, aggregations
>>>>>>> as tables on which the changelog acts". It probably would be good to
>>>>>>> rewrite our documentation with this in mind: even with my experience on
>>>>>>> the Beam team, this simple idea made it much easier for me to think
>>>>>>> clearly about all the concepts.
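>>>>>>>
>>>>>>> A toy rendering of the idea, nothing Beam-specific (names invented):
>>>>>>> a changelog of (key, delta) updates acting on a table.
>>>>>>>
>>>>>>>   import java.util.HashMap;
>>>>>>>   import java.util.List;
>>>>>>>   import java.util.Map;
>>>>>>>
>>>>>>>   class ChangelogToTable {
>>>>>>>     // Each entry is a (key, delta) update; the table is the running
>>>>>>>     // aggregation the changelog acts on.
>>>>>>>     static Map<String, Long> replay(List<Map.Entry<String, Long>> log) {
>>>>>>>       Map<String, Long> table = new HashMap<>();
>>>>>>>       for (Map.Entry<String, Long> update : log) {
>>>>>>>         table.merge(update.getKey(), update.getValue(), Long::sum);
>>>>>>>       }
>>>>>>>       return table;
>>>>>>>     }
>>>>>>>   }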
>>>>>>>
>>>>>>> I'm very excited about both of these ideas; I think they rival in
>>>>>>> importance the idea of batch/streaming unification and will end up being
>>>>>>> a fundamental part of the future of the Beam model.
>>>>>>>
>>>>>>> On Thu, Nov 30, 2017 at 8:52 PM Jean-Baptiste Onofré <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Kenn,
>>>>>>>>
>>>>>>>> Very interesting idea. It sounds more usable and "logical".
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> JB
>>>>>>>>
>>>>>>>> On 11/30/2017 09:06 PM, Kenneth Knowles wrote:
>>>>>>>> > Hi all,
>>>>>>>> >
>>>>>>>> > Triggers are one of the more novel aspects of Beam's support for
>>>>>>>> unbounded data.
>>>>>>>> > They are also one of the most challenging aspects of the model.
>>>>>>>> >
>>>>>>>> > Ben & I have been working on a major new idea for how triggers
>>>>>>>> could work in the
>>>>>>>> > Beam model. We think it will make triggers much more usable,
>>>>>>>> create new
>>>>>>>> > opportunities for no-knobs execution/optimization, and improve
>>>>>>>> compatibility
>>>>>>>> > with DSLs like SQL. (also eliminate a whole class of bugs)
>>>>>>>> >
>>>>>>>> > Triggering is for sinks!
>>>>>>>> >
>>>>>>>> > https://s.apache.org/beam-sink-triggers
>>>>>>>> >
>>>>>>>> > Please take a look at this "1"-pager and give feedback.
>>>>>>>> >
>>>>>>>> > Kenn
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jean-Baptiste Onofré
>>>>>>>> [email protected]
>>>>>>>> http://blog.nanthrax.net
>>>>>>>> Talend - http://www.talend.com
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
