This is a great help. Thank you. I like the custom window solution
pattern as a way to hold the watermark and merge down to keep the watermark
where it is needed. Perhaps there is some interesting generalized session
window here. I'll have to digest the stateful DoFn approach. Avoiding
unnecessary shuffles is a good note.
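
To check my understanding of the stateful DoFn idea, here is a rough
plain-Java model of the per-user funnel state. It does not use the Beam API;
FunnelSketch, Span, onClick and requiredHold are all hypothetical names, and
it assumes clicks for one user arrive in event-time order, which a real
pipeline would not guarantee.

```java
import java.time.Duration;
import java.time.Instant;

// Plain-Java model of the funnel state a stateful DoFn might keep per user.
// Hypothetical names throughout; this is not Beam code.
final class FunnelSketch {
    enum Color { ORANGE, GREEN, BLUE }

    // One completed funnel: the Orange-to-Blue span, stamped with the Green click's time.
    static final class Span {
        final Duration orangeToBlue;
        final Instant outputTimestamp;

        Span(Duration orangeToBlue, Instant outputTimestamp) {
            this.orangeToBlue = orangeToBlue;
            this.outputTimestamp = outputTimestamp;
        }
    }

    private Instant orange; // null until an Orange click is seen
    private Instant green;  // null until a Green click follows an Orange click

    // Assumes event-time order. Returns a Span when Blue completes the funnel, else null.
    Span onClick(Color color, Instant timestamp) {
        switch (color) {
            case ORANGE:
                orange = timestamp;
                green = null;
                return null;
            case GREEN:
                if (orange != null) {
                    green = timestamp;
                }
                return null;
            case BLUE:
                if (orange == null || green == null) {
                    return null; // incomplete funnel, nothing to emit
                }
                Span span = new Span(Duration.between(orange, timestamp), green);
                orange = null;
                green = null;
                return span;
            default:
                return null;
        }
    }

    // While waiting for Blue, the eventual output carries the Green timestamp,
    // so that is the time a watermark hold would need to protect. Null if nothing is pending.
    Instant requiredHold() {
        return green;
    }
}
```

With Orange on Sunday, Green on Monday and Blue on Tuesday, the span would be
emitted with Monday's timestamp, and the hold is only needed from the Green
click onward rather than from the start of the session.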

As a side note, TimestampCombiner offers EARLIEST, LATEST and END_OF_WINDOW.
Has allowing more customization here ever been discussed? It seems that
customizing the combiner with element-awareness would have solved this
problem as well.
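
To illustrate what I mean by element-awareness, here is a small plain-Java
sketch. It is only a model of the idea, not an existing Beam extension point;
ElementAwareCombinerSketch, Click and earliestGreen are hypothetical names.

```java
import java.time.Instant;
import java.util.List;

// Models the difference between an EARLIEST-style combiner and a hypothetical
// element-aware one. Not Beam code; all names here are made up for illustration.
final class ElementAwareCombinerSketch {
    enum Color { ORANGE, GREEN, BLUE }

    static final class Click {
        final Color color;
        final Instant timestamp;

        Click(Color color, Instant timestamp) {
            this.color = color;
            this.timestamp = timestamp;
        }
    }

    // Same idea as EARLIEST: the minimum timestamp over every element in the window.
    // Returns null for an empty window.
    static Instant earliest(List<Click> window) {
        Instant min = null;
        for (Click c : window) {
            if (min == null || c.timestamp.isBefore(min)) {
                min = c.timestamp;
            }
        }
        return min;
    }

    // Hypothetical element-aware variant: only Green clicks contribute,
    // so Orange clicks never pull the output timestamp (or the hold) earlier.
    // Returns null if the window has no Green click.
    static Instant earliestGreen(List<Click> window) {
        Instant min = null;
        for (Click c : window) {
            if (c.color == Color.GREEN && (min == null || c.timestamp.isBefore(min))) {
                min = c.timestamp;
            }
        }
        return min;
    }
}
```

For a session with Orange on Sunday, Green on Monday and Blue on Tuesday,
earliest gives Sunday's timestamp while earliestGreen gives Monday's, which
is where I want the result to land.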


On Wed, Nov 13, 2019 at 7:56 PM Kenneth Knowles <[email protected]> wrote:

> You've done a very good analysis* and I think your solution is pretty
> clever. The simple fact is this: the watermark has to be held to the
> minimum of any output you intend to produce. So for your use case, the hold
> has to be the timestamp of the Green element. Your solution does hold the
> watermark to the right time. I have a couple thoughts that may be helpful.
>
> 0. If you partition by user does the stream contain a bunch of Orange,
> Green, Blue elements? Is it possible that a session contains multiple
> [Orange, Green, Blue] sequences? Is it possible that an [Orange, Green,
> Blue] sequence is split across multiple sessions?
>
> 1. In your proposed solution, it probably could be expressed as a new
> merging WindowFn. You would assign each Green element to two tagged windows
> that were GreenFromOrange and GreenToBlue type, and have a separate window
> tag for OrangeWindow and BlueWindow. Then GreenFromOrange merges with
> OrangeWindow only, etc.
>
> 2. This might also turn out simply as a stateful DoFn, where you manually
> manage what state the funnel is in. When you set a timer to wait for the
> Orange element, you may need an upcoming feature where you set a timer for
> a future event time but the watermark is held to the Green element's
> timestamp. CC Reuven on that use case.
>
> What I would like to avoid is you having to do two shuffles (on whatever
> runner). This should be doable with one.
>
> *SessionWindow plus EARLIEST holding up the watermark/pipeline was an
> early complaint. That is part of why we switched the default to
> end-of-window (also it is easier to understand and more efficient to
> compute)
>
> Kenn
>
> On Wed, Nov 13, 2019 at 3:25 PM Aaron Dixon <[email protected]> wrote:
>
>> This is a real use case we have, but simplified:
>>
>> My user sessions look like this: user visits a page, and clicks three
>> buttons: Orange then Green then Blue.
>>
>> I need to compute the average time between Orange & Blue clicks but I
>> need to window on the timestamp of the green button click.
>>
>> In requirements terms: Compute average time between Orange and Blue for
>> all Green clicks that occur on Monday. (So User could click Orange on
>> Sunday, Green on Monday and Blue on Tuesday.)
>>
>> One strategy is to try to use a single SessionWindow to capture the
>> entire user session; then calculate the *span* (time between Orange and
>> Blue clicks) and *then* compute average of all spans.
>>
>> To do this the *span*/counts would have to all "land" in a window
>> representing Monday.
>>
>> If I use a SessionWindow w/ TimestampCombiner/EARLIEST then I can make
>> sure they land in this window using .outputWithTimestamp without worrying
>> that I'll be regressing the event timestamp.
>>
>> Except when I use this Combiner/EARLIEST strategy my watermark is held up
>> substantially (and incidentally seems to drag the pipeline).
>>
>> But if I use Beam's default TimestampCombiner/END_OF_WINDOW then I won't
>> be able to output the *span* result at a timestamp representing the
>> Green click.
>>
>> So a single SessionWindow seems out. (Unless I'm missing something.)
>>
>> The only other strategy I can conceive of at the moment is to capture
>> *two* sessions, representing each "leg" of the overall session. One
>> windows on the [Orange,Green] (using END_OF_WINDOW); the other [Green,Blue]
>> (using EARLIEST). Then I can "join" these two to get both legs together and
>> compute the overall span. This seems like quite a complicated way to solve
>> this (simple?) problem.
>>
>> Thoughts? What am I missing?
>>
>
