You've done a very good analysis*, and I think your solution is pretty
clever. The simple fact is this: the watermark has to be held to the
minimum timestamp of any output you intend to produce. So for your use
case, the hold has to be the timestamp of the Green element. Your solution
does hold the watermark to the right time. I have a couple of thoughts that
may be helpful.

0. If you partition by user, does the stream contain a bunch of Orange,
Green, Blue elements? Is it possible that a session contains multiple
[Orange, Green, Blue] sequences? Is it possible that an [Orange, Green,
Blue] sequence is split across multiple sessions?

1. Your proposed solution could probably be expressed as a new merging
WindowFn. You would assign each Green element to two tagged windows, of
types GreenFromOrange and GreenToBlue, and have separate window tags for
OrangeWindow and BlueWindow. Then GreenFromOrange merges only with
OrangeWindow, and GreenToBlue merges only with BlueWindow.

2. This might also turn out to be simpler as a stateful DoFn, where you
manually manage which state the funnel is in. When you set a timer to wait
for the Orange element, you may need an upcoming feature where you set a
timer for a future event time but the watermark is held to the Green
element's timestamp. CC Reuven on that use case.
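
To make the funnel state in (2) concrete, here is a rough sketch in plain
Python, not Beam code. It keeps a per-user dict the way a stateful DoFn
would keep per-key state, emits each span at the Green click's timestamp,
and then averages spans by the UTC day of the Green click. The names
funnel_spans and average_span_by_green_day are made up for illustration,
and the sketch assumes one [Orange, Green, Blue] sequence per user at a
time. It ignores watermarks and timers entirely.

```python
from collections import defaultdict
from datetime import datetime, timezone


def funnel_spans(events):
    """Yield (green_ts, span) once a user has clicked all three buttons.

    events: iterable of (user, color, ts_seconds) in arrival order.
    Arrival order need not match event-time order.
    """
    state = defaultdict(dict)  # per-user state, standing in for per-key state
    for user, color, ts in events:
        seen = state[user]
        seen[color] = ts
        if {"orange", "green", "blue"} <= seen.keys():
            # The span is Blue minus Orange, but it is stamped with Green's time.
            yield seen["green"], seen["blue"] - seen["orange"]
            del state[user]  # reset so the user can start a new funnel


def average_span_by_green_day(events):
    """Average the spans, grouped by the UTC date of the Green click."""
    totals = defaultdict(lambda: [0.0, 0])  # day -> [sum of spans, count]
    for green_ts, span in funnel_spans(events):
        day = datetime.fromtimestamp(green_ts, tz=timezone.utc).date()
        totals[day][0] += span
        totals[day][1] += 1
    return {day: total / count for day, (total, count) in totals.items()}
```

In a real pipeline the piece this sketch leaves out is exactly the hard
part: the output is stamped with Green's timestamp, so the watermark would
have to be held there until the span is emitted.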

What I would like to avoid is you having to do two shuffles (on whatever
runner). This should be doable with one.

*SessionWindow plus EARLIEST holding up the watermark/pipeline was an early
complaint. That is part of why we switched the default to end-of-window
(it is also easier to understand and more efficient to compute).
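
As a toy model of why EARLIEST drags the watermark, in plain Python rather
than Beam: assume the output watermark cannot pass the earliest pending
hold, and that a session's hold is its output timestamp under the chosen
combiner. The functions output_watermark and session_hold are hypothetical
names for this illustration, and the end-of-window time is simplified to
the last element's timestamp plus the gap.

```python
def output_watermark(input_watermark, holds):
    """The output watermark cannot advance past the earliest pending hold."""
    return min([input_watermark, *holds])


def session_hold(element_timestamps, gap, combiner):
    """Timestamp a still-open session would hold the watermark to (toy model)."""
    if combiner == "EARLIEST":
        # Output is stamped with the first element, so the hold sits there.
        return min(element_timestamps)
    if combiner == "END_OF_WINDOW":
        # Output is stamped at the session's end: last element plus the gap.
        return max(element_timestamps) + gap
    raise ValueError(f"unknown combiner: {combiner}")
```

With elements at 10, 50 and 90, a gap of 30 and an input watermark of 100,
the EARLIEST hold pins the output watermark at 10, while the END_OF_WINDOW
hold of 120 lets it follow the input watermark to 100.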

Kenn

On Wed, Nov 13, 2019 at 3:25 PM Aaron Dixon <[email protected]> wrote:

> This is a real use case we have, but simplified:
>
> My user sessions look like this: user visits a page, and clicks three
> buttons: Orange then Green then Blue.
>
> I need to compute the average time between Orange & Blue clicks but I need
> to window on the timestamp of the green button click.
>
> In requirements terms: Compute average time between Orange and Blue for
> all Green clicks that occur on Monday. (So User could click Orange on
> Sunday, Green on Monday and Blue on Tuesday.)
>
> One strategy is to try to use a single SessionWindow to capture the entire
> user session; then calculate the *span* (time between Orange and Blue
> clicks) and *then* compute average of all spans.
>
> To do this the *span*/counts would have to all "land" in a window
> representing Monday.
>
> If I use a SessionWindow w/ TimestampCombiner/EARLIEST then I can make
> sure they land in this window using .outputWithTimestamp without worrying
> that I'll be regressing the event timestamp.
>
> Except when I use this Combiner/EARLIEST strategy my watermark is held up
> substantially (and incidentally seems to drag the pipeline).
>
> But if I use Beam's default TimestampCombiner/END_OF_WINDOW then I won't
> be able to output the *span* result at a timestamp representing the Green
> click.
>
> So a single SessionWindow seems out. (Unless I'm missing something.)
>
> The only other strategy I can conceive of at the moment is to capture
> *two* sessions, representing each "leg" of the overall session. One
> windows on the [Orange,Green] (using END_OF_WINDOW); the other [Green,Blue]
> (using EARLIEST). Then I can "join" these two to get both legs together and
> compute the overall span. This seems like a quite complicated way to solve
> this (simple?) problem.
>
> Thoughts? What am I missing?
>
