This is a simplified version of a real use case we have. A user session looks like this: the user visits a page and clicks three buttons: Orange, then Green, then Blue.
I need to compute the average time between the Orange and Blue clicks, but I need to window on the timestamp of the Green click. In requirements terms: compute the average time between Orange and Blue for all Green clicks that occur on Monday. (So a user could click Orange on Sunday, Green on Monday, and Blue on Tuesday.)

One strategy is to use a single SessionWindow to capture the entire user session, then calculate the *span* (the time between the Orange and Blue clicks), and *then* compute the average of all spans. For this to work, the *span*/counts would all have to "land" in a window representing Monday. If I use a SessionWindow with TimestampCombiner.EARLIEST, I can make sure they land in that window using .outputWithTimestamp, without worrying that I'll be regressing the event timestamp. Except that when I use this EARLIEST strategy, my watermark is held up substantially (and, incidentally, that seems to drag the pipeline). But if I use Beam's default, TimestampCombiner.END_OF_WINDOW, then I won't be able to output the *span* result at a timestamp representing the Green click. So a single SessionWindow seems to be out. (Unless I'm missing something.)

The only other strategy I can conceive of at the moment is to capture *two* sessions, one for each "leg" of the overall session: one windows on [Orange, Green] (using END_OF_WINDOW); the other on [Green, Blue] (using EARLIEST). Then I can "join" the two to get both legs together and compute the overall span.

This seems like quite a complicated way to solve this (simple?) problem. Thoughts? What am I missing?
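
To pin down the requirement independently of Beam, here is a minimal batch sketch in plain Java. It is not a pipeline and says nothing about watermarks; it only shows which result I expect. The names (GreenDaySpans, Session, averageSpanMillisByGreenDay, sampleSessions) and the sample timestamps are made up for illustration, and I'm assuming days are bucketed in UTC.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class GreenDaySpans {

    /** Hypothetical holder for the three click timestamps of one user session. */
    static final class Session {
        final Instant orange;
        final Instant green;
        final Instant blue;

        Session(Instant orange, Instant green, Instant blue) {
            this.orange = orange;
            this.green = green;
            this.blue = blue;
        }
    }

    /**
     * Groups sessions by the UTC calendar day of the Green click (assumption:
     * UTC days) and averages the Orange-to-Blue span, in milliseconds, per day.
     */
    static Map<LocalDate, Double> averageSpanMillisByGreenDay(List<Session> sessions) {
        return sessions.stream().collect(Collectors.groupingBy(
                s -> s.green.atZone(ZoneOffset.UTC).toLocalDate(),
                TreeMap::new,
                Collectors.averagingLong(
                        s -> Duration.between(s.orange, s.blue).toMillis())));
    }

    /** Made-up sample data; 2024-01-08 is a Monday. */
    static List<Session> sampleSessions() {
        return List.of(
                // Orange on Sunday, Green on Monday, Blue on Tuesday: span is 27 hours.
                new Session(Instant.parse("2024-01-07T23:00:00Z"),
                            Instant.parse("2024-01-08T01:00:00Z"),
                            Instant.parse("2024-01-09T02:00:00Z")),
                // All three clicks on Monday: span is 10 minutes.
                new Session(Instant.parse("2024-01-08T10:00:00Z"),
                            Instant.parse("2024-01-08T10:05:00Z"),
                            Instant.parse("2024-01-08T10:10:00Z")),
                // Orange on Monday but Green on Tuesday: counts toward Tuesday, not Monday.
                new Session(Instant.parse("2024-01-08T23:50:00Z"),
                            Instant.parse("2024-01-09T00:05:00Z"),
                            Instant.parse("2024-01-09T00:10:00Z")));
    }

    public static void main(String[] args) {
        System.out.println(averageSpanMillisByGreenDay(sampleSessions()));
    }
}
```

With this sample data, Monday averages the 27-hour and 10-minute spans (48,900,000 ms, i.e. 13 h 35 min), even though the first session starts on Sunday and ends on Tuesday. The third session is excluded from Monday because its Green click falls on Tuesday. That "assign the whole span to Green's day" behavior is what I'm trying to reproduce with windowing.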
