On Wed, Nov 13, 2019 at 7:39 PM Aaron Dixon <[email protected]> wrote:
> This is a great help. Thank you. I like the custom window solution > pattern as a way to hold the watermark and merge down to keep the watermark > where it is needed. Perhaps there is some interesting generalized session > window here.. I'll have to digest the stateful DoFn approach. Avoiding > unnecessary shuffles is a good note. > > As a side note, there is MIN, MAX and END_OF_WINDOW TimestampCombiner. Has > it been discussed to ever allow more customization here? Seems like > customizing the combiner with element-awareness would have solved this > problem, as well. > Good question :-). In fact, the original design was OutputTimeFn that was a custom user-defined function. - Discussed a bit here: https://docs.google.com/document/d/12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y/edit#heading=h.45gyyckqhg4c - Removed here: https://github.com/apache/beam/pull/2683 Users found the name and choices confusing. Also, the runner needs to grok the behavior of it in order to implement many optimizations. Before portability there were a bunch of instanceof checks and anything not on the happy path was slower. With portability, it needs to be represented in protobuf. Reinstating this would mean adding a new CUSTOM enum to TimestampCombiner and it would have performance costs. Kenn > > > On Wed, Nov 13, 2019 at 7:56 PM Kenneth Knowles <[email protected]> wrote: > >> You've done a very good analysis* and I think your solution is pretty >> clever. The simple fact is this: the watermark has to be held to the >> minimum of any output you intend to produce. So for your use case, the hold >> has to be the timestamp of the Green element. Your solution does hold the >> watermark to the right time. I have a couple thoughts that may be helpful. >> >> 0. If you partition by user does the stream contain a bunch of Orange, >> Green, Blue elements? Is it possible that a session contains multiple >> [Orange, Green, Blue] sequences? Is it possible that an [Orange, Green, >> Blue] sequence is split across multiple sessions? >> >> 1. In your proposed solution, it probably could be expressed as a new >> merging WindowFn. You would assign each Green element to two tagged windows >> that were GreenFromOrange and GreenToBlue type, and have a separate window >> tag for OrangeWindow and BlueWindow. Then GreenFromOrange merges with >> OrangeWindow only, etc. >> >> 2. This might also turn out simply as a stateful DoFn, where you manually >> manage what state the funnel is in. When you set a timer to wait for the >> Orange element, you may need an upcoming feature where you set a timer for >> a future event time but the watermark is held to the Green element's >> timestamp. CC Reuven on that use case. >> >> What I would like to avoid is you having to do two shuffles (on whatever >> runner). This should be doable with one. >> >> *SessionWindow plus EARLIEST holding up the watermark/pipeline was an >> early complaint. That is part of why we switched the default to >> end-of-window (also it is easier to understand and more efficient to >> compute) >> >> Kenn >> >> On Wed, Nov 13, 2019 at 3:25 PM Aaron Dixon <[email protected]> wrote: >> >>> This is a real use case we have, but simplified: >>> >>> My user session look like this: user visits a page, and clicks three >>> buttons: Orange then Green then Blue. >>> >>> I need to compute the average time between Orange & Blue clicks but I >>> need to window on the timestamp of the green button click. >>> >>> In requirements terms: Compute average time between Orange and Blue for >>> all Green clicks that occur on Monday. (So User could click Orange on >>> Sunday, Green on Monday and Blue on Tuesday.) >>> >>> One strategy is to try to use a single SessionWindow to capture the >>> entire user session; then calculate the *span* (time between Orange and >>> Blue clicks) and *then* compute average of all spans. >>> >>> To do this the *span*/counts would have to all "land" in a window >>> representing Monday. >>> >>> If I use a SessionWindow w/ TimestampCombiner/EARLIEST then I can make >>> sure they land in this window using .outputWithTimestamp without worrying >>> that I'll be regressing the event timestamp. >>> >>> Except when I use this Combiner/EARLIEST strategy my watermark is held >>> up substantially (and incidentally seems to drag the pipeline). >>> >>> But if I use Beam's default TimestampCombiner/END_OF_WINDOW then I won't >>> be able to output the *span* result at a timestamp representing the >>> Green click. >>> >>> So a single SessionWindow seems out. (Unless I'm missing something.) >>> >>> The only other strategy I can conceive of at the moment is to capture >>> *two* sessions, representing each "leg" of the overall session. One >>> windows on the [Orange,Green] (using END_OF_WINDOW); the other [Green,Blue] >>> (using EARLIEST). Then I can "join" these two to get both legs together and >>> compute the overall span. This seems like a quite complicated way to solve >>> this (simple?) problem. >>> >>> Thoughts? What am I missing? >>> >>
