Oh nice—that will be great—will look forward to this one! Any idea of
Dataflow will support?

On Sat, Jan 11, 2020 at 9:07 PM Reuven Lax <[email protected]> wrote:

> There is now (as of last week) a way to hold back the watermark with the
> state API (though not yet in a released version of Beam). If you set a
> timer using withOutputTimetstamp(t), the watermark will be held to t.
>
> On Sat, Jan 11, 2020 at 4:15 PM Aaron Dixon <[email protected]> wrote:
>
>> Hi Reuven thanks for your quick reply
>>
>>  I've tried that but the drag it puts on the watermark was too intrusive.
>> For example, -- even if just a single user among many decided to remain
>> logged-in for a few days then the watermark holds everything else back.
>>
>> This was when using a custom session window. I've recently been using the
>> State API to do my custom session tracking to avoid issues with downward
>> merging of windows (see earlier mailing list thread) ... with the State API
>> .. I'm not able to hold the watermark back (I think) ... but in any case, I
>> prefer the behavior where the watermark moves forward with the upstream
>> events and to deal with the very few straggler users by a lateness
>> configuration.
>>
>> Does that make sense? So far to me this seems very reasonable (to want to
>> keep the watermark moving and deal w/ the late events the few of which
>> actually fall out of the window using explicit lateness configuration.)
>>
>> On Sat, Jan 11, 2020 at 4:57 PM Reuven Lax <[email protected]> wrote:
>>
>>> Have you looked at using
>>> withTimestampCombiner(TimestampCombiner.EARLIEST)? This will hold the
>>> downstream watermark back to the beginning of the window (presumably the
>>> timestamp of the LOGIN event), so you can .call outputWithTimestamp using
>>> the CLICK GREEN timestamp without needing to set the allowed-lateness skew.
>>>
>>> Reuven
>>>
>>> On Sat, Jan 11, 2020 at 1:50 PM Aaron Dixon <[email protected]> wrote:
>>>
>>>> I've just built a pipeline in Beam and after exploring several options
>>>> for my use case, I've ended up relying on the deprecated
>>>> .outputWithTimestamp() + DoFn#getAllowedTimestampSkew in what seems to me a
>>>> quite valid use case. So I suppose this is a vote for un-deprecating this
>>>> API (or a teachable moment in which I could be pointed to a more suitable
>>>> non-deprecated approach.)
>>>>
>>>> I'll stick with a previously simplification of my use case:
>>>>
>>>> I get these events from my users:
>>>>     LOGIN
>>>>     CLICK GREEN BUTTON
>>>>     LOGOUT
>>>>
>>>> I capture user session duration (logout time *minus* login time) and I
>>>> want to perform a PER DAY average (i.e., my window is on CalendarDays) BUT
>>>> where the aggregation's timestamp is the time of the CLICK GREEN event.
>>>>
>>>> So once I calculate and emit a single user's session duration I need to
>>>> .outputWithTimestamp using the CLICK GREEN event's timestamp. This
>>>> involves, of course, outputting with a timestamp *before* the watermark.
>>>>
>>>> In most cases my users LOGOUT in the same day as the CLICK GREEN BUTTON
>>>> event, so even though I'm typically outputting a timestamp before the
>>>> watermark the CalendarDay window is not yet closed and so most user session
>>>> duration's do not affect a late aggregation for that CalendarDay.
>>>>
>>>> Only when a LOGOUT occurs on a day later than the CLICK GREEN event do
>>>> I have to contend with potentially late data contributing back to a prior
>>>> CalendarDay.
>>>>
>>>> In any case, I have .withAllowedLateness to allow me to make a call
>>>> here about what I'm willing tradeoff (keeping windows open vs. dropping
>>>> data for users with overly long sessions), etc.
>>>>
>>>> This here seems to be a simple scenario (it is effectively my
>>>> real-world scenario) and the
>>>> .outputWithTimestamp + DoFn#getAllowedTimestampSkew seem to cover it in a
>>>> straightforward, effective way.
>>>>
>>>> However of course I don't like building production code on deprecated
>>>> capabilities -- so advice on alternatives (or perhaps a reconsideration of
>>>> this deprecation :) ) would be appreciated.
>>>>
>>>>

Reply via email to