Reuven thanks for your insights so far. Just wanted to press a little more on the deprecation question as I'm still (so far) convinced that my use case is quite a straightforward justification (I'm looking for confirmation or correction to my thinking here.) I've simplified my use case a bit if it helps things:
Use case: "For users that login on a given calendar day, what is the average login time?" So I have two event types LOGIN and LOGOUT. I capture a user login session (using custom windowing or state api, doesn't matter) and I use the default TimestampCombiner/END_OF_WINDOW because I want my aggregations to not be delayed. However per my use case requirements I must window using the LOGIN time. So I use outputWithTimestamp plus skew configuration to this end. Since most of my users login and logout within the same calendar day, I get may per-day aggregations right on time in real-time. Only for the few users that logout after the day that they login will I see actual late aggregations produced in which case I can leverage Beam's various lateness configuration levers to trade completeness for storage, etc. This to me seems a *very* straightforward justification for my use of DoFn#getAllowedTimestampSkew. Should this justify not deprecating that facility. I realize there are other various solutions, now and coming soon, that involve holding the watermark -- but any solution that requires holding the watermark means that I have to give up getting on-time aggregations at the very end of the calendar day (window). I would much rather (and reasonably so?) get on-time aggregations covering the majority of my users and be happy to refine these averages when my few latent users logout in a later day. In some Beam documentation [1] there is the idea of "unobservably late data". That is, I have specific elements that are output late (behind the watermark) but because they are guaranteed to land *within the window* and they are therefore promoted to be on-time. This conceptualization of things seems very well-suited to my simple use case but definitely open to a different way of thinking in my approach. My main concern is that my pipeline will be leveraging a Deprecated facility (DoFn#getAllowedTimestampSkew) but I don't see other viable options (within Beam) yet. (Hope I'm not pressing too hard on this question here. I think this use case is interesting because it ...seems... to be a rather simple/distilled justification for being able to output data behind the watermark mid-stream.) [1] https://beam.apache.org/blog/2016/10/20/test-stream.html On Sat, Jan 11, 2020 at 10:10 PM Aaron Dixon <atdi...@gmail.com> wrote: > Oh nice—that will be great—will look forward to this one! Any idea of > Dataflow will support? > > On Sat, Jan 11, 2020 at 9:07 PM Reuven Lax <re...@google.com> wrote: > >> There is now (as of last week) a way to hold back the watermark with the >> state API (though not yet in a released version of Beam). If you set a >> timer using withOutputTimetstamp(t), the watermark will be held to t. >> >> On Sat, Jan 11, 2020 at 4:15 PM Aaron Dixon <atdi...@gmail.com> wrote: >> >>> Hi Reuven thanks for your quick reply >>> >>> I've tried that but the drag it puts on the watermark was too >>> intrusive. For example, -- even if just a single user among many decided to >>> remain logged-in for a few days then the watermark holds everything else >>> back. >>> >>> This was when using a custom session window. I've recently been using >>> the State API to do my custom session tracking to avoid issues with >>> downward merging of windows (see earlier mailing list thread) ... with the >>> State API .. I'm not able to hold the watermark back (I think) ... but in >>> any case, I prefer the behavior where the watermark moves forward with the >>> upstream events and to deal with the very few straggler users by a lateness >>> configuration. >>> >>> Does that make sense? So far to me this seems very reasonable (to want >>> to keep the watermark moving and deal w/ the late events the few of which >>> actually fall out of the window using explicit lateness configuration.) >>> >>> On Sat, Jan 11, 2020 at 4:57 PM Reuven Lax <re...@google.com> wrote: >>> >>>> Have you looked at using >>>> withTimestampCombiner(TimestampCombiner.EARLIEST)? This will hold the >>>> downstream watermark back to the beginning of the window (presumably the >>>> timestamp of the LOGIN event), so you can .call outputWithTimestamp using >>>> the CLICK GREEN timestamp without needing to set the allowed-lateness skew. >>>> >>>> Reuven >>>> >>>> On Sat, Jan 11, 2020 at 1:50 PM Aaron Dixon <atdi...@gmail.com> wrote: >>>> >>>>> I've just built a pipeline in Beam and after exploring several options >>>>> for my use case, I've ended up relying on the deprecated >>>>> .outputWithTimestamp() + DoFn#getAllowedTimestampSkew in what seems to me >>>>> a >>>>> quite valid use case. So I suppose this is a vote for un-deprecating this >>>>> API (or a teachable moment in which I could be pointed to a more suitable >>>>> non-deprecated approach.) >>>>> >>>>> I'll stick with a previously simplification of my use case: >>>>> >>>>> I get these events from my users: >>>>> LOGIN >>>>> CLICK GREEN BUTTON >>>>> LOGOUT >>>>> >>>>> I capture user session duration (logout time *minus* login time) and I >>>>> want to perform a PER DAY average (i.e., my window is on CalendarDays) BUT >>>>> where the aggregation's timestamp is the time of the CLICK GREEN event. >>>>> >>>>> So once I calculate and emit a single user's session duration I need >>>>> to .outputWithTimestamp using the CLICK GREEN event's timestamp. This >>>>> involves, of course, outputting with a timestamp *before* the watermark. >>>>> >>>>> In most cases my users LOGOUT in the same day as the CLICK GREEN >>>>> BUTTON event, so even though I'm typically outputting a timestamp before >>>>> the watermark the CalendarDay window is not yet closed and so most user >>>>> session duration's do not affect a late aggregation for that CalendarDay. >>>>> >>>>> Only when a LOGOUT occurs on a day later than the CLICK GREEN event do >>>>> I have to contend with potentially late data contributing back to a prior >>>>> CalendarDay. >>>>> >>>>> In any case, I have .withAllowedLateness to allow me to make a call >>>>> here about what I'm willing tradeoff (keeping windows open vs. dropping >>>>> data for users with overly long sessions), etc. >>>>> >>>>> This here seems to be a simple scenario (it is effectively my >>>>> real-world scenario) and the >>>>> .outputWithTimestamp + DoFn#getAllowedTimestampSkew seem to cover it in a >>>>> straightforward, effective way. >>>>> >>>>> However of course I don't like building production code on deprecated >>>>> capabilities -- so advice on alternatives (or perhaps a reconsideration of >>>>> this deprecation :) ) would be appreciated. >>>>> >>>>>