Re: Timer.withOutputTimestamp().offset().setRelative seems unusable with event time

Reuven Lax Wed, 05 May 2021 11:12:32 -0700

Timer ordering is I believe part of the Beam model. Cleanup timers are not
though - that's an implementation detail (and an optimization, since the
programming model works fine if nothing is ever cleaned up).


On Wed, May 5, 2021 at 11:02 AM Kenneth Knowles <k...@apache.org> wrote:

>
>
> On Wed, May 5, 2021 at 9:55 AM Reuven Lax <re...@google.com> wrote:
>
>>
>>
>> On Wed, May 5, 2021 at 9:36 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Yes, what I meant was the distinguishing of cleanup timers from plain
>>> user other timers - that seems to be due to the fact that they fire base on
>>> different watermark (input/output).
>>>
>> They don't though, because AFAIK no runner has the capability to fire
>> timers based on output watermarks. Beam relies on timer ordering to fire
>> cleanup timers, which has a somewhat similar effect - by the time the
>> cleanup timer fires, all other timers are guaranteed to have fired.
>>
>
> Again, cleanup timers are not part of the Beam model. Beam doesn't rely on
> ordering of timers. Beam doesn't even rely on the _existence_ of cleanup
> timers. Beam doesn't even rely on the existence of _cleanup_.
>
> Kenn
>
>
>>
>>
>>> And firing timers based on output watermark might be actually a good
>>> user-facing feature, because that might help tracking output watermark in
>>> transforms that want to deal with potentially droppable data downstream
>>> (the input would have to be re-windowed to global window, of course). I
>>> don't know if there are other use-cases, if not maybe it might be
>>> sufficient to create a DroppableDataSplit transform, that would create a
>>> PCollectionTuple with droppable and other data. But that was just an idea
>>> when Kenn mentioned that the cleanup timers "fire differently" - I
>>> generally think that when there is a need for a different behavior, than it
>>> might signal there is something possibly fundamental.
>>>
>>
>> We provide OnWindowExpiration, which is basically this. It fires after
>> all other windows and timers, but just before the cleanup happens (in fact
>> we currently piggy back on the cleanup timer to implement
>> OnWindowExpiration).
>>
>>>  Jan
>>> On 5/5/21 5:11 PM, Reuven Lax wrote:
>>>
>>> This is one way to put it. I think in practice Beam guarantees that
>>> timers fire in order for a given key (though there is still a bit of a bug
>>> around looping timers - the fix for that got rolled back). This means that
>>> as long as the runner sets the cleanup timer to be 1ms passed the end of
>>> the window (plus allowed lateness), it's guaranteed to be the last timer
>>> that fires for that window.
>>>
>>>
>>>
>>> On Wed, May 5, 2021 at 2:41 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>
>>>> Hm, one thing in the last paragraph seems there might be some logical
>>>> gap.
>>>>
>>>> > For a runner author to implement a "cleanup timer" requires a
>>>> different mechanism. A window expires when *both* the input element
>>>> watermark *and* the timer watermark are past the expiry time. In other
>>>> words, the cleanup timer fires according to the minimum of these
>>>> watermarks, combined. It *cannot* fire according to the input element
>>>> watermark. If you naively try to implement it as a user timer, it will be
>>>> incorrect. Incidentally this is why @OnWindowExpiration is a meaningful
>>>> feature.
>>>> The description describes a timer that fires not according to input
>>>> watermark, but according to the output watermark (once the output watermark
>>>> reaches certain point in time). That logically implies, that such a timer
>>>> cannot have non-droppable output (at least if its output timestamp belongs
>>>> to the respective window) and cannot create a watermark hold (because that
>>>> would block the progress of the output watermark and might cause the timer
>>>> to not fire ever). This maybe might be a useful user-feature as well,
>>>> probably again mostly related to how user-code might want to deal with
>>>> droppable data.
>>>>
>>>>  Jan
>>>>
>>>> On 5/4/21 6:41 PM, Kenneth Knowles wrote:
>>>>
>>>> Mean to also add +Reuven Lax <re...@google.com>
>>>>
>>>> On Tue, May 4, 2021 at 9:41 AM Kenneth Knowles <k...@apache.org> wrote:
>>>>
>>>>> Explicitly pinging a couple folks who were involved in the original
>>>>> change which yours essentially reverts. There's a model question here that
>>>>> I want to clarify on-list:
>>>>>
>>>>> When you have a ParDo setting timers, you have an additional watermark
>>>>> that must be considered:
>>>>>
>>>>>  - input element watermark
>>>>>  - output watermark
>>>>>  - *(user) timer watermark*
>>>>>
>>>>> The timer watermark is an input to the ParDo. Sometimes you might
>>>>> think of the "timer channel" as a self loop, where each timer is an
>>>>> element. Each timer has a timestamp (the output timestamp) and separately
>>>>> some instructions on when to deliver that timer. This is the same as the
>>>>> usual difference between event time and processing time.
>>>>>
>>>>> The instruction on when to deliver a timer can have two forms:
>>>>>
>>>>>  - wait a certain amount of processing time
>>>>>  - deliver the timer when the *input element watermark* reaches a time
>>>>> X
>>>>>
>>>>> Here is an important point: "cleanup timers" are *not* user timers.
>>>>> They are an implementation detail. They are not part of the model. The
>>>>> runner's job is to reclaim resources as windows expire. A user should 
>>>>> never
>>>>> be reasoning about how their timers relate to cleanup timers (except for
>>>>> resource consumption). Because there is no relationship except that the
>>>>> cleanup should happen "eventually" and invisibly.
>>>>>
>>>>> For a runner author to implement a "cleanup timer" requires a
>>>>> different mechanism. A window expires when *both* the input element
>>>>> watermark *and* the timer watermark are past the expiry time. In other
>>>>> words, the cleanup timer fires according to the minimum of these
>>>>> watermarks, combined. It *cannot* fire according to the input element
>>>>> watermark. If you naively try to implement it as a user timer, it will be
>>>>> incorrect. Incidentally this is why @OnWindowExpiration is a meaningful
>>>>> feature.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Tue, May 4, 2021 at 4:29 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>>>
>>>>>> Hi Kenn,
>>>>>>
>>>>>> I created BEAM-12276 [1] with PR [2].
>>>>>>
>>>>>>  Jan
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-12276
>>>>>>
>>>>>> [2] https://github.com/apache/beam/pull/14718
>>>>>> On 5/3/21 7:46 PM, Kenneth Knowles wrote:
>>>>>>
>>>>>> This seems like just a bug. If you set a timer for X and have output
>>>>>> timestamp Y where X < Y this should be fine. Is the problem the current
>>>>>> input watermark? Are you trying to set a timer with output timestamp that
>>>>>> is already past? I think that should be allowed, too, as long as the 
>>>>>> window
>>>>>> is not expired, but I may be missing something.
>>>>>>
>>>>>> Some greater detail would be useful - maybe the full stack trace
>>>>>> and/or a failing unit test in a PR?
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Thu, Apr 29, 2021 at 12:51 AM Jan Lukavský <je...@seznam.cz>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have come across a bug with timer output timestamp - when using
>>>>>>> event
>>>>>>> time and relative timers, setting the timer can arbitrarily throw
>>>>>>> IllegalArgumentException if the firing timestamp (input watermark)
>>>>>>> is
>>>>>>> ahead of the output timestamp (like
>>>>>>> .java.lang.IllegalArgumentException:
>>>>>>> Attempted to set an event-time timer with an output timestamp of
>>>>>>> 2021-04-29T07:16:19.369Z that is after the timer firing timestamp
>>>>>>> -290308-12-21T19:59:05.225Z). But there is no way to access the
>>>>>>> firing
>>>>>>> timestamp from user code. This means that the use has to either
>>>>>>> catch
>>>>>>> the IllegalArgumentException, or not use this construct at all.
>>>>>>>
>>>>>>> Catching the exception should probably not be part of a contract, so
>>>>>>> we
>>>>>>> should do one of the following:
>>>>>>>
>>>>>>>   a) either throw the exception right away and disable using
>>>>>>> relative
>>>>>>> timers with output timestamp completely, or
>>>>>>>
>>>>>>>   b) support it correctly
>>>>>>>
>>>>>>> What is the actual reason not to support output timestamps, that are
>>>>>>> ahead of firing timestamp? From my understanding, that should not be
>>>>>>> an
>>>>>>> issue, because there is TimestampCombiner.EARLIEST on the
>>>>>>> watermarkholdstate that corresponds to the output timestamp. If that
>>>>>>> is
>>>>>>> correct can we simply remove the check?
>>>>>>>
>>>>>>>   Jan
>>>>>>>
>>>>>>>

Re: Timer.withOutputTimestamp().offset().setRelative seems unusable with event time

Reply via email to