Re: [VOTE] @RequiresTimeSortedInput stateful DoFn annotation

Jan Lukavský Sat, 09 Nov 2019 02:46:10 -0800

Hi,

I'll try to summarize the mailing list threads to clarify why I thinkthis addition is needed (and actually necessary):

a) there are situations where the order of input events matter(obviously any finite state machine)

b) in streaming case, this can be handled by the current machinery(e.g. holding elements in state, sorting all elements with timestampless than input watermark, dropping latecomers)


 c) in batch case, this can be handled the same way, but

i) due to the nature of batch processing, that has extremerequirements on the size of state needed to hold the elements (actually,in extreme, that might be the whole input, which might not be feasible)

ii) although it is true, that watermark might (and will) fall behindin streaming processing as well so that similar issues might arise theretoo, it is hardly imaginable that it will fall behind as much as severalyears (but it is absolutely natural in batch case) - I'm talking aboutregular streaming processing, not some kappa like architectures, wherethis happens as well, but is causes troubles ([1])

iii) given the fact, that some runners already use sort-mergegroupings, it is actually virtually for free to also sort elementsinside groups by timestamps, the runner just has to know, that it shoulddo so

I don't want to go too far into details to keep this focused, but thefact that runner would know that it should sort by timestamp beforestateful pardo brings additional features that are currently unavailable- e.g. actually shift event time smoothly, as elements flow through, notfrom -inf to +inf in one shot. That might have positive effect on timersbeing fired smoothly and thus for instance being able to free some statethat would have to be held until the end of computation otherwise.

Therefore, I think it is essential for users to be able to tell runnerthat a particular stateful pardo depends on order of input events, sothat the runner can use optimizations available in batch case. Thestreaming case is mostly unaffected by that, because all the sorting canbe handled the usual way.

Hope this helps to clarify why it would be good to introduce (some way)to mark stateful pardos as "time sorted".


Cheers,

 Jan

[1]https://www.ververica.com/resources/flink-forward-san-francisco-2019/moving-from-lambda-and-kappa-architectures-to-kappa-at-uber


Hope these thoughts help

On 11/8/19 11:35 AM, Jan Lukavský wrote:

Hi Max,
thanks for comment. I probably should have put links to discussionthreads here in the vote thread. Relevant would be
- (a pretty lengthy) discussion about whether sorting by timestampshould be part of the model - [1]
 - part of the discussion related to the annotation - [2]
Regarding the open question in the design document - these are notmeant to be open questions in regard to the design of the annotationand I'll remove that for now, as it is not (directly) related.
Now - main reason for this vote is that there is actually not a clearconsensus in the ML thread. There are plenty of words like "should","could", "would" and "maybe", so I wanted to be sure there isconsensus to include this. I already run this in production forseveral months, so it is definitely useful for me. :-) But that mightnot be sufficient.
I'd be very happy to answer any more questions.

Thanks,

 Jan
[1]https://lists.apache.org/thread.html/4609a1bb1662690d67950e76d2f1108b51327b8feaf9580de659552e@%3Cdev.beam.apache.org%3E
[2]https://lists.apache.org/thread.html/dd9bec903102d9fcb4f390dc01513c0921eac1fedd8bcfdac630aaee@%3Cdev.beam.apache.org%3E
On 11/8/19 11:08 AM, Maximilian Michels wrote:
Hi Jan,
Disclaimer: I haven't followed the discussion closely, so I do notwant to comment on the technical details of the feature here.
From the outside, it looks like there may be open questions. Also, wemay need more motivation for what we can build with this feature orhow it will become useful to users.
There are many threads in Beam and I believe we need to carefullyprioritize the Beam feature set in order to focus on the things thatprovide the most value to our users.
Cheers,
Max

On 07.11.19 15:55, Jan Lukavský wrote:
Hi,
is there anything I can do to make this more attractive? :-) Anyfeedback would be much appreciated.
Many thanks,
  Jan

Dne 5. 11. 2019 14:10 napsal uživatel Jan Lukavský <[email protected]>:

    Hi,
I'd like to open a vote on accepting design document [1] as abase for
    implementation of @RequiresTimeSortedInput annotation for stateful
DoFns. Associated JIRA [2] and PR [3] contains only subset ofthe whole functionality (allowed lateness ignored and no possibility tospecify
    UDF for time - or sequential number - to be extracted from data).
    The PR
    will be subject to independent review process (please feel free to
self-request review if you are interested in this) after thevote would eventually succeed. Missing features from the design documentwill be
    added later in subsequent JIRA issues, so that it doesn't block
    availability of this feature.

    Please vote on adding support for @RequiresTimeSortedInput.
The vote is open for the next 72 hours and passes if at leastthree +1
    and no -1 PMC (binding) votes are cast.

    [ ] +1 Add support for @RequiresTimeSortedInput
[ ] 0 I don't have a strong opinion about this, but I assumeit's ok
    [ ] -1 Do not support @RequiresTimeSortedInput - please provide
    explanation.

    Thanks,

      Jan

    [1]
https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/edit?usp=sharing
    [2] https://issues.apache.org/jira/browse/BEAM-8550

    [3] https://github.com/apache/beam/pull/8774

Re: [VOTE] @RequiresTimeSortedInput stateful DoFn annotation

Reply via email to