Re: [CANCELLED] [VOTE] @RequiresTimeSortedInput stateful DoFn annotation

Kenneth Knowles Thu, 14 Nov 2019 01:17:15 -0800

Hi Jan,

I want to acknowledge your careful consideration of the community here.


I myself have simply not had the time to dedicate to considering this
proposal. So, like Max, I have a bit of an "outside" perspective and
would hesitate to cast any sort of vote.

I think you have chosen a good course of action - continue to grow
awareness and understanding of both the problem and the solution.

Kenn

On Tue, Nov 12, 2019 at 12:45 AM Jan Lukavský <je...@seznam.cz> wrote:

> I'm cancelling this due to lack of activity. I will issue a follow-up
> thread to find a solution.
>
> On 11/9/19 11:45 AM, Jan Lukavský wrote:
> > Hi,
> >
> > I'll try to summarize the mailing list threads to clarify why I think
> > this addition is needed (and actually necessary):
> >
> >  a) there are situations where the order of input events matters
> > (obviously any finite state machine)
> >
> >  b) in the streaming case, this can be handled by the current machinery
> > (e.g. holding elements in state, sorting all elements with timestamp
> > less than the input watermark, dropping latecomers)
> >
> >  c) in the batch case, this can be handled the same way, but
> >
> >   i) due to the nature of batch processing, this places extreme
> > requirements on the size of state needed to hold the elements
> > (in the extreme, that might be the whole input, which might not
> > be feasible)
> >
> >   ii) although it is true that the watermark might (and will) fall behind
> > in streaming processing as well, so that similar issues might arise
> > there too, it is hardly imaginable that it will fall behind by as much as
> > several years (but that is absolutely natural in the batch case) - I'm
> > talking about regular streaming processing, not some kappa-like
> > architectures, where this happens as well, but it causes trouble ([1])
> >
> >   iii) given that some runners already use sort-merge
> > groupings, it is virtually free to also sort elements
> > inside groups by timestamp; the runner just has to know that it
> > should do so
> >
> > I don't want to go too far into details to keep this focused, but the
> > fact that the runner would know that it should sort by timestamp before
> > a stateful pardo brings additional features that are currently
> > unavailable - e.g. actually shifting event time smoothly as elements
> > flow through, not from -inf to +inf in one shot. That might have a
> > positive effect on timers being fired smoothly and thus, for instance,
> > make it possible to free some state that would otherwise have to be
> > held until the end of the computation.
> >
> > Therefore, I think it is essential for users to be able to tell the runner
> > that a particular stateful pardo depends on the order of input events, so
> > that the runner can use optimizations available in the batch case. The
> > streaming case is mostly unaffected by that, because all the sorting
> > can be handled the usual way.
> >
> > Hope this helps to clarify why it would be good to introduce (some)
> > way to mark stateful pardos as "time sorted".
> >
> > Cheers,
> >
> >  Jan
> >
> > [1]
> >
> https://www.ververica.com/resources/flink-forward-san-francisco-2019/moving-from-lambda-and-kappa-architectures-to-kappa-at-uber
> >
> > Hope these thoughts help
> >
> > On 11/8/19 11:35 AM, Jan Lukavský wrote:
> >> Hi Max,
> >>
> >> thanks for the comment. I probably should have put links to the discussion
> >> threads here in the vote thread. Relevant would be
> >>
> >>  - (a pretty lengthy) discussion about whether sorting by timestamp
> >> should be part of the model - [1]
> >>
> >>  - part of the discussion related to the annotation - [2]
> >>
> >> Regarding the open questions in the design document - these are not
> >> meant to be open questions in regard to the design of the annotation
> >> and I'll remove them for now, as they are not (directly) related.
> >>
> >> Now - the main reason for this vote is that there is actually not a clear
> >> consensus in the ML thread. There are plenty of words like "should",
> >> "could", "would" and "maybe", so I wanted to be sure there is
> >> consensus to include this. I have already been running this in production
> >> for several months, so it is definitely useful for me. :-) But that might
> >> not be sufficient.
> >>
> >> I'd be very happy to answer any more questions.
> >>
> >> Thanks,
> >>
> >>  Jan
> >>
> >> [1]
> >>
> https://lists.apache.org/thread.html/4609a1bb1662690d67950e76d2f1108b51327b8feaf9580de659552e@%3Cdev.beam.apache.org%3E
> >>
> >> [2]
> >>
> https://lists.apache.org/thread.html/dd9bec903102d9fcb4f390dc01513c0921eac1fedd8bcfdac630aaee@%3Cdev.beam.apache.org%3E
> >>
> >> On 11/8/19 11:08 AM, Maximilian Michels wrote:
> >>> Hi Jan,
> >>>
> >>> Disclaimer: I haven't followed the discussion closely, so I do not
> >>> want to comment on the technical details of the feature here.
> >>>
> >>> From the outside, it looks like there may be open questions. Also,
> >>> we may need more motivation for what we can build with this feature
> >>> or how it will become useful to users.
> >>>
> >>> There are many threads in Beam and I believe we need to carefully
> >>> prioritize the Beam feature set in order to focus on the things that
> >>> provide the most value to our users.
> >>>
> >>> Cheers,
> >>> Max
> >>>
> >>> On 07.11.19 15:55, Jan Lukavský wrote:
> >>>> Hi,
> >>>> is there anything I can do to make this more attractive? :-) Any
> >>>> feedback would be much appreciated.
> >>>> Many thanks,
> >>>>   Jan
> >>>>
> >>>> On 5. 11. 2019 14:10, Jan Lukavský <je...@seznam.cz> wrote:
> >>>>
> >>>>     Hi,
> >>>>
> >>>>     I'd like to open a vote on accepting design document [1] as a
> >>>>     base for implementation of the @RequiresTimeSortedInput annotation
> >>>>     for stateful DoFns. The associated JIRA [2] and PR [3] contain only
> >>>>     a subset of the whole functionality (allowed lateness is ignored and
> >>>>     there is no possibility to specify a UDF for the time - or
> >>>>     sequential number - to be extracted from data). The PR will be
> >>>>     subject to an independent review process (please feel free to
> >>>>     self-request review if you are interested in this) once the vote
> >>>>     succeeds. Missing features from the design document will be added
> >>>>     later in subsequent JIRA issues, so that they don't block
> >>>>     availability of this feature.
> >>>>
> >>>>     Please vote on adding support for @RequiresTimeSortedInput.
> >>>>
> >>>>     The vote is open for the next 72 hours and passes if at least
> >>>>     three +1 and no -1 PMC (binding) votes are cast.
> >>>>
> >>>>     [ ] +1 Add support for @RequiresTimeSortedInput
> >>>>
> >>>>     [ ] 0 I don't have a strong opinion about this, but I assume
> >>>>     it's ok
> >>>>
> >>>>     [ ] -1 Do not support @RequiresTimeSortedInput - please provide
> >>>>     explanation.
> >>>>
> >>>>     Thanks,
> >>>>
> >>>>       Jan
> >>>>
> >>>>     [1]
> >>>>
> https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/edit?usp=sharing
> >>>>
> >>>>
> >>>>
> >>>>     [2] https://issues.apache.org/jira/browse/BEAM-8550
> >>>>
> >>>>     [3] https://github.com/apache/beam/pull/8774
> >>>>
> >>>>
>
