Hi Jan, I want to acknowledge your careful consideration of the community here.
I myself have simply not had the time to dedicate to considering this proposal. So, like Max, I would have a bit of an "outside" perspective so would hesitate to cast any sort of vote. I think you have chosen a good course of action - continue to grow awareness and understanding of the problem / awareness and understanding of the solution. Kenn On Tue, Nov 12, 2019 at 12:45 AM Jan Lukavský <je...@seznam.cz> wrote: > I'm cancelling this due to lack of activity. I will issue a follow-up > thread to find solution. > > On 11/9/19 11:45 AM, Jan Lukavský wrote: > > Hi, > > > > I'll try to summarize the mailing list threads to clarify why I think > > this addition is needed (and actually necessary): > > > > a) there are situations where the order of input events matter > > (obviously any finite state machine) > > > > b) in streaming case, this can be handled by the current machinery > > (e.g. holding elements in state, sorting all elements with timestamp > > less than input watermark, dropping latecomers) > > > > c) in batch case, this can be handled the same way, but > > > > i) due to the nature of batch processing, that has extreme > > requirements on the size of state needed to hold the elements > > (actually, in extreme, that might be the whole input, which might not > > be feasible) > > > > ii) although it is true, that watermark might (and will) fall behind > > in streaming processing as well so that similar issues might arise > > there too, it is hardly imaginable that it will fall behind as much as > > several years (but it is absolutely natural in batch case) - I'm > > talking about regular streaming processing, not some kappa like > > architectures, where this happens as well, but is causes troubles ([1]) > > > > iii) given the fact, that some runners already use sort-merge > > groupings, it is actually virtually for free to also sort elements > > inside groups by timestamps, the runner just has to know, that it > > should do so > > > > I don't want to go too far into details to keep this focused, but the > > fact that runner would know that it should sort by timestamp before > > stateful pardo brings additional features that are currently > > unavailable - e.g. actually shift event time smoothly, as elements > > flow through, not from -inf to +inf in one shot. That might have > > positive effect on timers being fired smoothly and thus for instance > > being able to free some state that would have to be held until the end > > of computation otherwise. > > > > Therefore, I think it is essential for users to be able to tell runner > > that a particular stateful pardo depends on order of input events, so > > that the runner can use optimizations available in batch case. The > > streaming case is mostly unaffected by that, because all the sorting > > can be handled the usual way. > > > > Hope this helps to clarify why it would be good to introduce (some > > way) to mark stateful pardos as "time sorted". > > > > Cheers, > > > > Jan > > > > [1] > > > https://www.ververica.com/resources/flink-forward-san-francisco-2019/moving-from-lambda-and-kappa-architectures-to-kappa-at-uber > > > > Hope these thoughts help > > > > On 11/8/19 11:35 AM, Jan Lukavský wrote: > >> Hi Max, > >> > >> thanks for comment. I probably should have put links to discussion > >> threads here in the vote thread. Relevant would be > >> > >> - (a pretty lengthy) discussion about whether sorting by timestamp > >> should be part of the model - [1] > >> > >> - part of the discussion related to the annotation - [2] > >> > >> Regarding the open question in the design document - these are not > >> meant to be open questions in regard to the design of the annotation > >> and I'll remove that for now, as it is not (directly) related. > >> > >> Now - main reason for this vote is that there is actually not a clear > >> consensus in the ML thread. There are plenty of words like "should", > >> "could", "would" and "maybe", so I wanted to be sure there is > >> consensus to include this. I already run this in production for > >> several months, so it is definitely useful for me. :-) But that might > >> not be sufficient. > >> > >> I'd be very happy to answer any more questions. > >> > >> Thanks, > >> > >> Jan > >> > >> [1] > >> > https://lists.apache.org/thread.html/4609a1bb1662690d67950e76d2f1108b51327b8feaf9580de659552e@%3Cdev.beam.apache.org%3E > >> > >> [2] > >> > https://lists.apache.org/thread.html/dd9bec903102d9fcb4f390dc01513c0921eac1fedd8bcfdac630aaee@%3Cdev.beam.apache.org%3E > >> > >> On 11/8/19 11:08 AM, Maximilian Michels wrote: > >>> Hi Jan, > >>> > >>> Disclaimer: I haven't followed the discussion closely, so I do not > >>> want to comment on the technical details of the feature here. > >>> > >>> From the outside, it looks like there may be open questions. Also, > >>> we may need more motivation for what we can build with this feature > >>> or how it will become useful to users. > >>> > >>> There are many threads in Beam and I believe we need to carefully > >>> prioritize the Beam feature set in order to focus on the things that > >>> provide the most value to our users. > >>> > >>> Cheers, > >>> Max > >>> > >>> On 07.11.19 15:55, Jan Lukavský wrote: > >>>> Hi, > >>>> is there anything I can do to make this more attractive? :-) Any > >>>> feedback would be much appreciated. > >>>> Many thanks, > >>>> Jan > >>>> > >>>> Dne 5. 11. 2019 14:10 napsal uživatel Jan Lukavský <je...@seznam.cz>: > >>>> > >>>> Hi, > >>>> > >>>> I'd like to open a vote on accepting design document [1] as a > >>>> base for > >>>> implementation of @RequiresTimeSortedInput annotation for stateful > >>>> DoFns. Associated JIRA [2] and PR [3] contains only subset of > >>>> the whole > >>>> functionality (allowed lateness ignored and no possibility to > >>>> specify > >>>> UDF for time - or sequential number - to be extracted from data). > >>>> The PR > >>>> will be subject to independent review process (please feel free to > >>>> self-request review if you are interested in this) after the > >>>> vote would > >>>> eventually succeed. Missing features from the design document > >>>> will be > >>>> added later in subsequent JIRA issues, so that it doesn't block > >>>> availability of this feature. > >>>> > >>>> Please vote on adding support for @RequiresTimeSortedInput. > >>>> > >>>> The vote is open for the next 72 hours and passes if at least > >>>> three +1 > >>>> and no -1 PMC (binding) votes are cast. > >>>> > >>>> [ ] +1 Add support for @RequiresTimeSortedInput > >>>> > >>>> [ ] 0 I don't have a strong opinion about this, but I assume > >>>> it's ok > >>>> > >>>> [ ] -1 Do not support @RequiresTimeSortedInput - please provide > >>>> explanation. > >>>> > >>>> Thanks, > >>>> > >>>> Jan > >>>> > >>>> [1] > >>>> > https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/edit?usp=sharing > >>>> > >>>> > >>>> > >>>> [2] https://issues.apache.org/jira/browse/BEAM-8550 > >>>> > >>>> [3] https://github.com/apache/beam/pull/8774 > >>>> > >>>> >