Agreed on sticking to standard terminology and explaining details. A deep technical blog plus a section on this topic in Apex doc would work.
Thks,
Amol

On Tue, Feb 2, 2016 at 10:51 PM, Thomas Weise <[email protected]> wrote:

> We should stick with standard terminology but make sure the differences are
> well explained. That's necessary because other platforms use the same words
> with different meanings; compare Storm, Spark Streaming and Flink.
>
> Take "exactly once" as an example. Elsewhere you will find it claimed when it
> really is "at least once". Events are replayed and computation repeated.
> When all operations in the overall system are idempotent, then it is
> possible to avoid effects such as double counting, duplicate web service
> calls or rows in the database etc. Hence, the engine cannot claim to
> support "exactly once"; this is only valid when the operators used in the
> application collectively support it.
>
> In Apex, the engine provides the hooks (endWindow, committed) to achieve
> idempotency in operators that have an effect on external systems. There are
> several implementations of operators that can be used with at-least-once
> processing mode that will deliver "exactly-once" for the application when
> all operations in the DAG are idempotent.
>
> On Tue, Feb 2, 2016 at 10:26 PM, Shubham Pathak <[email protected]> wrote:
>
>> +1 for adding detailed explanation about the concepts in tutorials.
>>
>> On Wed, Feb 3, 2016 at 11:30 AM, Chinmay Kolhatkar <[email protected]> wrote:
>>
>>> +1 for Vlad's suggestion. Searching for keywords like "at least once",
>>> "at most once" and "exactly once" shows that these terms are widely
>>> used wherever semantics are defined for tuple processing.
>>> Adding example applications for each of them would help explain the
>>> terminology in the Apex context.
>>>
>>> On Wed, Feb 3, 2016 at 8:52 AM, Chanchal Singh <[email protected]> wrote:
>>>
>>>> I do agree with Vlad. It will be good to have a good explanation with
>>>> examples for the existing names, as that will not create confusion for
>>>> those who already know them or for those who are beginners.
>>>>
>>>> On Wed, Feb 3, 2016 at 8:38 AM, Amol Kekre <[email protected]> wrote:
>>>>
>>>>> I agree with Vlad too.
>>>>>
>>>>> Thks
>>>>> Amol
>>>>>
>>>>> On Tue, Feb 2, 2016 at 3:33 PM, Munagala Ramanath <[email protected]> wrote:
>>>>>
>>>>>> I agree with Vlad: these names are so deeply embedded in the community
>>>>>> that changing them is likely to create more problems than it solves.
>>>>>>
>>>>>> Ram
>>>>>>
>>>>>> On Tue, Feb 2, 2016 at 3:29 PM, Vlad Rozov <[email protected]> wrote:
>>>>>>
>>>>>>> I vote to keep the original names and educate/explain their meaning to
>>>>>>> a non-technical audience, as a delivery guarantee is not specific to
>>>>>>> Apex but has a common meaning for all streaming platforms.
>>>>>>>
>>>>>>> Vlad
>>>>>>>
>>>>>>> On 2/2/16 15:17, Timothy Farkas wrote:
>>>>>>>
>>>>>>>> Could we provide Processing and Output Centric Aliases for the
>>>>>>>> ProcessingModes?
>>>>>>>>
>>>>>>>> ProcessingMode.AT_MOST_ONCE_OUTPUT = ProcessingMode.AT_MOST_ONCE
>>>>>>>> ProcessingMode.EXACTLY_ONCE_OUTPUT = ProcessingMode.AT_LEAST_ONCE
>>>>>>>>
>>>>>>>> ProcessingMode.AT_MOST_ONCE_PROCESSING = ProcessingMode.AT_MOST_ONCE
>>>>>>>> ProcessingMode.AT_LEAST_ONCE_PROCESSING = ProcessingMode.AT_LEAST_ONCE
>>>>>>>> ProcessingMode.EXACTLY_ONCE_PROCESSING = ProcessingMode.EXACTLY_ONCE
>>>>>>>>
>>>>>>>> Tim
>>>>>>>>
>>>>>>>> On Tue, Feb 2, 2016 at 3:00 PM, Pramod Immaneni <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Well, output guarantees are managed by the operators themselves, so
>>>>>>>>> the user will typically not see them as part of the engine features;
>>>>>>>>> they only see processing guarantees, and while those are technically
>>>>>>>>> correct as far as individual operators are concerned, the names give
>>>>>>>>> a different idea.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Tue, Feb 2, 2016 at 2:53 PM, Timothy Farkas <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I think I understand the ambiguity you are trying to clear up, Pramod.
>>>>>>>>>> Perhaps it can be disambiguated by distinguishing between Processing
>>>>>>>>>> Guarantees and Output Guarantees when explaining to people.
>>>>>>>>>> Processing Guarantees apply to the way tuples are transmitted between
>>>>>>>>>> operators. Output Guarantees apply to the way output operators write
>>>>>>>>>> tuples to a Data Sink.
>>>>>>>>>>
>>>>>>>>>> This way we can describe each term intuitively in each context:
>>>>>>>>>>
>>>>>>>>>> At Most Once: A tuple can be dropped or transmitted (written) only once.
>>>>>>>>>> At Least Once: A tuple can be transmitted (written) one or more times.
>>>>>>>>>> Exactly Once: A tuple is transmitted (written) only once.
>>>>>>>>>>
>>>>>>>>>> Then we could provide a table with the strongest Output Guarantee that
>>>>>>>>>> is possible for each Processing Guarantee.
>>>>>>>>>>
>>>>>>>>>> Processing    | Strongest Output Guarantee
>>>>>>>>>> ----------------------------------------------
>>>>>>>>>> At Most Once  | At Most Once
>>>>>>>>>> At Least Once | Exactly Once
>>>>>>>>>> Exactly Once  | Exactly Once
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Tim
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 2, 2016 at 2:25 PM, Sandesh Hegde <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I agree with Tim. Instead of new terminology, a better explanation
>>>>>>>>>>> of the existing terms is more useful.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 2, 2016 at 2:23 PM Pramod Immaneni <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The idea is to disambiguate without using at least once, since
>>>>>>>>>>>> exactly once output can still be achieved with those. Any other
>>>>>>>>>>>> names are fine, those were just suggestions.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 2, 2016 at 2:10 PM, Timothy Farkas <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The new names don't make as much sense to me as the original
>>>>>>>>>>>>> names. The concepts require some thought to understand, and that
>>>>>>>>>>>>> won't necessarily be made easier with a name change. I think a
>>>>>>>>>>>>> better way to attack misunderstandings is to clearly explain what
>>>>>>>>>>>>> a window, operator, input operator, output operator, tuple,
>>>>>>>>>>>>> checkpoint, and DAG are, with really clean and simple illustrations
>>>>>>>>>>>>> of the concepts. Then we can explain more involved concepts like
>>>>>>>>>>>>> At Least Once, At Most Once, and Exactly Once with well-thought-out
>>>>>>>>>>>>> illustrations. Without a clear explanation of the basic vocabulary,
>>>>>>>>>>>>> and without pictures, it is difficult to get even technical people
>>>>>>>>>>>>> to understand these concepts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Tim
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 2, 2016 at 9:13 AM, Pramod Immaneni <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Today we support three different processing modes for operators,
>>>>>>>>>>>>>> "at least once", "at most once" and "exactly once", which
>>>>>>>>>>>>>> determine tuple processing and recovery behavior when an operator
>>>>>>>>>>>>>> recovers from failure. The default is at least once, where the
>>>>>>>>>>>>>> tuples are replayed from the recovered checkpoint.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At least once works well for most applications. Typically
>>>>>>>>>>>>>> applications persist the final output of processing through the
>>>>>>>>>>>>>> DAG into various outputs like key value stores, databases or even
>>>>>>>>>>>>>> HDFS files. In many of these cases various strategies can be
>>>>>>>>>>>>>> employed to save the data "exactly once" in the output, such as
>>>>>>>>>>>>>> transactions, rewinding, meta data storage, idempotent operations
>>>>>>>>>>>>>> etc. Furthermore, the exactly once processing mode, which is a
>>>>>>>>>>>>>> checkpoint performed every window, is rarely used. All this leads
>>>>>>>>>>>>>> to confusion, especially for somebody new, and also makes it
>>>>>>>>>>>>>> difficult to explain these names to a less technical audience in
>>>>>>>>>>>>>> meetups and public forums.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What I am proposing is only a name change which will make this
>>>>>>>>>>>>>> more intuitive to understand. Something simple like "repeat" for
>>>>>>>>>>>>>> "at least once", "latest" for "at most once" and "repeat latest"
>>>>>>>>>>>>>> for "exactly once" can do the trick.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
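
To make the "at-least-once processing plus idempotent output gives exactly-once results" point from the thread concrete, here is a minimal sketch in plain Java. It does not use the Apex API: the class names (IdempotentSinkSketch, FakeStore, CountingSink), the runWindow helper and the in-memory store are all hypothetical, and only the beginWindow/endWindow callback names are borrowed from the vocabulary used above. It also assumes that a replayed window carries the same tuples as the original one, and that the store can record the window id together with the data (for example in one transaction).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of an idempotent output operator; not the Apex operator API. */
public class IdempotentSinkSketch {

    /** Stand-in for an external store that keeps the last written window id next to the data. */
    static class FakeStore {
        final Map<String, Long> counts = new HashMap<>();
        long lastCommittedWindow = -1;

        /** Applies one window's updates and its window id together, as a transaction would. */
        void commitWindow(long windowId, Map<String, Long> deltas) {
            for (Map.Entry<String, Long> e : deltas.entrySet()) {
                counts.merge(e.getKey(), e.getValue(), Long::sum);
            }
            lastCommittedWindow = windowId;
        }
    }

    /** Stand-in for an output operator driven by per-window callbacks. */
    static class CountingSink {
        private final FakeStore store;
        private final Map<String, Long> pending = new HashMap<>();
        private long currentWindow;

        CountingSink(FakeStore store) {
            this.store = store;
        }

        void beginWindow(long windowId) {
            currentWindow = windowId;
            pending.clear();
        }

        void process(String key) {
            pending.merge(key, 1L, Long::sum);
        }

        void endWindow() {
            // A replayed window that the store already holds is skipped, so replay has no effect.
            if (currentWindow <= store.lastCommittedWindow) {
                return;
            }
            store.commitWindow(currentWindow, pending);
        }
    }

    /** Feeds one complete window of tuples through the sink. */
    static void runWindow(CountingSink sink, long windowId, String... tuples) {
        sink.beginWindow(windowId);
        for (String t : tuples) {
            sink.process(t);
        }
        sink.endWindow();
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        CountingSink sink = new CountingSink(store);
        runWindow(sink, 1, "a", "b");
        runWindow(sink, 2, "a");

        // Simulated failure: a fresh sink is restored and window 2 is replayed (at least once).
        CountingSink recovered = new CountingSink(store);
        runWindow(recovered, 2, "a");
        runWindow(recovered, 3, "b");

        System.out.println("a=" + store.counts.get("a") + ", b=" + store.counts.get("b"));
    }
}
```

In this run window 2 is processed twice, yet the store ends with a count of 2 for each key because the replayed window is recognised and skipped. The engine here still only replays; the exactly-once effect comes from the sink, which is the distinction the thread is drawing.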
