I am in the process of writing a technical blog on this topic. Thanks
> On Feb 3, 2016, at 8:14 AM, Amol Kekre <[email protected]> wrote:
>
> Agreed on sticking to standard terminology and explaining details. A deep
> technical blog plus a section on this topic in Apex doc would work.
>
> Thks,
> Amol
>
> On Tue, Feb 2, 2016 at 10:51 PM, Thomas Weise <[email protected]>
> wrote:
>
>> We should stick with standard terminology but make sure the differences are
>> well explained. That's necessary because other platforms use the same words
>> with different meaning, compare Storm, Spark Streaming and Flink.
>>
>> Take "exactly once" as example. Elsewhere you will find it claimed when it
>> really is "at least once". Events are replayed and computation repeated.
>> When all operations in the overall system are idempotent, then it is
>> possible to avoid effects such as double counting, duplicate web service
>> calls or rows in the database etc. Hence, the engine cannot claim to
>> support "exactly once", this is only valid when operators used in the
>> application collectively support it.
>>
>> In Apex, the engine provides the hooks (endWindow, committed) to achieve
>> idempotency in operators that have an effect on external systems. There are
>> several implementations of operators that can be used with at-least-once
>> processing mode that will deliver "exactly-once" for the application when
>> all operations in the DAG are idempotent.
>>
>> On Tue, Feb 2, 2016 at 10:26 PM, Shubham Pathak <[email protected]>
>> wrote:
>>
>>> +1 for adding detailed explanation about the concepts in tutorials.
>>>
>>> On Wed, Feb 3, 2016 at 11:30 AM, Chinmay Kolhatkar <[email protected]>
>>> wrote:
>>>
>>>> +1 for Vlad's suggestion. Searching for keywords like "at least once",
>>>> "at most once" and "exactly once" shows that these terms are widely
>>>> used wherever semantics are defined for tuple processing.
>>>> Adding example applications for each of them would help explain the
>>>> terminology in the Apex context.
>>>>
>>>> On Wed, Feb 3, 2016 at 8:52 AM, Chanchal Singh <[email protected]>
>>>> wrote:
>>>>
>>>>> I do agree with Vlad. It will be good to have a good explanation with
>>>>> examples for the existing names, as it will not create confusion for those
>>>>> who already know them or for those who are beginners.
>>>>>
>>>>> On Wed, Feb 3, 2016 at 8:38 AM, Amol Kekre <[email protected]> wrote:
>>>>>
>>>>>> I agree with Vlad too.
>>>>>>
>>>>>> Thks
>>>>>> Amol
>>>>>>
>>>>>> On Tue, Feb 2, 2016 at 3:33 PM, Munagala Ramanath <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I agree with Vlad: these names are so deeply embedded in the community
>>>>>>> that changing them is likely to create more problems than it solves.
>>>>>>>
>>>>>>> Ram
>>>>>>>
>>>>>>> On Tue, Feb 2, 2016 at 3:29 PM, Vlad Rozov <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I vote to keep original names and educate/explain their meaning to
>>>>>>>> non technical audience as delivery guarantee is not specific to Apex,
>>>>>>>> but has common meaning for all streaming platforms.
>>>>>>>>
>>>>>>>> Vlad
>>>>>>>>
>>>>>>>> On 2/2/16 15:17, Timothy Farkas wrote:
>>>>>>>>
>>>>>>>>> Could we provide Processing and Output Centric Aliases for the
>>>>>>>>> ProcessingModes?
>>>>>>>>>
>>>>>>>>> ProcessingMode.AT_MOST_ONCE_OUTPUT = ProcessingMode.AT_MOST_ONCE
>>>>>>>>> ProcessingMode.EXACTLY_ONCE_OUTPUT = ProcessingMode.AT_LEAST_ONCE
>>>>>>>>>
>>>>>>>>> ProcessingMode.AT_MOST_ONCE_PROCESSING = ProcessingMode.AT_MOST_ONCE
>>>>>>>>> ProcessingMode.AT_LEAST_ONCE_PROCESSING = ProcessingMode.AT_LEAST_ONCE
>>>>>>>>> ProcessingMode.EXACTLY_ONCE_PROCESSING = ProcessingMode.EXACTLY_ONCE
>>>>>>>>>
>>>>>>>>> Tim
>>>>>>>>>
>>>>>>>>> On Tue, Feb 2, 2016 at 3:00 PM, Pramod Immaneni <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Well output guarantees are managed by the operators themselves so the
>>>>>>>>>> user will typically not see that as part of the engine features, they
>>>>>>>>>> only see processing guarantees and while they are technically correct
>>>>>>>>>> as far as individual operators are concerned the names give a
>>>>>>>>>> different idea.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 2, 2016 at 2:53 PM, Timothy Farkas <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think I understand the ambiguity you are trying to clear up Pramod.
>>>>>>>>>>> Perhaps it can be disambiguated by distinguishing between Processing
>>>>>>>>>>> Guarantees and Output Guarantees, when explaining to people.
>>>>>>>>>>> Processing Guarantees apply to the way tuples are transmitted between
>>>>>>>>>>> operators. Output Guarantees apply to the way output operators write
>>>>>>>>>>> tuples to a Data Sink.
>>>>>>>>>>>
>>>>>>>>>>> This way we can describe each term intuitively in each context:
>>>>>>>>>>>
>>>>>>>>>>> At Most Once: A tuple can be dropped or transmitted (written) only once.
>>>>>>>>>>> At Least Once: A tuple can be transmitted (written) one or more times.
>>>>>>>>>>> Exactly Once: A tuple is transmitted (written) only once.
>>>>>>>>>>>
>>>>>>>>>>> Then we could provide a table with the strongest Output Guarantee
>>>>>>>>>>> that is possible for each Processing Guarantee.
>>>>>>>>>>>
>>>>>>>>>>> Processing     | Strongest Output Guarantee
>>>>>>>>>>> ----------------------------------------------
>>>>>>>>>>> At Most Once   | At Most Once
>>>>>>>>>>> At Least Once  | Exactly Once
>>>>>>>>>>> Exactly Once   | Exactly Once
>>>>>>>>>>>
>>>>>>>>>>> Thoughts?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Tim
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 2, 2016 at 2:25 PM, Sandesh Hegde <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I agree with Tim. Instead of new terminology, a better explanation
>>>>>>>>>>>> of the existing ones is more useful.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 2, 2016 at 2:23 PM Pramod Immaneni <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The idea is to disambiguate without using at least once since
>>>>>>>>>>>>> exactly once output can still be achieved with those. Any other
>>>>>>>>>>>>> names are fine, those were just suggestions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 2, 2016 at 2:10 PM, Timothy Farkas <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The new names don't make as much sense to me as the original
>>>>>>>>>>>>>> names. The concepts require some thought to understand, and it
>>>>>>>>>>>>>> won't necessarily be made easier with a name change.
>>>>>>>>>>>>>> I think a better way to attack misunderstandings is to clearly
>>>>>>>>>>>>>> explain what a window, operator, input operator, output operator,
>>>>>>>>>>>>>> tuple, checkpoint, and DAG is with really clean and simple
>>>>>>>>>>>>>> illustrations of the concepts. Then we can explain more involved
>>>>>>>>>>>>>> concepts like At Least Once, At Most Once, and Exactly Once with
>>>>>>>>>>>>>> well thought illustrations. Without a clear explanation of the
>>>>>>>>>>>>>> basic vocabulary, and without pictures, it is difficult to get
>>>>>>>>>>>>>> even technical people to understand these concepts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Tim
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 2, 2016 at 9:13 AM, Pramod Immaneni <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Today we support three different processing modes for operators,
>>>>>>>>>>>>>>> "at least once", "at most once" and "exactly once" which
>>>>>>>>>>>>>>> determine tuple processing and recovery behavior when there is
>>>>>>>>>>>>>>> operator recovery from failure. The default being at least once
>>>>>>>>>>>>>>> where the tuples are replayed from the recovered checkpoint.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> At least once works well for most applications. Typically
>>>>>>>>>>>>>>> applications persist the final output of processing through the
>>>>>>>>>>>>>>> DAG into various outputs like key value stores, databases or even
>>>>>>>>>>>>>>> HDFS files. In many of these cases various strategies can be
>>>>>>>>>>>>>>> employed to save the data "exactly once" in the output, such as
>>>>>>>>>>>>>>> transactions, rewinding, meta data storage, idempotent operations
>>>>>>>>>>>>>>> etc. Furthermore the exactly once processing mode, which is a
>>>>>>>>>>>>>>> checkpoint performed every window is rarely used. All this leads
>>>>>>>>>>>>>>> to confusion especially to somebody new and also makes it
>>>>>>>>>>>>>>> difficult to explain these names to less technical audience in
>>>>>>>>>>>>>>> meetups and public forums.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What I am proposing is only a name change which will make this
>>>>>>>>>>>>>>> more intuitive to understand. Something simple like "repeat" for
>>>>>>>>>>>>>>> "at least once", "latest" for "at most once" and "repeat latest"
>>>>>>>>>>>>>>> for "exactly once" can do the trick.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
