Could we provide Processing and Output Centric Aliases for the ProcessingModes?

ProcessingMode.AT_MOST_ONCE_OUTPUT = ProcessingMode.AT_MOST_ONCE
ProcessingMode.EXACTLY_ONCE_OUTPUT = ProcessingMode.AT_LEAST_ONCE

ProcessingMode.AT_MOST_ONCE_PROCESSING = ProcessingMode.AT_MOST_ONCE
ProcessingMode.AT_LEAST_ONCE_PROCESSING = ProcessingMode.AT_LEAST_ONCE
ProcessingMode.EXACTLY_ONCE_PROCESSING = ProcessingMode.EXACTLY_ONCE

Tim

On Tue, Feb 2, 2016 at 3:00 PM, Pramod Immaneni <[email protected]> wrote:

> Well output guarantees are managed by the operators themselves so the user
> will typically not see that as part of the engine features, they only see
> processing guarantees and while they are technically correct as far as
> individual operators are concerned the names give a different idea.
>
> Thanks
>
> On Tue, Feb 2, 2016 at 2:53 PM, Timothy Farkas <[email protected]> wrote:
>
> > I think I understand the ambiguity you are trying to clear up Pramod.
> > Perhaps it can be disambiguated by distinguishing between Processing
> > Guarantees and Output Guarantees, when explaining to people. Processing
> > Guarantees apply to the way tuples are transmitted between operators.
> > Output Guarantees apply to the way output operators write tuples to a
> > Data Sink.
> >
> > This way we can describe each term intuitively in each context:
> >
> > At Most Once: A tuple can be dropped or transmitted (written) only once.
> > At Least Once: A tuple can be transmitted (written) one or more times.
> > Exactly Once: A tuple is transmitted (written) only once.
> >
> > Then we could provide a table with the strongest Output Guarantee that is
> > possible for each Processing Guarantee.
> >
> > Processing    | Strongest Output Guarantee
> > ----------------------------------------------
> > At Most Once  | At Most Once
> > At Least Once | Exactly Once
> > Exactly Once  | Exactly Once
> >
> > Thoughts?
> >
> > Thanks,
> > Tim
> >
> > On Tue, Feb 2, 2016 at 2:25 PM, Sandesh Hegde <[email protected]> wrote:
> >
> > > I agree with Tim.
> > > Instead of new terminologies, a better explanation of the existing
> > > ones would be more useful.
> > >
> > > On Tue, Feb 2, 2016 at 2:23 PM Pramod Immaneni <[email protected]> wrote:
> > >
> > > > The idea is to disambiguate without using at least once since exactly
> > > > once output can still be achieved with those. Any other names are
> > > > fine, those were just suggestions.
> > > >
> > > > On Tue, Feb 2, 2016 at 2:10 PM, Timothy Farkas <[email protected]> wrote:
> > > >
> > > > > The new names don't make as much sense to me as the original names.
> > > > > The concepts require some thought to understand, and it won't
> > > > > necessarily be made easier with a name change. I think a better way
> > > > > to attack misunderstandings is to clearly explain what a window,
> > > > > operator, input operator, output operator, tuple, checkpoint, and
> > > > > DAG are with really clean and simple illustrations of the concepts.
> > > > > Then we can explain more involved concepts like At Least Once, At
> > > > > Most Once, and Exactly Once with well-thought-out illustrations.
> > > > > Without a clear explanation of the basic vocabulary, and without
> > > > > pictures, it is difficult to get even technical people to
> > > > > understand these concepts.
> > > > >
> > > > > Thanks,
> > > > > Tim
> > > > >
> > > > > On Tue, Feb 2, 2016 at 9:13 AM, Pramod Immaneni <[email protected]> wrote:
> > > > >
> > > > > > Today we support three different processing modes for operators,
> > > > > > "at least once", "at most once" and "exactly once", which
> > > > > > determine tuple processing and recovery behavior when an operator
> > > > > > recovers from failure. The default is at least once, where the
> > > > > > tuples are replayed from the recovered checkpoint.
> > > > > >
> > > > > > At least once works well for most applications.
> > > > > > Typically applications persist the final output of processing
> > > > > > through the DAG into various outputs like key value stores,
> > > > > > databases or even HDFS files. In many of these cases various
> > > > > > strategies can be employed to save the data "exactly once" in the
> > > > > > output, such as transactions, rewinding, metadata storage,
> > > > > > idempotent operations etc. Furthermore, the exactly once
> > > > > > processing mode, which is a checkpoint performed every window, is
> > > > > > rarely used. All this leads to confusion, especially for somebody
> > > > > > new, and also makes it difficult to explain these names to a less
> > > > > > technical audience in meetups and public forums.
> > > > > >
> > > > > > What I am proposing is only a name change which will make this
> > > > > > more intuitive to understand. Something simple like "repeat" for
> > > > > > "at least once", "latest" for "at most once" and "repeat latest"
> > > > > > for "exactly once" can do the trick.
> > > > > >
> > > > > > Thanks
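
For reference, the aliases proposed at the top of this thread and the Processing / Strongest Output Guarantee table could be sketched in Java roughly as follows. This is only an illustration: the enum below is a hypothetical stand-in rather than the engine's actual ProcessingMode type, and the method strongestOutputGuarantee() and the class ProcessingModeAliasDemo are names made up for the example. Since Java enum constants cannot be aliased directly, the aliases are assumed to be static final fields that point at the existing constants.

```java
// Hypothetical stand-in for the engine's processing-mode enum (not the real API).
enum ProcessingMode {
    AT_MOST_ONCE, AT_LEAST_ONCE, EXACTLY_ONCE;

    // Processing-centric aliases: the same constants under explicit names.
    static final ProcessingMode AT_MOST_ONCE_PROCESSING = AT_MOST_ONCE;
    static final ProcessingMode AT_LEAST_ONCE_PROCESSING = AT_LEAST_ONCE;
    static final ProcessingMode EXACTLY_ONCE_PROCESSING = EXACTLY_ONCE;

    // Output-centric aliases: exactly-once output maps to at-least-once
    // processing, as in the alias list proposed in the thread.
    static final ProcessingMode AT_MOST_ONCE_OUTPUT = AT_MOST_ONCE;
    static final ProcessingMode EXACTLY_ONCE_OUTPUT = AT_LEAST_ONCE;

    // Strongest output guarantee for this processing guarantee,
    // following the table in the thread (illustrative helper).
    String strongestOutputGuarantee() {
        return this == AT_MOST_ONCE ? "At Most Once" : "Exactly Once";
    }
}

class ProcessingModeAliasDemo {
    public static void main(String[] args) {
        // values() lists only the three real constants; aliases are plain fields.
        for (ProcessingMode mode : ProcessingMode.values()) {
            System.out.println(mode + " -> " + mode.strongestOutputGuarantee());
        }
        // An alias is the very same object as the constant it names.
        System.out.println(
            ProcessingMode.EXACTLY_ONCE_OUTPUT == ProcessingMode.AT_LEAST_ONCE);
    }
}
```

Because the aliases are ordinary fields, existing code that switches over the three original constants would keep working unchanged; the aliases only add more descriptive names.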
