Today we support three different processing modes for operators, "at least
once", "at most once" and "exactly once" which determine tuple processing
and recovery behavior when there is operator recovery from failure. The
default being at least once where the tuples are replayed from the
recovered checkpoint.

At least once works well for most applications. Typically applications
persist the final output of processing through the DAG into various outputs
like key value stores, databases or even HDFS files. In many of these cases
various strategies can be employed to save the data "exactly once" in the
output, such as transactions, rewinding, meta data storage, idempotent
operations etc. Furthermore the exactly once processing mode, which is a
checkpoint performed every window is rarely used. All this leads to
confusion especially to somebody new and also makes it difficult to explain
these names to less technical audience in meetups and public forums.

What I am proposing is only a name change which will make this more
intuitive to understand. Something simple like "repeat" for "at least
once", "latest" for "at most once" and "repeat latest" for "exactly once"
can do the trick.

Thanks

Reply via email to