Github user tdas commented on a diff in the pull request:
https://github.com/apache/spark/pull/4956#discussion_r26089171
--- Diff: docs/streaming-programming-guide.md ---
@@ -1868,13 +1961,38 @@ Furthermore, there are two kinds of failures that
we should be concerned about:
With this basic knowledge, let us understand the fault-tolerance semantics
of Spark Streaming.
-## Semantics with files as input source
+## Definitions
+{:.no_toc}
+The semantics of streaming systems are often captured in terms of how many
times each record can be processed by the system. There are three types of
guarantees that a system can provide under all possible operating conditions
(despite failures, etc.):
+
+1. *At most once*: Each record will be either processed once or not
processed at all.
+2. *At least once*: Each record will be processed one or more times. This
is stronger than *at most once* as it ensures that no data will be lost, but
there may be duplicates.
+3. *Exactly once*: Each record will be processed exactly once - no data
will be lost and no data will be processed multiple times. This is obviously
the strongest guarantee of the three.
+
+## Basic Semantics
+{:.no_toc}
+In any stream processing system, broadly speaking, there are three steps
in processing the data.
+1. *Receiving the data*: The data is received from sources using Receivers
or otherwise.
+1. *Transforming the data*: The received data is transformed using
DStream and RDD transformations.
+1. *Pushing out the data*: The final transformed data is pushed out to
external systems like file systems, databases, dashboards, etc.
+
+If a streaming application has to achieve end-to-end exactly-once
guarantees, then each step has to provide an exactly-once guarantee. That is,
each record must be received exactly once, transformed exactly once, and pushed
to downstream systems exactly once. Let us understand these steps in the
context of Spark Streaming.
+
+1. *Receiving the data*: Different input sources provide different
guarantees. This is discussed in detail in the next subsection.
+1. *Transforming the data*: All data that has been received will be
processed _exactly once_, thanks to the guarantees that RDDs provide. Even if
there are failures, as long as the received input data is accessible, the final
transformed RDDs will always have the same contents.
+1. *Pushing out the data*: Output operations by default ensure _at-least
once_ semantics. Whether they achieve _exactly-once_ semantics depends on the
type of output operation (idempotent or not) and the semantics of the
downstream system (supports transactions or not). Users can implement their
own transaction mechanisms to achieve _exactly-once_ semantics. This is
discussed in more detail later in this section.
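+The idea behind the last point can be sketched in plain code. The following
is a minimal, self-contained illustration (not Spark API; the `IdempotentSink`
class and its method names are hypothetical) of how a sink keyed by a unique
batch and partition identifier suppresses duplicate writes, so that
at-least-once delivery still yields exactly-once results:

```python
# Hypothetical sketch: an at-least-once system may replay a batch after a
# failure. If the sink deduplicates on a unique (batch_id, partition_id)
# key, the replay produces no duplicate output - effectively exactly-once.
class IdempotentSink:
    def __init__(self):
        # (batch_id, partition_id) -> committed records
        self._committed = {}

    def write(self, batch_id, partition_id, records):
        key = (batch_id, partition_id)
        if key in self._committed:
            # Replayed batch: already committed, skip the duplicate write.
            return False
        self._committed[key] = list(records)
        return True

    def total_records(self):
        return sum(len(r) for r in self._committed.values())

sink = IdempotentSink()
sink.write(batch_id=1, partition_id=0, records=["a", "b"])
# Simulate a failure-triggered replay of the same batch:
sink.write(batch_id=1, partition_id=0, records=["a", "b"])
assert sink.total_records() == 2  # duplicates were suppressed
```

A transactional downstream system achieves the same effect by committing the
records and the batch identifier atomically in one transaction.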
--- End diff --
Yeah. I kept this to-the-point. I am guessing people who are reading this
section already understand those semantics.