Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/4956#discussion_r26087878
--- Diff: docs/streaming-programming-guide.md ---
@@ -1868,13 +1961,38 @@ Furthermore, there are two kinds of failures that we should be concerned about:
With this basic knowledge, let us understand the fault-tolerance semantics of Spark Streaming.
-## Semantics with files as input source
+## Definitions
+{:.no_toc}
+The semantics of streaming systems are often captured in terms of how many times each record can be processed by the system. There are three types of guarantees that a system can provide under all possible operating conditions (despite failures, etc.):
+
+1. *At most once*: Each record will be either processed once or not processed at all.
+2. *At least once*: Each record will be processed one or more times. This is stronger than *at most once* as it ensures that no data will be lost, but there may be duplicates.
+3. *Exactly once*: Each record will be processed exactly once - no data will be lost and no data will be processed multiple times. This is obviously the strongest guarantee of the three.
+
+## Basic Semantics
+{:.no_toc}
+In any stream processing system, broadly speaking, there are three steps in processing the data:
+1. *Receiving the data*: The data is received from sources using Receivers or otherwise.
+1. *Transforming the data*: The received data is transformed using DStream and RDD transformations.
+1. *Pushing out the data*: The final transformed data is pushed out to external systems like file systems, databases, dashboards, etc.
+
+If a streaming application has to achieve end-to-end exactly-once guarantees, then each step has to provide an exactly-once guarantee. That is, each record must be received exactly once, transformed exactly once, and pushed to downstream systems exactly once. In case of Spark Streaming, lets understand the scope of Spark Streaming.
--- End diff --
lets -> let's.
Also: "In case of Spark Streaming, let's understand the scope of Spark
Streaming" sounds a little ["By installing Java, you will be able to experience
the power of Java"](http://www.joelonsoftware.com/items/2009/01/12.html) to me.
I guess that this sentence is trying to say that we need to clearly define the boundary between Spark Streaming and external systems in order to talk meaningfully about guarantees (e.g. it can't guarantee transactional behavior of downstream systems) - i.e. let's be clear about the scope in which these guarantees hold.