Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4956#discussion_r26088026
  
    --- Diff: docs/streaming-programming-guide.md ---
    @@ -1868,13 +1961,38 @@ Furthermore, there are two kinds of failures that 
we should be concerned about:
     
     With this basic knowledge, let us understand the fault-tolerance semantics 
of Spark Streaming.
     
    -## Semantics with files as input source
    +## Definitions
    +{:.no_toc}
    +The semantics of streaming systems are often captured in terms of how many 
times each record can be processed by the system. There are three types of 
guarantees that a system can provide under all possible operating conditions 
(despite failures, etc.)
    +
    +1. *At most once*: Each record will be either processed once or not 
processed at all.
    +2. *At least once*: Each record will be processed one or more times. This 
is stronger than *at-most once* as it ensure that no data will be lost. But 
there may be duplicates.
    +3. *Exactly once*: Each record will be processed exactly once - no data 
will be lost and no data will be processed multiple times. This is obviously 
the strongest guarantee of the three.
    +
    +## Basic Semantics
    +{:.no_toc}
    +In any stream processing system, broadly speaking, there are three steps 
in processing the data.
    +1. *Receiving the data*: The data is received from sources using Receivers 
or otherwise.
    +1. *Transforming the data*: The data received data is transformed using 
DStream and RDD transformations.
    +1. *Pushing out the data*: The final transformed data is pushed out to 
external systems like file systems, databases, dashboards, etc.
    +
    +If a streaming application has to achieve end-to-end exactly-once 
guarantees, then each step has to provide exactly-once guarantee. That is, each 
record must be received exactly once, transformed exactly once, and pushed to 
downstream systems exactly once. In case of Spark Streaming, lets understand 
the scope of Spark Streaming.
    +
    +1. *Receiving the data*: Different input sources provided different 
guarantees. This is discussed in detail in the next subsection.
    +1. *Transforming the data*: All data that has been received will be 
processed _exactly once_, thanks to the guarantees that RDDs provide. Even if 
there are failures, as long as the received input data is accessible, the final 
transformed RDDs will always have the same contents.
    +1. *Pushing out the data*: Output operations by default ensure _at-least 
once_ semantics because it depends on the type of output operation (idempotent, 
or not) and the semantics of the downstream system (supports transactions or 
not). But users can implement their own transaction mechanisms to achieve 
_exactly-once_ semantics. This is discussed in more details later in the 
section.
    --- End diff --
    
    I guess this technically extends to ANY operation that does external 
side-effects, although I guess we don't need to spell that out super-explicitly 
because users aren't really supposed to be performing side-effects in their RDD 
transformations (e.g. sending data from a `mapPartitions` call).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to