Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/4956#discussion_r26087878
--- Diff: docs/streaming-programming-guide.md ---
@@ -1868,13 +1961,38 @@ Furthermore, there are two kinds of failures that we should be concerned about:
With this basic knowledge, let us understand the fault-tolerance semantics of Spark Streaming.
-## Semantics with files as input source
+## Definitions
+{:.no_toc}
+The semantics of streaming systems are often captured in terms of how many times each record can be processed by the system. There are three types of guarantees that a system can provide under all possible operating conditions (despite failures, etc.):
+
+1. *At most once*: Each record will be either processed once or not processed at all.
+2. *At least once*: Each record will be processed one or more times. This is stronger than *at most once* as it ensures that no data will be lost, but there may be duplicates.
+3. *Exactly once*: Each record will be processed exactly once - no data will be lost and no data will be processed multiple times. This is obviously the strongest guarantee of the three.
+
+## Basic Semantics
+{:.no_toc}
+In any stream processing system, broadly speaking, there are three steps in processing the data:
+1. *Receiving the data*: The data is received from sources using Receivers or otherwise.
+1. *Transforming the data*: The received data is transformed using DStream and RDD transformations.
+1. *Pushing out the data*: The final transformed data is pushed out to external systems like file systems, databases, dashboards, etc.
+
+If a streaming application has to achieve end-to-end exactly-once guarantees, then each step has to provide an exactly-once guarantee. That is, each record must be received exactly once, transformed exactly once, and pushed to downstream systems exactly once. In case of Spark Streaming, lets understand the scope of Spark Streaming.
--- End diff --
lets -> let's.
Also: "In case of Spark Streaming, let's understand the scope of Spark
Streaming" sounds a little ["By installing Java, you will be able to experience
the power of Java"](http://www.joelonsoftware.com/items/2009/01/12.html) to me.
I guess that this sentence is trying to say that we need to clearly define the boundary between Spark Streaming and external systems in order to talk meaningfully about guarantees (e.g. it can't guarantee transactional behavior of downstream systems) - i.e. let's be clear about the scope in which these guarantees hold.