Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4956#discussion_r26089036
  
    --- Diff: docs/streaming-programming-guide.md ---
    @@ -1933,17 +2057,24 @@ The following table summarizes the semantics under failures:
       </tr>
     </table>
     
    +### With Kafka Direct API
    +{:.no_toc}
    +In Spark 1.3, we have introduced a new Kafka Direct API, which can ensure that all the Kafka data is received by Spark Streaming exactly once. Along with this, if you implement exactly-once output operation, you can achieve end-to-end exactly-once guarantees. This approach (experimental as of Spark 1.3) is further discussed in the [Kafka Integration Guide](stream-kafka-integration.html).
    +
     ## Semantics of output operations
     {:.no_toc}
    -Since all data is modeled as RDDs with their lineage of deterministic operations, any recomputation
    - always leads to the same result. As a result, all DStream transformations are guaranteed to have
    - _exactly-once_ semantics. That is, the final transformed result will be same even if there were
    - was a worker node failure. However, output operations (like `foreachRDD`) have _at-least once_
    - semantics, that is, the transformed data may get written to an external entity more than once in
    - the event of a worker failure. While this is acceptable for saving to HDFS using the
    - `saveAs***Files` operations (as the file will simply get over-written by the same data),
    - additional transactions-like mechanisms may be necessary to achieve exactly-once semantics
    - for output operations.
    +Output operations (like `foreachRDD`) have _at-least once_ semantics, that is,
    +the transformed data may get written to an external entity more than once in
    +the event of a worker failure. While this is acceptable for saving to file systems using the
    +`saveAs***Files` operations (as the file will simply get over-written with the same data),
    --- End diff --
    
    Done.

