Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/4956#discussion_r26087345
--- Diff: docs/streaming-kafka-integration.md ---
@@ -2,58 +2,154 @@
layout: global
title: Spark Streaming + Kafka Integration Guide
---
-[Apache Kafka](http://kafka.apache.org/) is publish-subscribe messaging
rethought as a distributed, partitioned, replicated commit log service. Here
we explain how to configure Spark Streaming to receive data from Kafka.
+[Apache Kafka](http://kafka.apache.org/) is publish-subscribe messaging
rethought as a distributed, partitioned, replicated commit log service. Here
we explain how to configure Spark Streaming to receive data from Kafka. There
are two approaches to this - the old approach using Receivers and Kafka's
high-level API, and a new experimental approach (introduced in Spark 1.3)
without using Receivers. They have different programming models, performance
characteristics, and semantics guarantees, so read on for more details.
-1. **Linking:** In your SBT/Maven project definition, link your streaming
application against the following artifact (see [Linking
section](streaming-programming-guide.html#linking) in the main programming
guide for further information).
+## Approach 1: Receiver-based Approach
+This approach uses a Receiver to receive the data. The Received is
implemented using the Kafka high-level consumer API. As with all receivers, the
data received from Kafka through a Receiver is stored in Spark executors, and
then jobs launched by Spark Streaming processes the data.
+
+However, under default configuration, this approach can loose data under
failures (see [receiver
reliability](streaming-programming-guide.html#receiver-reliability). To ensure
zero-data loss, you have to additionally enable Write Ahead Logs in Spark
Streaming. To ensure zero-data loss, enable the Write Ahead Logs (introduced in
Spark 1.2) . This synchronously saves all the received Kafka data into write
ahead logs on a distributed file system (e.g HDFS), so that all the data can be
recovered on failure. Ssee [Deploying
section](streaming-programming-guide.html#deploying-applications) in the
streaming programming guide for more details on Write Ahead Logs.
--- End diff --
"loose" -> "lose".
"zero-data" probably shouldn't be hyphenated. There's an extra space before
the period at the end of the this sentence, too.
Typo: "Ssee".