Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/4956#discussion_r26087345
--- Diff: docs/streaming-kafka-integration.md ---
@@ -2,58 +2,154 @@
layout: global
title: Spark Streaming + Kafka Integration Guide
---
-[Apache Kafka](http://kafka.apache.org/) is publish-subscribe messaging
rethought as a distributed, partitioned, replicated commit log service. Here
we explain how to configure Spark Streaming to receive data from Kafka.
+[Apache Kafka](http://kafka.apache.org/) is publish-subscribe messaging
rethought as a distributed, partitioned, replicated commit log service. Here
we explain how to configure Spark Streaming to receive data from Kafka. There
are two approaches to this - the old approach using Receivers and Kafka's
high-level API, and a new experimental approach (introduced in Spark 1.3)
without using Receivers. They have different programming models, performance
characteristics, and semantics guarantees, so read on for more details.
-1. **Linking:** In your SBT/Maven project definition, link your streaming
application against the following artifact (see [Linking
section](streaming-programming-guide.html#linking) in the main programming
guide for further information).
+## Approach 1: Receiver-based Approach
+This approach uses a Receiver to receive the data. The Received is
implemented using the Kafka high-level consumer API. As with all receivers, the
data received from Kafka through a Receiver is stored in Spark executors, and
then jobs launched by Spark Streaming processes the data.
+
+However, under default configuration, this approach can loose data under
failures (see [receiver
reliability](streaming-programming-guide.html#receiver-reliability). To ensure
zero-data loss, you have to additionally enable Write Ahead Logs in Spark
Streaming. To ensure zero-data loss, enable the Write Ahead Logs (introduced in
Spark 1.2) . This synchronously saves all the received Kafka data into write
ahead logs on a distributed file system (e.g HDFS), so that all the data can be
recovered on failure. Ssee [Deploying
section](streaming-programming-guide.html#deploying-applications) in the
streaming programming guide for more details on Write Ahead Logs.
--- End diff --
"loose" -> "lose".
"zero-data" probably shouldn't be hyphenated. There's an extra space before
the period at the end of the this sentence, too.
Typo: "Ssee".