Github user tdas commented on a diff in the pull request:
https://github.com/apache/spark/pull/6863#discussion_r32788358
--- Diff: docs/streaming-kafka-integration.md ---
@@ -74,15 +74,15 @@ Next, we discuss how to use this approach in your
streaming application.
[Maven
repository](http://search.maven.org/#search|ga|1|a%3A%22spark-streaming-kafka-assembly_2.10%22%20AND%20v%3A%22{{site.SPARK_VERSION_SHORT}}%22)
and add it to `spark-submit` with `--jars`.
## Approach 2: Direct Approach (No Receivers)
-This is a new receiver-less "direct" approach has been introduced in Spark
1.3 to ensure stronger end-to-end guarantees. Instead of using receivers to
receive data, this approach periodically queries Kafka for the latest offsets
in each topic+partition, and accordingly defines the offset ranges to process
in each batch. When the jobs to process the data are launched, Kafka's simple
consumer API is used to read the defined ranges of offsets from Kafka (similar
to read files from a file system). Note that this is an experimental feature in
Spark 1.3 and is only available in the Scala and Java API.
+This new receiver-less "direct" approach has been introduced in Spark 1.3
to ensure stronger end-to-end guarantees. Instead of using receivers to receive
data, this approach periodically queries Kafka for the latest offsets in each
topic+partition, and accordingly defines the offset ranges to process in each
batch. When the jobs to process the data are launched, Kafka's simple consumer
API is used to read the defined ranges of offsets from Kafka (similar to reading
files from a file system). Note that this is an experimental feature introduced
in Spark 1.3 for the Scala and Java API. Spark 1.4 added a Python API, but it
is not yet at full feature parity.
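(Editorial illustration, not part of the PR diff.) The per-batch mechanics described above — query Kafka for the latest offset in each topic+partition, then define the range to process — can be sketched in plain Python. The function name, the `(topic, partition)` tuple keys, and the sample offsets are all hypothetical, not Spark internals:

```python
# Illustrative sketch of how a direct (receiver-less) stream might turn
# "latest offsets per topic+partition" into per-batch offset ranges.
# `compute_offset_ranges` and the data shapes are hypothetical names,
# not the actual Spark Streaming implementation.

def compute_offset_ranges(last_consumed, latest):
    """For each topic+partition, define the half-open range
    [from_offset, until_offset) to read in the next batch."""
    ranges = []
    for (topic, partition), until_offset in latest.items():
        from_offset = last_consumed.get((topic, partition), 0)
        if until_offset > from_offset:
            ranges.append((topic, partition, from_offset, until_offset))
    return ranges

# One RDD partition per Kafka partition: each range below would become one
# RDD partition that reads its offsets with the simple consumer API.
latest = {("events", 0): 120, ("events", 1): 95}
last = {("events", 0): 100, ("events", 1): 95}
print(compute_offset_ranges(last, latest))
# [('events', 0, 100, 120)]  -- partition 1 is already caught up
```

This also makes the one-to-one mapping concrete: the number of ranges (hence RDD partitions) never exceeds the number of Kafka partitions.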
-This approach has the following advantages over the received-based
approach (i.e. Approach 1).
+This approach has the following advantages over the receiver-based
approach (i.e. Approach 1).
-- *Simplified Parallelism:* No need to create multiple input Kafka streams
and union-ing them. With `directStream`, Spark Streaming will create as many
RDD partitions as there is Kafka partitions to consume, which will all read
data from Kafka in parallel. So there is one-to-one mapping between Kafka and
RDD partitions, which is easier to understand and tune.
+- *Simplified Parallelism:* No need to create multiple input Kafka streams
and union them. With `directStream`, Spark Streaming will create as many RDD
partitions as there are Kafka partitions to consume, which will all read data
from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD
partitions, which is easier to understand and tune.
-- *Efficiency:* Achieving zero-data loss in the first approach required
the data to be stored in a Write Ahead Log, which further replicated the data.
This is actually inefficient as the data effectively gets replicated twice -
once by Kafka, and a second time by the Write Ahead Log. This second approach
eliminate the problem as there is no receiver, and hence no need for Write
Ahead Logs.
+- *Efficiency:* Achieving zero-data loss in the first approach required
the data to be stored in a Write Ahead Log, which further replicated the data.
This is actually inefficient as the data effectively gets replicated twice -
once by Kafka, and a second time by the Write Ahead Log. This second approach
eliminates the problem as there is no receiver, and hence no need for Write
Ahead Logs. As long as you have sufficient Kafka retention, messages can be
recovered from Kafka.
-- *Exactly-once semantics:* The first approach uses Kafka's high level API
to store consumed offsets in Zookeeper. This is traditionally the way to
consume data from Kafka. While this approach (in combination with write ahead
logs) can ensure zero data loss (i.e. at-least once semantics), there is a
small chance some records may get consumed twice under some failures. This
occurs because of inconsistencies between data reliably received by Spark
Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we
use simple Kafka API that does not use Zookeeper and offsets tracked only by
Spark Streaming within its checkpoints. This eliminates inconsistencies between
Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark
Streaming effectively exactly once despite failures.
+- *Exactly-once semantics:* The first approach uses Kafka's high level API
to store consumed offsets in Zookeeper. This is traditionally the way to
consume data from Kafka. While this approach (in combination with write ahead
logs) can ensure zero data loss (i.e. at-least-once semantics), there is a
small chance some records may get consumed twice under some failures. This
occurs because of inconsistencies between data reliably received by Spark
Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we
use the simple Kafka API that does not use Zookeeper. Offsets are tracked by Spark
Streaming within its checkpoints. This eliminates inconsistencies between Spark
Streaming and Zookeeper/Kafka, and so each record is received by Spark
Streaming effectively exactly once despite failures. In order to achieve
exactly-once semantics for output of your results, your save method must be
either idempotent, or an atomic transaction that saves results and offsets in
your own data store.
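(Editorial illustration, not part of the PR diff.) The "atomic transaction that saves results and offsets in your own data store" pattern can be sketched with a self-contained Python example using sqlite3. The schema and the helper name are illustrative assumptions, not a Spark API:

```python
# Hedged sketch of the atomic-save pattern: commit the batch's results and
# its offsets together, so a batch replayed after a failure is a no-op.
# Table layout and `save_batch_atomically` are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (k TEXT PRIMARY KEY, v INTEGER)")
conn.execute("CREATE TABLE offsets (topic TEXT, part INTEGER, until INTEGER,"
             " PRIMARY KEY (topic, part))")

def save_batch_atomically(conn, rows, topic, partition, until_offset):
    """Commit results and the new offset in one transaction; skip the
    whole batch if this offset range was already recorded (a replay)."""
    row = conn.execute("SELECT until FROM offsets WHERE topic=? AND part=?",
                       (topic, partition)).fetchone()
    if row is not None and row[0] >= until_offset:
        return False  # already processed; replay after failure does nothing
    with conn:  # one atomic transaction: results and offsets together
        conn.executemany("INSERT OR REPLACE INTO results (k, v) VALUES (?, ?)",
                         rows)
        conn.execute("INSERT OR REPLACE INTO offsets (topic, part, until)"
                     " VALUES (?, ?, ?)", (topic, partition, until_offset))
    return True

# First delivery applies; the replayed duplicate batch is skipped.
print(save_batch_atomically(conn, [("a", 1)], "events", 0, 120))  # True
print(save_batch_atomically(conn, [("a", 1)], "events", 0, 120))  # False
```

Because the offset check and the writes share one transaction, a crash between "save results" and "save offsets" cannot leave the store half-updated, which is what makes the end-to-end output effectively exactly-once.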
--- End diff --
* "save method" --> "output operation that saves the data to external data
stores"
* Also refer to the fault-tolerance section in the main programming guide.
This stuff is common for everything.
* nit: Can you capitalize Write Ahead Logs