Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6863#discussion_r32788358
  
    --- Diff: docs/streaming-kafka-integration.md ---
    @@ -74,15 +74,15 @@ Next, we discuss how to use this approach in your 
streaming application.
        [Maven 
repository](http://search.maven.org/#search|ga|1|a%3A%22spark-streaming-kafka-assembly_2.10%22%20AND%20v%3A%22{{site.SPARK_VERSION_SHORT}}%22)
 and add it to `spark-submit` with `--jars`.
     
     ## Approach 2: Direct Approach (No Receivers)
    -This is a new receiver-less "direct" approach has been introduced in Spark 
1.3 to ensure stronger end-to-end guarantees. Instead of using receivers to 
receive data, this approach periodically queries Kafka for the latest offsets 
in each topic+partition, and accordingly defines the offset ranges to process 
in each batch. When the jobs to process the data are launched, Kafka's simple 
consumer API is used to read the defined ranges of offsets from Kafka (similar 
to read files from a file system). Note that this is an experimental feature in 
Spark 1.3 and is only available in the Scala and Java API.
    +This new receiver-less "direct" approach has been introduced in Spark 1.3 
to ensure stronger end-to-end guarantees. Instead of using receivers to receive 
data, this approach periodically queries Kafka for the latest offsets in each 
topic+partition, and accordingly defines the offset ranges to process in each 
batch. When the jobs to process the data are launched, Kafka's simple consumer 
API is used to read the defined ranges of offsets from Kafka (similar to read 
files from a file system). Note that this is an experimental feature introduced 
in Spark 1.3 for the Scala and Java API. Spark 1.4 added a Python API, but it 
is not yet at full feature parity.
     
    -This approach has the following advantages over the received-based 
approach (i.e. Approach 1).
    +This approach has the following advantages over the receiver-based 
approach (i.e. Approach 1).
     
    -- *Simplified Parallelism:* No need to create multiple input Kafka streams 
and union-ing them. With `directStream`, Spark Streaming will create as many 
RDD partitions as there is Kafka partitions to consume, which will all read 
data from Kafka in parallel. So there is one-to-one mapping between Kafka and 
RDD partitions, which is easier to understand and tune.
    +- *Simplified Parallelism:* No need to create multiple input Kafka streams 
and union them. With `directStream`, Spark Streaming will create as many RDD 
partitions as there are Kafka partitions to consume, which will all read data 
from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD 
partitions, which is easier to understand and tune.
     
    -- *Efficiency:* Achieving zero-data loss in the first approach required 
the data to be stored in a Write Ahead Log, which further replicated the data. 
This is actually inefficient as the data effectively gets replicated twice - 
once by Kafka, and a second time by the Write Ahead Log. This second approach 
eliminate the problem as there is no receiver, and hence no need for Write 
Ahead Logs.
    +- *Efficiency:* Achieving zero-data loss in the first approach required 
the data to be stored in a Write Ahead Log, which further replicated the data. 
This is actually inefficient as the data effectively gets replicated twice - 
once by Kafka, and a second time by the Write Ahead Log. This second approach 
eliminates the problem as there is no receiver, and hence no need for Write 
Ahead Logs. As long as you have sufficient Kafka retention, messages can be 
recovered from Kafka.
     
    -- *Exactly-once semantics:* The first approach uses Kafka's high level API 
to store consumed offsets in Zookeeper. This is traditionally the way to 
consume data from Kafka. While this approach (in combination with write ahead 
logs) can ensure zero data loss (i.e. at-least once semantics), there is a 
small chance some records may get consumed twice under some failures. This 
occurs because of inconsistencies between data reliably received by Spark 
Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we 
use simple Kafka API that does not use Zookeeper and offsets tracked only by 
Spark Streaming within its checkpoints. This eliminates inconsistencies between 
Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark 
Streaming effectively exactly once despite failures.
    +- *Exactly-once semantics:* The first approach uses Kafka's high level API 
to store consumed offsets in Zookeeper. This is traditionally the way to 
consume data from Kafka. While this approach (in combination with write ahead 
logs) can ensure zero data loss (i.e. at-least once semantics), there is a 
small chance some records may get consumed twice under some failures. This 
occurs because of inconsistencies between data reliably received by Spark 
Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we 
use simple Kafka API that does not use Zookeeper. Offsets are tracked by Spark 
Streaming within its checkpoints. This eliminates inconsistencies between Spark 
Streaming and Zookeeper/Kafka, and so each record is received by Spark 
Streaming effectively exactly once despite failures. In order to achieve 
exactly-once semantics for output of your results, your save method must be 
either idempotent, or an atomic transaction that saves results and offsets in 
your own data store.
    --- End diff --
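
    For context, the direct stream this paragraph documents is created roughly like this (a minimal Scala sketch; the broker addresses and topic name are placeholders, not taken from the docs):

    ```scala
    import kafka.serializer.StringDecoder

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectKafkaSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("DirectKafkaSketch"), Seconds(10))

        // Placeholder broker list and topic set; no receiver is created for this stream.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
        val topics = Set("myTopic")

        // Each batch gets one RDD partition per Kafka topic+partition, read with
        // Kafka's simple consumer API over the offset ranges computed for that batch.
        val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)

        directStream.map(_._2).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }
    ```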
    
    * "save method" --> "output operation that saves the data to external data 
stores" 
    * Also refer to the fault-tolerance section in the main programming guide. 
This stuff is common for everything.
    * nit: Can you capitalize "Write Ahead Logs"?
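
    On the exactly-once output point, a rough sketch of what that output operation can look like (continuing from the `directStream` sketch above; `saveResultsAndOffsets` is a hypothetical helper standing in for your own idempotent or transactional write to an external data store):

    ```scala
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    directStream.foreachRDD { rdd =>
      // RDDs produced by the direct stream expose the Kafka offset ranges they cover.
      val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { partition =>
        // Either make this write idempotent (e.g. keyed upserts), or save the records
        // and the matching offset range in a single transaction in your own data store.
        // saveResultsAndOffsets(partition, offsetRanges)  // hypothetical helper
        partition.foreach(_ => ())                         // placeholder no-op write
      }
    }
    ```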

