GitHub user zsxwing opened a pull request:
https://github.com/apache/spark/pull/15367
[SPARK-17346][SQL] Add Kafka source for Structured Streaming (branch 2.0)
## What changes were proposed in this pull request?
Backport
https://github.com/apache/spark/commit/9293734d35eb3d6e4fd4ebb86f54dd5d3a35e6db
into branch 2.0.
The only difference is the Spark version in pom file.
## How was this patch tested?
Jenkins.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zsxwing/spark kafka-source-branch-2.0
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15367.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15367
----
commit 17e6b5c3fb702e66748634260670f4f843aa38fe
Author: Shixiong Zhu <[email protected]>
Date: 2016-10-05T23:45:45Z
[SPARK-17346][SQL] Add Kafka source for Structured Streaming
## What changes were proposed in this pull request?
This PR adds a new project ` external/kafka-0-10-sql` for Structured
Streaming Kafka source.
It's based on the design doc:
https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
tdas did most of work and part of them was inspired by koeninger's work.
### Introduction
The Kafka source is a structured streaming data source to poll data from
Kafka. The schema of reading data is as follows:
Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int
The source can deal with deleting topics. However, the user should make
sure there is no Spark job processing the data when deleting a topic.
### Configuration
The user can use `DataStreamReader.option` to set the following
configurations.
Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a
query is started, either "earliest" which is from the earliest offset, or
"latest" which is just from the latest offset. Note: This only applies when a
new Streaming query is started, and that resuming will always pick up from
where the query left off.
failOnDataLost | [true, false] | true | Whether to fail the query when it's
possible that data is lost (e.g., topics are deleted, or offsets are out of
range). This may be a false alarm. You can disable it when it doesn't work as
you expected.
subscribe | A comma-separated list of topics | (none) | The topic list to
subscribe. Only one of "subscribe" and "subscribeParttern" options can be
specified for Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to
subscribe the topic. Only one of "subscribe" and "subscribeParttern" options
can be specified for Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to
poll data from Kafka in executors
fetchOffset.numRetries | int | 3 | Number of times to retry before giving
up fatch Kafka latest offsets.
fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before
retrying to fetch Kafka offsets
Kafka's own configurations can be set via `DataStreamReader.option` with
`kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`
### Usage
* Subscribe to 1 topic
```Scala
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "topic1")
.load()
```
* Subscribe to multiple topics
```Scala
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "topic1,topic2")
.load()
```
* Subscribe to a pattern
```Scala
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribePattern", "topic.*")
.load()
```
## How was this patch tested?
The new unit tests.
Author: Shixiong Zhu <[email protected]>
Author: Tathagata Das <[email protected]>
Author: Shixiong Zhu <[email protected]>
Author: cody koeninger <[email protected]>
Closes #15102 from zsxwing/kafka-source.
commit a560190dba618698ae2d9728e8b35c5954772467
Author: Shixiong Zhu <[email protected]>
Date: 2016-10-06T00:02:12Z
Fix version
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]