This is an automated email from the ASF dual-hosted git repository.
lresende pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/bahir.git
The following commit(s) were added to refs/heads/master by this push:
new 3912360 [MINOR] Add license headers and section titles to component README.md (#95)
3912360 is described below
commit 3912360ca5bcca269a30ff42120cac46934693c4
Author: Luciano Resende <[email protected]>
AuthorDate: Sun Jan 19 12:17:48 2020 -0800
[MINOR] Add license headers and section titles to component README.md (#95)
---
README.md | 20 +++++++++-
pom.xml | 1 -
sql-cloudant/README.md | 94 +++++++++++++++++++++++++++-----------------
sql-streaming-akka/README.md | 64 +++++++++++++++++++-----------
sql-streaming-jdbc/README.md | 22 ++++++++++-
sql-streaming-mqtt/README.md | 27 +++++++++++--
sql-streaming-sqs/README.md | 26 ++++++++++--
streaming-akka/README.md | 23 ++++++++++-
streaming-mqtt/README.md | 25 ++++++++++--
streaming-pubnub/README.md | 30 +++++++++++---
streaming-pubsub/README.md | 34 ++++++++++++----
streaming-twitter/README.md | 27 +++++++++++--
streaming-zeromq/README.md | 22 ++++++++++-
13 files changed, 322 insertions(+), 93 deletions(-)
diff --git a/README.md b/README.md
index bae2005..158f812 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,21 @@
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
# Apache Bahir
Apache Bahir provides extensions to distributed analytics platforms such as
Apache Spark & Apache Flink.
@@ -8,7 +26,7 @@ Apache Bahir provides extensions to distributed analytics platforms such as Apac
The Initial Bahir source code (see issue [BAHIR-1](https://issues.apache.org/jira/browse/BAHIR-1)) containing the source for the Apache Spark streaming connectors for akka, mqtt, twitter, zeromq extracted from [Apache Spark revision 8301fad](https://github.com/apache/spark/tree/8301fadd8d269da11e72870b7a889596e3337839)
-(before the [deletion of the streaming connectors akka, mqtt, twitter, zeromq](https://issues.apache.org/jira/browse/SPARK-13843)).
+(before the [deletion of the streaming connectors akka, mqtt, twitter, zeromq](https://issues.apache.org/jira/browse/SPARK-13843)).
## Source code structure
diff --git a/pom.xml b/pom.xml
index 6988e39..943de99 100644
--- a/pom.xml
+++ b/pom.xml
@@ -494,7 +494,6 @@
<exclude>.project</exclude>
<exclude>**/dependency-reduced-pom.xml</exclude>
<exclude>**/target/**</exclude>
- <exclude>**/README.md</exclude>
<exclude>**/examples/data/*.txt</exclude>
<exclude>**/*.iml</exclude>
<exclude>**/src/main/resources/application.conf</exclude>
diff --git a/sql-cloudant/README.md b/sql-cloudant/README.md
index d315d7f..594fb2d 100644
--- a/sql-cloudant/README.md
+++ b/sql-cloudant/README.md
@@ -1,9 +1,29 @@
-A library for reading data from Cloudant or CouchDB databases using Spark SQL and Spark Streaming.
-
-[IBM® Cloudant®](https://cloudant.com) is a document-oriented DataBase as a Service (DBaaS). It stores data as documents
-in JSON format. It's built with scalability, high availability, and durability in mind. It comes with a
-wide variety of indexing options including map-reduce, Cloudant Query, full-text indexing, and
-geospatial indexing. The replication capabilities make it easy to keep data in sync between database
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Apache CouchDB/Cloudant Data Source, Streaming Connector and SQL Streaming Data Source
+
+A library for reading data from Cloudant or CouchDB databases using Spark SQL and Spark Streaming.
+
+[IBM® Cloudant®](https://cloudant.com) is a document-oriented DataBase as a Service (DBaaS). It stores data as documents
+in JSON format. It's built with scalability, high availability, and durability in mind. It comes with a
+wide variety of indexing options including map-reduce, Cloudant Query, full-text indexing, and
+geospatial indexing. The replication capabilities make it easy to keep data in sync between database
clusters, desktop PCs, and mobile devices.
[Apache CouchDB™](http://couchdb.apache.org) is open source database software
that focuses on ease of use and having an architecture that "completely
embraces the Web". It has a document-oriented NoSQL database architecture and
is implemented in the concurrency-oriented language Erlang; it uses JSON to
store data, JavaScript as its query language using MapReduce, and HTTP for an
API.
@@ -30,16 +50,16 @@ Unlike using `--jars`, using `--packages` ensures that this library and its depe
The `--packages` argument can also be used with `bin/spark-submit`.
Submit a job in Python:
-
+
spark-submit --master local[4] --packages org.apache.bahir:spark-sql-cloudant__{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION}} <path to python script>
-
+
Submit a job in Scala:
spark-submit --class "<your class>" --master local[4] --packages org.apache.bahir:spark-sql-cloudant__{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION}} <path to spark-sql-cloudant jar>
This library is cross-published for Scala 2.11 and Scala 2.12, so users should replace the proper Scala version in the commands listed above.
-## Configuration options
+## Configuration options
The configuration is obtained in the following sequence:
1. default in the Config, which is set in the application.conf
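The layered lookup that this sequence describes (a value set closer to the call site overrides one set earlier in the chain) can be illustrated with a minimal, hypothetical Python sketch; the function and the sample option names below are invented for illustration and are not the connector's API:

```python
# Hypothetical sketch of layered configuration resolution: an option set on
# the temporary table/DataFrame overrides a SparkConf entry, which overrides
# the application.conf default. All names below are illustrative only.
def resolve_option(name, defaults, spark_conf, options):
    """Return the value from the highest-priority source defining `name`."""
    for source in (options, spark_conf, defaults):
        if name in source:
            return source[name]
    return None

defaults = {"cloudant.protocol": "https", "bulkSize": "200"}
spark_conf = {"cloudant.host": "ACCOUNT.cloudant.com", "bulkSize": "500"}
options = {"bulkSize": "1000"}

print(resolve_option("bulkSize", defaults, spark_conf, options))          # 1000
print(resolve_option("cloudant.protocol", defaults, spark_conf, options)) # https
```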
@@ -64,7 +84,7 @@ cloudant.host| |cloudant host url
cloudant.username| |cloudant userid
cloudant.password| |cloudant password
cloudant.numberOfRetries|3| number of times to replay a request that received a 429 `Too Many Requests` response
-cloudant.useQuery|false|by default, `_all_docs` endpoint is used if configuration 'view' and 'index' (see below) are not set. When useQuery is enabled, `_find` endpoint will be used in place of `_all_docs` when query condition is not on primary key field (_id), so that query predicates may be driven into datastore.
+cloudant.useQuery|false|by default, `_all_docs` endpoint is used if configuration 'view' and 'index' (see below) are not set. When useQuery is enabled, `_find` endpoint will be used in place of `_all_docs` when query condition is not on primary key field (_id), so that query predicates may be driven into datastore.
cloudant.queryLimit|25|the maximum number of results returned when querying the `_find` endpoint.
cloudant.storageLevel|MEMORY_ONLY|the storage level for persisting Spark RDDs during load when `cloudant.endpoint` is set to `_changes`. See [RDD Persistence section](https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence) in Spark's Progamming Guide for all available storage level options.
cloudant.timeout|60000|stop the response after waiting the defined number of milliseconds for data. Only supported with `changes` endpoint.
@@ -74,12 +94,12 @@ jsonstore.rdd.minInPartition|10|the min rows in a partition.
jsonstore.rdd.requestTimeout|900000|the request timeout in milliseconds
bulkSize|200|the bulk save size
schemaSampleSize|-1|the sample size for RDD schema discovery. 1 means we are using only the first document for schema discovery; -1 means all documents; 0 will be treated as 1; any number N means min(N, total) docs. Only supported with `_all_docs` endpoint.
-createDBOnSave|false|whether to create a new database during save operation. If false, a database should already exist. If true, a new database will be created. If true, and a database with a provided name already exists, an error will be raised.
+createDBOnSave|false|whether to create a new database during save operation. If false, a database should already exist. If true, a new database will be created. If true, and a database with a provided name already exists, an error will be raised.
The `cloudant.endpoint` option sets ` _changes` or `_all_docs` API endpoint to be called while loading Cloudant data into Spark DataFrames or SQL Tables.
-**Note:** When using `_changes` API, please consider:
-1. Results are partially ordered and may not be be presented in order in
+**Note:** When using `_changes` API, please consider:
+1. Results are partially ordered and may not be be presented in order in
which documents were updated.
2. In case of shards' unavailability, you may see duplicate results (changes that have been seen already)
3. Can use `selector` option to filter Cloudant docs during load
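The `schemaSampleSize` rules documented in the table above can be sketched as a small helper. This is an illustrative reimplementation of the documented semantics, not actual connector code:

```python
# Illustrative helper mirroring the documented schemaSampleSize semantics:
# 1 -> only the first document, -1 -> all documents, 0 -> treated as 1,
# any other N -> min(N, total). Not actual sql-cloudant code.
def effective_sample_size(schema_sample_size, total_docs):
    if schema_sample_size == -1:
        return total_docs
    if schema_sample_size == 0:
        return min(1, total_docs)
    return min(schema_sample_size, total_docs)

print(effective_sample_size(-1, 500))  # 500: all documents
print(effective_sample_size(0, 500))   # 1: zero is treated as 1
print(effective_sample_size(25, 10))   # 10: min(N, total)
```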
@@ -90,23 +110,23 @@ which documents were updated.
When using `_all_docs` API:
1. Supports parallel reads (using offset and range) and partitioning.
2. Using partitions may not represent the true snapshot of a database. Some docs
- may be added or deleted in the database between loading data into different
+ may be added or deleted in the database between loading data into different
Spark partitions.
If loading Cloudant docs from a database greater than 100 MB, set `cloudant.endpoint` to `_changes` and `spark.streaming.unpersist` to `false`.
This will enable RDD persistence during load against `_changes` endpoint and allow the persisted RDDs to be accessible after streaming completes.
-
-See [CloudantChangesDFSuite](src/test/scala/org/apache/bahir/cloudant/CloudantChangesDFSuite.scala)
+
+See [CloudantChangesDFSuite](src/test/scala/org/apache/bahir/cloudant/CloudantChangesDFSuite.scala)
for examples of loading data into a Spark DataFrame with `_changes` API.
### Configuration on Spark SQL Temporary Table or DataFrame
-Besides all the configurations passed to a temporary table or dataframe through SparkConf, it is also possible to set the following configurations in temporary table or dataframe using OPTIONS:
+Besides all the configurations passed to a temporary table or dataframe through SparkConf, it is also possible to set the following configurations in temporary table or dataframe using OPTIONS:
Name | Default | Meaning
--- |:---:| ---
bulkSize|200| the bulk save size
-createDBOnSave|false| whether to create a new database during save operation. If false, a database should already exist. If true, a new database will be created. If true, and a database with a provided name already exists, an error will be raised.
+createDBOnSave|false| whether to create a new database during save operation. If false, a database should already exist. If true, a new database will be created. If true, and a database with a provided name already exists, an error will be raised.
database| | Cloudant database name
index| | Cloudant Search index without the database name. Search index queries are limited to returning 200 results so can only be used to load data with <= 200 results.
path| | Cloudant: as database name if database is not present
@@ -114,7 +134,7 @@ schemaSampleSize|-1| the sample size used to discover the schema for this temp t
selector|all documents| a selector written in Cloudant Query syntax, specifying conditions for selecting documents when the `cloudant.endpoint` option is set to `_changes`. Only documents satisfying the selector's conditions will be retrieved from Cloudant and loaded into Spark.
view| | Cloudant view w/o the database name. only used for load.
-For fast loading, views are loaded without include_docs. Thus, a derived schema will always be: `{id, key, value}`, where `value `can be a compount field. An example of loading data from a view:
+For fast loading, views are loaded without include_docs. Thus, a derived schema will always be: `{id, key, value}`, where `value `can be a compount field. An example of loading data from a view:
```python
spark.sql(" CREATE TEMPORARY TABLE flightTable1 USING org.apache.bahir.cloudant OPTIONS ( database 'n_flight', view '_design/view/_view/AA0')")
@@ -140,8 +160,8 @@ The above stated configuration keys can also be set using `spark-submit --conf`
### Python API
-#### Using SQL In Python
-
+#### Using SQL In Python
+
```python
spark = SparkSession\
.builder\
@@ -168,7 +188,7 @@ Submit job example:
spark-submit --packages org.apache.bahir:spark-sql-cloudant_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION}} --conf spark.cloudant.host=ACCOUNT.cloudant.com --conf spark.cloudant.username=USERNAME --conf spark.cloudant.password=PASSWORD sql-cloudant/examples/python/CloudantApp.py
```
-#### Using DataFrame In Python
+#### Using DataFrame In Python
```python
spark = SparkSession\
@@ -182,17 +202,17 @@ spark = SparkSession\
# ***1. Loading dataframe from Cloudant db
df = spark.read.load("n_airportcodemapping", "org.apache.bahir.cloudant")
-df.cache()
+df.cache()
df.printSchema()
df.filter(df.airportName >= 'Moscow').select("_id",'airportName').show()
df.filter(df._id >= 'CAA').select("_id",'airportName').show()
```
See [CloudantDF.py](examples/python/CloudantDF.py) for examples.
-
+
In case of doing multiple operations on a dataframe (select, filter etc.),
you should persist a dataframe. Otherwise, every operation on a dataframe will
load the same data from Cloudant again.
-Persisting will also speed up computation. This statement will persist an RDD in memory: `df.cache()`. Alternatively for large dbs to persist in memory & disk, use:
+Persisting will also speed up computation. This statement will persist an RDD in memory: `df.cache()`. Alternatively for large dbs to persist in memory & disk, use:
```python
from pyspark import StorageLevel
@@ -203,7 +223,7 @@ df.persist(storageLevel = StorageLevel(True, True, False, True, 1))
### Scala API
-#### Using SQL In Scala
+#### Using SQL In Scala
```scala
val spark = SparkSession
@@ -216,7 +236,7 @@ val spark = SparkSession
// For implicit conversions of Dataframe to RDDs
import spark.implicits._
-
+
// create a temp table from Cloudant db and query it using sql syntax
spark.sql(
s"""
@@ -238,7 +258,7 @@ Submit job example:
spark-submit --class org.apache.spark.examples.sql.cloudant.CloudantApp --packages org.apache.bahir:spark-sql-cloudant_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION}} --conf spark.cloudant.host=ACCOUNT.cloudant.com --conf spark.cloudant.username=USERNAME --conf spark.cloudant.password=PASSWORD /path/to/spark-sql-cloudant_{{site.SCALA_BINARY_VERSION}}-{{site.SPARK_VERSION}}-tests.jar
```
-### Using DataFrame In Scala
+### Using DataFrame In Scala
```scala
val spark = SparkSession
@@ -250,12 +270,12 @@ val spark = SparkSession
.config("createDBOnSave","true") // to create a db on save
.config("jsonstore.rdd.partitions", "20") // using 20 partitions
.getOrCreate()
-
+
// 1. Loading data from Cloudant db
val df = spark.read.format("org.apache.bahir.cloudant").load("n_flight")
// Caching df in memory to speed computations
// and not to retrieve data from cloudant again
-df.cache()
+df.cache()
df.printSchema()
// 2. Saving dataframe to Cloudant db
@@ -266,11 +286,11 @@ df2.write.format("org.apache.bahir.cloudant").save("n_flight2")
```
See [CloudantDF.scala](examples/scala/src/main/scala/mytest/spark/CloudantDF.scala) for examples.
-
+
[Sample code](examples/scala/src/main/scala/mytest/spark/CloudantDFOption.scala) on using DataFrame option to define Cloudant configuration.
-
-
-### Using Streams In Scala
+
+
+### Using Streams In Scala
```scala
val ssc = new StreamingContext(sparkConf, Seconds(10))
@@ -297,13 +317,13 @@ ssc.start()
// run streaming for 120 secs
Thread.sleep(120000L)
ssc.stop(true)
-
+
```
See [CloudantStreaming.scala](examples/scala/src/main/scala/mytest/spark/CloudantStreaming.scala) for examples.
-By default, Spark Streaming will load all documents from a database. If you want to limit the loading to
-specific documents, use `selector` option of `CloudantReceiver` and specify your conditions
+By default, Spark Streaming will load all documents from a database. If you want to limit the loading to
+specific documents, use `selector` option of `CloudantReceiver` and specify your conditions
(See [CloudantStreamingSelector.scala](examples/scala/src/main/scala/mytest/spark/CloudantStreamingSelector.scala) example for more details):
diff --git a/sql-streaming-akka/README.md b/sql-streaming-akka/README.md
index a29979b..2141d62 100644
--- a/sql-streaming-akka/README.md
+++ b/sql-streaming-akka/README.md
@@ -1,4 +1,24 @@
-A library for reading data from Akka Actors using Spark SQL Streaming ( or Structured streaming.).
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark SQL Streaming Akka Data Source
+
+A library for reading data from Akka Actors using Spark SQL Streaming ( or Structured streaming.).
## Linking
@@ -32,27 +52,27 @@ A SQL Stream can be created with data streams received from Akka Feeder actor us
.format("org.apache.bahir.sql.streaming.akka.AkkaStreamSourceProvider")
.option("urlOfPublisher", "feederActorUri")
.load()
-
+
## Enable recovering from failures.
-
+
Setting values for option `persistenceDirPath` helps in recovering in case of a restart, by restoring the state where it left off before the shutdown.
-
+
sqlContext.readStream
.format("org.apache.bahir.sql.streaming.akka.AkkaStreamSourceProvider")
.option("urlOfPublisher", "feederActorUri")
.option("persistenceDirPath", "/path/to/localdir")
- .load()
-
+ .load()
+
## Configuration options.
-
+
This source uses [Akka Actor api](http://doc.akka.io/api/akka/2.5/akka/actor/Actor.html).
-
+
* `urlOfPublisher` The url of Publisher or Feeder actor that the Receiver actor connects to. Set this as the tcp url of the Publisher or Feeder actor.
* `persistenceDirPath` By default it is used for storing incoming messages on disk.
### Scala API
-An example, for scala API to count words from incoming message stream.
+An example, for scala API to count words from incoming message stream.
// Create DataFrame representing the stream of input lines from connection
// to publisher or feeder actor
@@ -60,27 +80,27 @@ An example, for scala API to count words from incoming message stream.
.format("org.apache.bahir.sql.streaming.akka.AkkaStreamSourceProvider")
.option("urlOfPublisher", urlOfPublisher)
.load().as[(String, Timestamp)]
-
+
// Split the lines into words
val words = lines.map(_._1).flatMap(_.split(" "))
-
+
// Generate running word count
val wordCounts = words.groupBy("value").count()
-
+
// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
.outputMode("complete")
.format("console")
.start()
-
+
query.awaitTermination()
-
+
Please see `AkkaStreamWordCount.scala` for full example.
-
+
### Java API
-
+
An example, for Java API to count words from incoming message stream.
-
+
// Create DataFrame representing the stream of input lines from connection
// to publisher or feeder actor
Dataset<String> lines = spark
@@ -88,7 +108,7 @@ An example, for Java API to count words from incoming message stream.
.format("org.apache.bahir.sql.streaming.akka.AkkaStreamSourceProvider")
.option("urlOfPublisher", urlOfPublisher)
.load().select("value").as(Encoders.STRING());
-
+
// Split the lines into words
Dataset<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
@@ -96,16 +116,16 @@ An example, for Java API to count words from incoming message stream.
return Arrays.asList(s.split(" ")).iterator();
}
}, Encoders.STRING());
-
+
// Generate running word count
Dataset<Row> wordCounts = words.groupBy("value").count();
-
+
// Start running the query that prints the running counts to the console
StreamingQuery query = wordCounts.writeStream()
.outputMode("complete")
.format("console")
.start();
-
+
query.awaitTermination();
-
+
Please see `JavaAkkaStreamWordCount.java` for full example.
diff --git a/sql-streaming-jdbc/README.md b/sql-streaming-jdbc/README.md
index e302adb..65925c8 100644
--- a/sql-streaming-jdbc/README.md
+++ b/sql-streaming-jdbc/README.md
@@ -1,4 +1,24 @@
-A library for writing data to jdbc using Spark SQL Streaming (or Structured streaming).
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark SQL Streaming JDBC Data Source
+
+A library for writing data to JDBC using Spark SQL Streaming (or Structured streaming).
## Linking
diff --git a/sql-streaming-mqtt/README.md b/sql-streaming-mqtt/README.md
index 0fbf63e..b15f773 100644
--- a/sql-streaming-mqtt/README.md
+++ b/sql-streaming-mqtt/README.md
@@ -1,3 +1,23 @@
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark SQL Streaming MQTT Data Source
+
A library for writing and reading data from MQTT Servers using Spark SQL Streaming (or Structured streaming).
## Linking
@@ -95,7 +115,7 @@ Custom environment variables allowing to manage MQTT connectivity performed by s
### Scala API
-An example, for scala API to count words from incoming message stream.
+An example, for scala API to count words from incoming message stream.
// Create DataFrame representing the stream of input lines from connection to mqtt server
val lines = spark.readStream
@@ -121,7 +141,7 @@ Please see `MQTTStreamWordCount.scala` for full example. Review `MQTTSinkWordCou
### Java API
-An example, for Java API to count words from incoming message stream.
+An example, for Java API to count words from incoming message stream.
// Create DataFrame representing the stream of input lines from connection to mqtt server.
Dataset<String> lines = spark
@@ -154,7 +174,7 @@ Please see `JavaMQTTStreamWordCount.java` for full example. Review `JavaMQTTSink
## Best Practices.
-1. Turn Mqtt into a more reliable messaging service.
+1. Turn Mqtt into a more reliable messaging service.
> *MQTT is a machine-to-machine (M2M)/"Internet of Things" connectivity
> protocol. It was designed as an extremely lightweight publish/subscribe
> messaging transport.*
@@ -200,4 +220,3 @@ The design of Mqtt and the purpose it serves goes well together, but often in an
Generally, one would create a lot of streaming pipelines to solve this problem. This would either require a very sophisticated scheduling setup or will waste a lot of resources, as it is not certain which stream is using more amount of data.
The general solution is both less optimum and is more cumbersome to operate, with multiple moving parts incurs a high maintenance overall. As an alternative, in this situation, one can setup a single topic kafka-spark stream, where message from each of the varied stream contains a unique tag separating one from other streams. This way at the processing end, one can distinguish the message from one another and apply the right kind of decoding and processing. Similarly while storing, each [...]
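The tag-and-demultiplex idea described above can be sketched as follows; the tags, decoders, and the `tag|payload` message format are invented purely for illustration and are not part of the connector:

```python
# Sketch of one shared stream carrying messages from several logical
# sources, each prefixed with a tag that selects the right decoder at the
# processing end. All names here are hypothetical.
def decode_sensor(payload):
    # e.g. numeric telemetry readings
    return ("sensor", float(payload))

def decode_log(payload):
    # e.g. free-form log lines
    return ("log", payload)

DECODERS = {"sensor": decode_sensor, "log": decode_log}

def dispatch(message):
    """Split 'tag|payload' and apply the decoder registered for the tag."""
    tag, payload = message.split("|", 1)
    return DECODERS[tag](payload)

print(dispatch("sensor|21.5"))    # ('sensor', 21.5)
print(dispatch("log|disk full"))  # ('log', 'disk full')
```

Each message type can then be decoded, processed, and stored by its own handler even though all messages travel through a single pipeline.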
-
diff --git a/sql-streaming-sqs/README.md b/sql-streaming-sqs/README.md
index c59ff5d..7008766 100644
--- a/sql-streaming-sqs/README.md
+++ b/sql-streaming-sqs/README.md
@@ -1,4 +1,24 @@
-A library for reading data from Amzon S3 with optimised listing using Amazon SQS using Spark SQL Streaming ( or Structured streaming.).
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark SQL Streaming Amazon SQS Data Source
+
+A library for reading data from Amzon S3 with optimised listing using Amazon SQS using Spark SQL Streaming ( or Structured streaming.).
## Linking
@@ -32,9 +52,9 @@ Name |Default | Meaning
sqsUrl|required, no default value|sqs queue url, like 'https://sqs.us-east-1.amazonaws.com/330183209093/TestQueue'
region|required, no default value|AWS region where queue is created
fileFormat|required, no default value|file format for the s3 files stored on Amazon S3
-schema|required, no default value|schema of the data being read
+schema|required, no default value|schema of the data being read
sqsFetchIntervalSeconds|10|time interval (in seconds) after which to fetch messages from Amazon SQS queue
-sqsLongPollingWaitTimeSeconds|20|wait time (in seconds) for long polling on Amazon SQS queue
+sqsLongPollingWaitTimeSeconds|20|wait time (in seconds) for long polling on Amazon SQS queue
sqsMaxConnections|1|number of parallel threads to connect to Amazon SQS queue
sqsMaxRetries|10|Maximum number of consecutive retries in case of a connection failure to SQS before giving up
ignoreFileDeletion|false|whether to ignore any File deleted message in SQS queue
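As a rough illustration of the give-up behaviour implied by `sqsMaxRetries` above (abandon fetching after a number of consecutive connection failures), here is a minimal sketch; the `fetch` callable is a stand-in and assumes nothing about the actual SQS client:

```python
# Hedged sketch of retry-until-give-up: keep retrying a failing fetch on
# consecutive connection failures, re-raising once the retry budget is
# exhausted. Illustrative only, not the connector's implementation.
def fetch_with_retries(fetch, max_retries):
    failures = 0
    while True:
        try:
            return fetch()
        except ConnectionError:
            failures += 1
            if failures >= max_retries:
                raise
```

A fetch that recovers within the budget succeeds transparently; one that keeps failing surfaces the last `ConnectionError` to the caller.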
diff --git a/streaming-akka/README.md b/streaming-akka/README.md
index bff9c25..66b0afe 100644
--- a/streaming-akka/README.md
+++ b/streaming-akka/README.md
@@ -1,5 +1,24 @@
-
-A library for reading data from Akka Actors using Spark Streaming.
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark Streaming Akka Connector
+
+A library for reading data from Akka Actors using Spark Streaming.
## Linking
diff --git a/streaming-mqtt/README.md b/streaming-mqtt/README.md
index 811f822..84afbbf 100644
--- a/streaming-mqtt/README.md
+++ b/streaming-mqtt/README.md
@@ -1,5 +1,24 @@
-
-[MQTT](http://mqtt.org/) is MQTT is a machine-to-machine (M2M)/"Internet of Things" connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport. It is useful for connections with remote locations where a small code footprint is required and/or network bandwidth is at a premium.
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark Streaming MQTT Connector
+
+[MQTT](http://mqtt.org/) is MQTT is a machine-to-machine (M2M)/"Internet of Things" connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport. It is useful for connections with remote locations where a small code footprint is required and/or network bandwidth is at a premium.
## Linking
@@ -86,4 +105,4 @@ Create a DStream from a list of topics.
```Python
MQTTUtils.createPairedStream(ssc, broker_url, topics)
-```
\ No newline at end of file
+```
diff --git a/streaming-pubnub/README.md b/streaming-pubnub/README.md
index e8097ee..f0ab495 100644
--- a/streaming-pubnub/README.md
+++ b/streaming-pubnub/README.md
@@ -1,3 +1,21 @@
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
# Spark Streaming PubNub Connector
Library for reading data from real-time messaging infrastructure [PubNub](https://www.pubnub.com/) using Spark Streaming.
@@ -5,11 +23,11 @@ Library for reading data from real-time messaging infrastructure [PubNub](https:
## Linking
Using SBT:
-
+
    libraryDependencies += "org.apache.bahir" %% "spark-streaming-pubnub" % "{{site.SPARK_VERSION}}"
-
+
Using Maven:
-
+
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>spark-streaming-pubnub_{{site.SCALA_BINARY_VERSION}}</artifactId>
@@ -38,7 +56,7 @@ For complete code examples, please review _examples_ directory.
import com.pubnub.api.PNConfiguration
import com.pubnub.api.enums.PNReconnectionPolicy
-
+
import org.apache.spark.streaming.pubnub.{PubNubUtils, SparkPubNubMessage}
val config = new PNConfiguration
@@ -55,7 +73,7 @@ For complete code examples, please review _examples_ directory.
import com.pubnub.api.PNConfiguration
import com.pubnub.api.enums.PNReconnectionPolicy
-
+
import org.apache.spark.streaming.pubnub.PubNubUtils
import org.apache.spark.streaming.pubnub.SparkPubNubMessage
@@ -79,4 +97,4 @@ Anyone playing with PubNub _demo_ credentials may interrupt the tests, therefore
has to be explicitly enabled by setting environment variable
_ENABLE_PUBNUB_TESTS_ to _1_.
cd streaming-pubnub
- ENABLE_PUBNUB_TESTS=1 mvn clean test
\ No newline at end of file
+ ENABLE_PUBNUB_TESTS=1 mvn clean test
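The PubNub hunks above show the `PNConfiguration` imports and setup being re-touched; for context, a fuller receive loop looks roughly like this Scala sketch. The subscribe key and channel name are placeholders (PubNub's public `demo` key), an existing `SparkContext sc` is assumed, and the exact `createStream` parameter list should be confirmed against the README's complete examples:

```scala
import com.pubnub.api.PNConfiguration
import com.pubnub.api.enums.PNReconnectionPolicy

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.pubnub.{PubNubUtils, SparkPubNubMessage}

val config = new PNConfiguration
config.setSubscribeKey("demo")  // placeholder credentials
config.setSecure(true)
config.setReconnectionPolicy(PNReconnectionPolicy.LINEAR)

val ssc = new StreamingContext(sc, Seconds(2))

// Subscribe to one channel, no channel groups, starting from "now".
val stream = PubNubUtils.createStream(
  ssc, config, Seq("test-channel"), Seq(), None,
  StorageLevel.MEMORY_AND_DISK_SER_2)

// Each SparkPubNubMessage wraps the JSON payload published to the channel.
stream.map(message => message.getPayload).print()
```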
diff --git a/streaming-pubsub/README.md b/streaming-pubsub/README.md
index f20e5c9..74f185e 100644
--- a/streaming-pubsub/README.md
+++ b/streaming-pubsub/README.md
@@ -1,13 +1,33 @@
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark Streaming Google Cloud Pub/Sub Connector
+
A library for reading data from [Google Cloud Pub/Sub](https://cloud.google.com/pubsub/) using Spark Streaming.
## Linking
Using SBT:
-
+
    libraryDependencies += "org.apache.bahir" %% "spark-streaming-pubsub" % "{{site.SPARK_VERSION}}"
-
+
Using Maven:
-
+
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>spark-streaming-pubsub_{{site.SCALA_BINARY_VERSION}}</artifactId>
@@ -37,12 +57,12 @@ First you need to create credential by SparkGCPCredentials, it support four type
`SparkGCPCredentials.builder.metadataServiceAccount().build()`
### Scala API
-
+
    val lines = PubsubUtils.createStream(ssc, projectId, subscriptionName, credential, ..)
-
+
### Java API
-
-    JavaDStream<SparkPubsubMessage> lines = PubsubUtils.createStream(jssc, projectId, subscriptionName, credential...)
+
+    JavaDStream<SparkPubsubMessage> lines = PubsubUtils.createStream(jssc, projectId, subscriptionName, credential...)
 See end-to-end examples at [Google Cloud Pubsub Examples](streaming-pubsub/examples)
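The Scala and Java one-liners in the Pub/Sub hunks above elide the credential setup that the README's "SparkGCPCredentials" section describes; a fuller Scala sketch follows. The project, subscription, and storage level are placeholder choices, and the exact `createStream` signature should be checked against the linked end-to-end examples:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.pubsub.{PubsubUtils, SparkGCPCredentials}

val ssc = new StreamingContext(sc, Seconds(2))

// Application-default credentials; the builder also supports JSON/P12 key
// files and the GCE metadata service, per the README.
val credentials = SparkGCPCredentials.builder.build()

val messages = PubsubUtils.createStream(
  ssc, "my-gcp-project", Option("my-topic"), "my-subscription",
  credentials, StorageLevel.MEMORY_AND_DISK_SER_2)

// SparkPubsubMessage exposes the raw message body as bytes.
messages.map(message => new String(message.getData())).print()
```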
diff --git a/streaming-twitter/README.md b/streaming-twitter/README.md
index 1703606..117008e 100644
--- a/streaming-twitter/README.md
+++ b/streaming-twitter/README.md
@@ -1,5 +1,24 @@
-
-A library for reading social data from [twitter](http://twitter.com/) using Spark Streaming.
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark Streaming Twitter Connector
+
+A library for reading social data from [twitter](http://twitter.com/) using Spark Streaming.
## Linking
@@ -44,7 +63,7 @@ can be provided by any of the [methods](http://twitter4j.org/en/configuration.ht
TwitterUtils.createStream(jssc);
-You can also either get the public stream, or get the filtered stream based on keywords.
+You can also either get the public stream, or get the filtered stream based on keywords.
 See end-to-end examples at [Twitter Examples](https://github.com/apache/bahir/tree/master/streaming-twitter/examples).
## Unit Test
@@ -59,4 +78,4 @@ Below listing present how to run complete test suite on local workstation.
twitter4j.oauth.consumerSecret=${customer secret} \
twitter4j.oauth.accessToken=${access token} \
twitter4j.oauth.accessTokenSecret=${access token secret} \
- mvn clean test
\ No newline at end of file
+ mvn clean test
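The keyword-filtered stream mentioned in the Twitter hunk above looks roughly like this in Scala; the filter terms are placeholders, an existing `SparkContext sc` is assumed, and OAuth credentials must be supplied via the `twitter4j.oauth.*` system properties shown in the test instructions or via an explicit Twitter4J `Authorization`:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(sc, Seconds(2))

// Passing None for the Authorization falls back to the
// twitter4j.oauth.* system properties.
val filters = Seq("spark", "bahir")
val stream = TwitterUtils.createStream(ssc, None, filters)

// Each record is a twitter4j.Status; print just the tweet text.
stream.map(status => status.getText).print()
```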
diff --git a/streaming-zeromq/README.md b/streaming-zeromq/README.md
index 8d57424..1dce0cf 100644
--- a/streaming-zeromq/README.md
+++ b/streaming-zeromq/README.md
@@ -1,6 +1,24 @@
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
# Spark Streaming ZeroMQ Connector
-A library for reading data from [ZeroMQ](http://zeromq.org/) using Spark Streaming.
+A library for reading data from [ZeroMQ](http://zeromq.org/) using Spark Streaming.
## Linking
@@ -47,4 +65,4 @@ Review end-to-end examples at [ZeroMQ Examples](https://github.com/apache/bahir/
JavaReceiverInputDStream<String> test1 = ZeroMQUtils.createJavaStream(
         ssc, "tcp://server:5555", true, Arrays.asList("my-topic".getBytes()),
         StorageLevel.MEMORY_AND_DISK_SER_2()
- );
\ No newline at end of file
+ );