[bahir-website] 05/07: Update documentation for Spark extensions

lresende Mon, 14 Dec 2020 17:41:59 -0800

This is an automated email from the ASF dual-hosted git repository.

lresende pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/bahir-website.git


commit fdcf039618d495938aad7d6953038fc11ee585d0
Author: Luciano Resende <[email protected]>
AuthorDate: Mon Dec 14 17:36:56 2020 -0800

    Update documentation for Spark extensions
---
 site/docs/spark/current/spark-sql-cloudant.md      | 94 +++++++++++++---------
 .../docs/spark/current/spark-sql-streaming-akka.md | 66 +++++++++------
 .../docs/spark/current/spark-sql-streaming-mqtt.md | 67 ++++++++++-----
 site/docs/spark/current/spark-streaming-akka.md    | 23 +++++-
 site/docs/spark/current/spark-streaming-mqtt.md    | 25 +++++-
 site/docs/spark/current/spark-streaming-pubnub.md  | 29 ++++++-
 site/docs/spark/current/spark-streaming-pubsub.md  | 42 +++++++---
 site/docs/spark/current/spark-streaming-twitter.md | 41 +++++++++-
 site/docs/spark/current/spark-streaming-zeromq.md  | 24 +++++-
 9 files changed, 308 insertions(+), 103 deletions(-)

diff --git a/site/docs/spark/current/spark-sql-cloudant.md 
b/site/docs/spark/current/spark-sql-cloudant.md
index 355f10c..5cc9704 100644
--- a/site/docs/spark/current/spark-sql-cloudant.md
+++ b/site/docs/spark/current/spark-sql-cloudant.md
@@ -24,12 +24,32 @@ limitations under the License.
 -->
 
 {% include JB/setup %}
-A library for reading data from Cloudant or CouchDB databases using Spark SQL 
and Spark Streaming. 
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Apache CouchDB/Cloudant Data Source, Streaming Connector and SQL Streaming 
Data Source
 
-[IBM® Cloudant®](https://cloudant.com) is a document-oriented DataBase as a 
Service (DBaaS). It stores data as documents 
-in JSON format. It's built with scalability, high availability, and durability 
in mind. It comes with a 
-wide variety of indexing options including map-reduce, Cloudant Query, 
full-text indexing, and 
-geospatial indexing. The replication capabilities make it easy to keep data in 
sync between database 
+A library for reading data from Cloudant or CouchDB databases using Spark SQL 
and Spark Streaming.
+
+[IBM® Cloudant®](https://cloudant.com) is a document-oriented DataBase as a 
Service (DBaaS). It stores data as documents
+in JSON format. It's built with scalability, high availability, and durability 
in mind. It comes with a
+wide variety of indexing options including map-reduce, Cloudant Query, 
full-text indexing, and
+geospatial indexing. The replication capabilities make it easy to keep data in 
sync between database
 clusters, desktop PCs, and mobile devices.
 
 [Apache CouchDB™](http://couchdb.apache.org) is open source database software 
that focuses on ease of use and having an architecture that "completely 
embraces the Web". It has a document-oriented NoSQL database architecture and 
is implemented in the concurrency-oriented language Erlang; it uses JSON to 
store data, JavaScript as its query language using MapReduce, and HTTP for an 
API.
@@ -56,16 +76,16 @@ Unlike using `--jars`, using `--packages` ensures that this 
library and its depe
 The `--packages` argument can also be used with `bin/spark-submit`.
 
 Submit a job in Python:
-    
+
     spark-submit  --master local[4] --packages 
org.apache.bahir:spark-sql-cloudant__{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION}}
  <path to python script>
-    
+
 Submit a job in Scala:
 
        spark-submit --class "<your class>" --master local[4] --packages 
org.apache.bahir:spark-sql-cloudant__{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION}}
 <path to spark-sql-cloudant jar>
 
-This library is compiled for Scala 2.11 only, and intends to support Spark 2.0 
onwards.
+This library is cross-published for Scala 2.11 and Scala 2.12, so users should 
replace the proper Scala version in the commands listed above.
 
-## Configuration options       
+## Configuration options
 The configuration is obtained in the following sequence:
 
 1. default in the Config, which is set in the application.conf
@@ -90,7 +110,7 @@ cloudant.host| |cloudant host url
 cloudant.username| |cloudant userid
 cloudant.password| |cloudant password
 cloudant.numberOfRetries|3| number of times to replay a request that received 
a 429 `Too Many Requests` response
-cloudant.useQuery|false|by default, `_all_docs` endpoint is used if 
configuration 'view' and 'index' (see below) are not set. When useQuery is 
enabled, `_find` endpoint will be used in place of `_all_docs` when query 
condition is not on primary key field (_id), so that query predicates may be 
driven into datastore. 
+cloudant.useQuery|false|by default, `_all_docs` endpoint is used if 
configuration 'view' and 'index' (see below) are not set. When useQuery is 
enabled, `_find` endpoint will be used in place of `_all_docs` when query 
condition is not on primary key field (_id), so that query predicates may be 
driven into datastore.
 cloudant.queryLimit|25|the maximum number of results returned when querying 
the `_find` endpoint.
 cloudant.storageLevel|MEMORY_ONLY|the storage level for persisting Spark RDDs 
during load when `cloudant.endpoint` is set to `_changes`.  See [RDD 
Persistence 
section](https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)
 in Spark's Progamming Guide for all available storage level options.
 cloudant.timeout|60000|stop the response after waiting the defined number of 
milliseconds for data.  Only supported with `changes` endpoint.
@@ -100,12 +120,12 @@ jsonstore.rdd.minInPartition|10|the min rows in a 
partition.
 jsonstore.rdd.requestTimeout|900000|the request timeout in milliseconds
 bulkSize|200|the bulk save size
 schemaSampleSize|-1|the sample size for RDD schema discovery. 1 means we are 
using only the first document for schema discovery; -1 means all documents; 0 
will be treated as 1; any number N means min(N, total) docs. Only supported 
with `_all_docs` endpoint.
-createDBOnSave|false|whether to create a new database during save operation. 
If false, a database should already exist. If true, a new database will be 
created. If true, and a database with a provided name already exists, an error 
will be raised. 
+createDBOnSave|false|whether to create a new database during save operation. 
If false, a database should already exist. If true, a new database will be 
created. If true, and a database with a provided name already exists, an error 
will be raised.
 
 The `cloudant.endpoint` option sets ` _changes` or `_all_docs` API endpoint to 
be called while loading Cloudant data into Spark DataFrames or SQL Tables.
 
-**Note:** When using `_changes` API, please consider: 
-1. Results are partially ordered and may not be be presented in order in 
+**Note:** When using `_changes` API, please consider:
+1. Results are partially ordered and may not be be presented in order in
 which documents were updated.
 2. In case of shards' unavailability, you may see duplicate results (changes 
that have been seen already)
 3. Can use `selector` option to filter Cloudant docs during load
@@ -116,23 +136,23 @@ which documents were updated.
 When using `_all_docs` API:
 1. Supports parallel reads (using offset and range) and partitioning.
 2. Using partitions may not represent the true snapshot of a database.  Some 
docs
-   may be added or deleted in the database between loading data into different 
+   may be added or deleted in the database between loading data into different
    Spark partitions.
 
 If loading Cloudant docs from a database greater than 100 MB, set 
`cloudant.endpoint` to `_changes` and `spark.streaming.unpersist` to `false`.
 This will enable RDD persistence during load against `_changes` endpoint and 
allow the persisted RDDs to be accessible after streaming completes.  
- 
-See 
[CloudantChangesDFSuite](src/test/scala/org/apache/bahir/cloudant/CloudantChangesDFSuite.scala)
 
+
+See 
[CloudantChangesDFSuite](src/test/scala/org/apache/bahir/cloudant/CloudantChangesDFSuite.scala)
 for examples of loading data into a Spark DataFrame with `_changes` API.
 
 ### Configuration on Spark SQL Temporary Table or DataFrame
 
-Besides all the configurations passed to a temporary table or dataframe 
through SparkConf, it is also possible to set the following configurations in 
temporary table or dataframe using OPTIONS: 
+Besides all the configurations passed to a temporary table or dataframe 
through SparkConf, it is also possible to set the following configurations in 
temporary table or dataframe using OPTIONS:
 
 Name | Default | Meaning
 --- |:---:| ---
 bulkSize|200| the bulk save size
-createDBOnSave|false| whether to create a new database during save operation. 
If false, a database should already exist. If true, a new database will be 
created. If true, and a database with a provided name already exists, an error 
will be raised. 
+createDBOnSave|false| whether to create a new database during save operation. 
If false, a database should already exist. If true, a new database will be 
created. If true, and a database with a provided name already exists, an error 
will be raised.
 database| | Cloudant database name
 index| | Cloudant Search index without the database name. Search index queries 
are limited to returning 200 results so can only be used to load data with <= 
200 results.
 path| | Cloudant: as database name if database is not present
@@ -140,7 +160,7 @@ schemaSampleSize|-1| the sample size used to discover the 
schema for this temp t
 selector|all documents| a selector written in Cloudant Query syntax, 
specifying conditions for selecting documents when the `cloudant.endpoint` 
option is set to `_changes`. Only documents satisfying the selector's 
conditions will be retrieved from Cloudant and loaded into Spark.
 view| | Cloudant view w/o the database name. only used for load.
 
-For fast loading, views are loaded without include_docs. Thus, a derived 
schema will always be: `{id, key, value}`, where `value `can be a compount 
field. An example of loading data from a view: 
+For fast loading, views are loaded without include_docs. Thus, a derived 
schema will always be: `{id, key, value}`, where `value `can be a compount 
field. An example of loading data from a view:
 
 ```python
 spark.sql(" CREATE TEMPORARY TABLE flightTable1 USING 
org.apache.bahir.cloudant OPTIONS ( database 'n_flight', view 
'_design/view/_view/AA0')")
@@ -166,8 +186,8 @@ The above stated configuration keys can also be set using 
`spark-submit --conf`
 
 ### Python API
 
-#### Using SQL In Python 
-       
+#### Using SQL In Python
+
 ```python
 spark = SparkSession\
     .builder\
@@ -194,7 +214,7 @@ Submit job example:
 spark-submit  --packages 
org.apache.bahir:spark-sql-cloudant_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION}}
 --conf spark.cloudant.host=ACCOUNT.cloudant.com --conf 
spark.cloudant.username=USERNAME --conf spark.cloudant.password=PASSWORD 
sql-cloudant/examples/python/CloudantApp.py
 ```
 
-#### Using DataFrame In Python 
+#### Using DataFrame In Python
 
 ```python
 spark = SparkSession\
@@ -208,17 +228,17 @@ spark = SparkSession\
 
 # ***1. Loading dataframe from Cloudant db
 df = spark.read.load("n_airportcodemapping", "org.apache.bahir.cloudant")
-df.cache() 
+df.cache()
 df.printSchema()
 df.filter(df.airportName >= 'Moscow').select("_id",'airportName').show()
 df.filter(df._id >= 'CAA').select("_id",'airportName').show()      
 ```
 
 See [CloudantDF.py](examples/python/CloudantDF.py) for examples.
-       
+
 In case of doing multiple operations on a dataframe (select, filter etc.),
 you should persist a dataframe. Otherwise, every operation on a dataframe will 
load the same data from Cloudant again.
-Persisting will also speed up computation. This statement will persist an RDD 
in memory: `df.cache()`.  Alternatively for large dbs to persist in memory & 
disk, use: 
+Persisting will also speed up computation. This statement will persist an RDD 
in memory: `df.cache()`.  Alternatively for large dbs to persist in memory & 
disk, use:
 
 ```python
 from pyspark import StorageLevel
@@ -229,7 +249,7 @@ df.persist(storageLevel = StorageLevel(True, True, False, 
True, 1))
 
 ### Scala API
 
-#### Using SQL In Scala 
+#### Using SQL In Scala
 
 ```scala
 val spark = SparkSession
@@ -242,7 +262,7 @@ val spark = SparkSession
 
 // For implicit conversions of Dataframe to RDDs
 import spark.implicits._
-    
+
 // create a temp table from Cloudant db and query it using sql syntax
 spark.sql(
     s"""
@@ -264,7 +284,7 @@ Submit job example:
 spark-submit --class org.apache.spark.examples.sql.cloudant.CloudantApp 
--packages 
org.apache.bahir:spark-sql-cloudant_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION}}
 --conf spark.cloudant.host=ACCOUNT.cloudant.com --conf 
spark.cloudant.username=USERNAME --conf spark.cloudant.password=PASSWORD  
/path/to/spark-sql-cloudant_{{site.SCALA_BINARY_VERSION}}-{{site.SPARK_VERSION}}-tests.jar
 ```
 
-### Using DataFrame In Scala 
+### Using DataFrame In Scala
 
 ```scala
 val spark = SparkSession
@@ -276,12 +296,12 @@ val spark = SparkSession
       .config("createDBOnSave","true") // to create a db on save
       .config("jsonstore.rdd.partitions", "20") // using 20 partitions
       .getOrCreate()
-          
+
 // 1. Loading data from Cloudant db
 val df = spark.read.format("org.apache.bahir.cloudant").load("n_flight")
 // Caching df in memory to speed computations
 // and not to retrieve data from cloudant again
-df.cache() 
+df.cache()
 df.printSchema()
 
 // 2. Saving dataframe to Cloudant db
@@ -292,11 +312,11 @@ 
df2.write.format("org.apache.bahir.cloudant").save("n_flight2")
 ```
 
 See 
[CloudantDF.scala](examples/scala/src/main/scala/mytest/spark/CloudantDF.scala) 
for examples.
-    
+
 [Sample 
code](examples/scala/src/main/scala/mytest/spark/CloudantDFOption.scala) on 
using DataFrame option to define Cloudant configuration.
- 
- 
-### Using Streams In Scala 
+
+
+### Using Streams In Scala
 
 ```scala
 val ssc = new StreamingContext(sparkConf, Seconds(10))
@@ -323,13 +343,13 @@ ssc.start()
 // run streaming for 120 secs
 Thread.sleep(120000L)
 ssc.stop(true)
-       
+
 ```
 
 See 
[CloudantStreaming.scala](examples/scala/src/main/scala/mytest/spark/CloudantStreaming.scala)
 for examples.
 
-By default, Spark Streaming will load all documents from a database. If you 
want to limit the loading to 
-specific documents, use `selector` option of `CloudantReceiver` and specify 
your conditions 
+By default, Spark Streaming will load all documents from a database. If you 
want to limit the loading to
+specific documents, use `selector` option of `CloudantReceiver` and specify 
your conditions
 (See 
[CloudantStreamingSelector.scala](examples/scala/src/main/scala/mytest/spark/CloudantStreamingSelector.scala)
 example for more details):
 
diff --git a/site/docs/spark/current/spark-sql-streaming-akka.md 
b/site/docs/spark/current/spark-sql-streaming-akka.md
index d88fc91..5fe1bab 100644
--- a/site/docs/spark/current/spark-sql-streaming-akka.md
+++ b/site/docs/spark/current/spark-sql-streaming-akka.md
@@ -24,7 +24,27 @@ limitations under the License.
 -->
 
 {% include JB/setup %}
-A library for reading data from Akka Actors using Spark SQL Streaming ( or 
Structured streaming.). 
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark SQL Streaming Akka Data Source
+
+A library for reading data from Akka Actors using Spark SQL Streaming ( or 
Structured streaming.).
 
 ## Linking
 
@@ -48,7 +68,7 @@ For example, to include it when starting the spark shell:
 Unlike using `--jars`, using `--packages` ensures that this library and its 
dependencies will be added to the classpath.
 The `--packages` argument can also be used with `bin/spark-submit`.
 
-This library is compiled for Scala 2.11 only, and intends to support Spark 2.0 
onwards.
+This library is cross-published for Scala 2.11 and Scala 2.12, so users should 
replace the proper Scala version in the commands listed above.
 
 ## Examples
 
@@ -58,27 +78,27 @@ A SQL Stream can be created with data streams received from 
Akka Feeder actor us
                 
.format("org.apache.bahir.sql.streaming.akka.AkkaStreamSourceProvider")
                 .option("urlOfPublisher", "feederActorUri")
                 .load()
-                
+
 ## Enable recovering from failures.
-                
+
 Setting values for option `persistenceDirPath` helps in recovering in case of 
a restart, by restoring the state where it left off before the shutdown.
-                
+
         sqlContext.readStream
                 
.format("org.apache.bahir.sql.streaming.akka.AkkaStreamSourceProvider")
                 .option("urlOfPublisher", "feederActorUri")
                 .option("persistenceDirPath", "/path/to/localdir")
-                .load() 
-                       
+                .load()
+
 ## Configuration options.
-                       
+
 This source uses [Akka Actor 
api](http://doc.akka.io/api/akka/2.5/akka/actor/Actor.html).
-                       
+
 * `urlOfPublisher` The url of Publisher or Feeder actor that the Receiver 
actor connects to. Set this as the tcp url of the Publisher or Feeder actor.
 * `persistenceDirPath` By default it is used for storing incoming messages on 
disk.
 
 ### Scala API
 
-An example, for scala API to count words from incoming message stream. 
+An example, for scala API to count words from incoming message stream.
 
         // Create DataFrame representing the stream of input lines from 
connection
         // to publisher or feeder actor
@@ -86,27 +106,27 @@ An example, for scala API to count words from incoming 
message stream.
                     
.format("org.apache.bahir.sql.streaming.akka.AkkaStreamSourceProvider")
                     .option("urlOfPublisher", urlOfPublisher)
                     .load().as[(String, Timestamp)]
-    
+
         // Split the lines into words
         val words = lines.map(_._1).flatMap(_.split(" "))
-    
+
         // Generate running word count
         val wordCounts = words.groupBy("value").count()
-    
+
         // Start running the query that prints the running counts to the 
console
         val query = wordCounts.writeStream
                     .outputMode("complete")
                     .format("console")
                     .start()
-    
+
         query.awaitTermination()
-        
+
 Please see `AkkaStreamWordCount.scala` for full example.     
-   
+
 ### Java API
-   
+
 An example, for Java API to count words from incoming message stream.
-   
+
         // Create DataFrame representing the stream of input lines from 
connection
         // to publisher or feeder actor
         Dataset<String> lines = spark
@@ -114,7 +134,7 @@ An example, for Java API to count words from incoming 
message stream.
                                 
.format("org.apache.bahir.sql.streaming.akka.AkkaStreamSourceProvider")
                                 .option("urlOfPublisher", urlOfPublisher)
                                 .load().select("value").as(Encoders.STRING());
-    
+
         // Split the lines into words
         Dataset<String> words = lines.flatMap(new FlatMapFunction<String, 
String>() {
           @Override
@@ -122,16 +142,16 @@ An example, for Java API to count words from incoming 
message stream.
             return Arrays.asList(s.split(" ")).iterator();
           }
         }, Encoders.STRING());
-    
+
         // Generate running word count
         Dataset<Row> wordCounts = words.groupBy("value").count();
-    
+
         // Start running the query that prints the running counts to the 
console
         StreamingQuery query = wordCounts.writeStream()
                                 .outputMode("complete")
                                 .format("console")
                                 .start();
-    
+
         query.awaitTermination();   
-         
+
 Please see `JavaAkkaStreamWordCount.java` for full example.      
diff --git a/site/docs/spark/current/spark-sql-streaming-mqtt.md 
b/site/docs/spark/current/spark-sql-streaming-mqtt.md
index 3317648..55b7c7f 100644
--- a/site/docs/spark/current/spark-sql-streaming-mqtt.md
+++ b/site/docs/spark/current/spark-sql-streaming-mqtt.md
@@ -25,6 +25,26 @@ limitations under the License.
 
 {% include JB/setup %}
 
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark SQL Streaming MQTT Data Source
+
 A library for writing and reading data from MQTT Servers using Spark SQL 
Streaming (or Structured streaming).
 
 ## Linking
@@ -49,7 +69,7 @@ For example, to include it when starting the spark shell:
 Unlike using `--jars`, using `--packages` ensures that this library and its 
dependencies will be added to the classpath.
 The `--packages` argument can also be used with `bin/spark-submit`.
 
-This library is compiled for Scala 2.11 only, and intends to support Spark 2.0 
onwards.
+This library is cross-published for Scala 2.11 and Scala 2.12, so users should 
replace the proper Scala version in the commands listed above.
 
 ## Examples
 
@@ -84,19 +104,31 @@ Setting values for option `localStorage` and `clientId` 
helps in recovering in c
 
 This connector uses [Eclipse Paho Java 
Client](https://eclipse.org/paho/clients/java/). Client API documentation is 
located [here](http://www.eclipse.org/paho/files/javadoc/index.html).
 
- * `brokerUrl` An URL MqttClient connects to. Set this or `path` as the URL of 
the Mqtt Server. e.g. tcp://localhost:1883.
- * `persistence` By default it is used for storing incoming messages on disk. 
If `memory` is provided as value for this option, then recovery on restart is 
not supported.
- * `topic` Topic MqttClient subscribes to.
- * `clientId` clientId, this client is associated with. Provide the same value 
to recover a stopped source client. MQTT sink ignores client identifier, 
because Spark batch can be distributed across multiple workers whereas MQTT 
broker does not allow simultanous connections with same ID from multiple hosts.
- * `QoS` The maximum quality of service to subscribe each topic at. Messages 
published at a lower quality of service will be received at the published QoS. 
Messages published at a higher quality of service will be received using the 
QoS specified on the subscribe.
- * `username` Sets the user name to use for the connection to Mqtt Server. Do 
not set it, if server does not need this. Setting it empty will lead to errors.
- * `password` Sets the password to use for the connection.
- * `cleanSession` Setting it true starts a clean session, removes all 
checkpointed messages by a previous run of this source. This is set to false by 
default.
- * `connectionTimeout` Sets the connection timeout, a value of 0 is 
interpretted as wait until client connects. See 
`MqttConnectOptions.setConnectionTimeout` for more information.
- * `keepAlive` Same as `MqttConnectOptions.setKeepAliveInterval`.
- * `mqttVersion` Same as `MqttConnectOptions.setMqttVersion`.
- * `maxInflight` Same as `MqttConnectOptions.setMaxInflight`
- * `autoReconnect` Same as `MqttConnectOptions.setAutomaticReconnect`
+| Parameter name             | Description                                     
                                                                                
                                                                                
                                                                                
  | Eclipse Paho reference                                                   |
+|----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------|
+| `brokerUrl`                | URL MQTT client connects to. Specify this 
parameter or _path_. Example: _tcp://localhost:1883_, _ssl://localhost:1883_.   
                                                                                
                                                                                
        |                                                                       
   |
+| `persistence`              | Defines how incoming messages are stored. If 
_memory_ is provided as value for this option, recovery on restart is not 
supported. Otherwise messages are stored on disk and parameter _localStorage_ 
may define target directory.                                                    
             |                                                                  
        |
+| `topic`                    | Topic which client subscribes to.               
                                                                                
                                                                                
                                                                                
  |                                                                          |
+| `clientId`                 | Uniquely identifies client instance. Provide 
the same value to recover a stopped source client. MQTT sink ignores client 
identifier, because Spark batch can be distributed across multiple workers 
whereas MQTT broker does not allow simultaneous connections with same ID from 
multiple hosts. |                                                               
           |
+| `QoS`                      | The maximum quality of service to subscribe 
each topic at. Messages published at a lower quality of service will be 
received at the published QoS. Messages published at a higher quality of 
service will be received using the QoS specified on the subscribe.              
                     |                                                          
                |
+| `username`                 | User name used to authenticate with MQTT 
server. Do not set it, if server does not require authentication. Leaving empty 
may lead to errors.                                                             
                                                                                
         | `MqttConnectOptions.setUserName`                                     
    |
+| `password`                 | User password.                                  
                                                                                
                                                                                
                                                                                
  | `MqttConnectOptions.setPassword`                                         |
+| `cleanSession`             | Setting to _true_ starts a clean session, 
removes all check-pointed messages persisted during previous run. Defaults to 
`false`.                                                                        
                                                                                
          | `MqttConnectOptions.setCleanSession`                                
     |
+| `connectionTimeout`        | Sets the connection timeout, a value of _0_ is 
interpreted as wait until client connects.                                      
                                                                                
                                                                                
   | `MqttConnectOptions.setConnectionTimeout`                                |
+| `keepAlive`                | Sets the "keep alive" interval in seconds.      
                                                                                
                                                                                
                                                                                
  | `MqttConnectOptions.setKeepAliveInterval`                                |
+| `mqttVersion`              | Specify MQTT protocol version.                  
                                                                                
                                                                                
                                                                                
  | `MqttConnectOptions.setMqttVersion`                                      |
+| `maxInflight`              | Sets the maximum inflight requests. Useful for 
high volume traffic.                                                            
                                                                                
                                                                                
   | `MqttConnectOptions.setMaxInflight`                                      |
+| `autoReconnect`            | Sets whether the client will automatically 
attempt to reconnect to the server upon connectivity disruption.                
                                                                                
                                                                                
       | `MqttConnectOptions.setAutomaticReconnect`                             
  |
+| `ssl.protocol`             | SSL protocol. Example: _SSLv3_, _TLS_, _TLSv1_, 
_TLSv1.2_.                                                                      
                                                                                
                                                                                
  | `MqttConnectOptions.setSSLProperties`, `com.ibm.ssl.protocol`            |
+| `ssl.key.store`            | Absolute path to key store file.                
                                                                                
                                                                                
                                                                                
  | `MqttConnectOptions.setSSLProperties`, `com.ibm.ssl.keyStore`            |
+| `ssl.key.store.password`   | Key store password.                             
                                                                                
                                                                                
                                                                                
  | `MqttConnectOptions.setSSLProperties`, `com.ibm.ssl.keyStorePassword`    |
+| `ssl.key.store.type`       | Key store type. Example: _JKS_, _JCEKS_, 
_PKCS12_.                                                                       
                                                                                
                                                                                
         | `MqttConnectOptions.setSSLProperties`, `com.ibm.ssl.keyStoreType`    
    |
+| `ssl.key.store.provider`   | Key store provider. Example: _IBMJCE_.          
                                                                                
                                                                                
                                                                                
  | `MqttConnectOptions.setSSLProperties`, `com.ibm.ssl.keyStoreProvider`    |
+| `ssl.trust.store`          | Absolute path to trust store file.              
                                                                                
                                                                                
                                                                                
  | `MqttConnectOptions.setSSLProperties`, `com.ibm.ssl.trustStore`          |
+| `ssl.trust.store.password` | Trust store password.                           
                                                                                
                                                                                
                                                                                
  | `MqttConnectOptions.setSSLProperties`, `com.ibm.ssl.trustStorePassword`  |
+| `ssl.trust.store.type`     | Trust store type. Example: _JKS_, _JCEKS_, 
_PKCS12_.                                                                       
                                                                                
                                                                                
       | `MqttConnectOptions.setSSLProperties`, `com.ibm.ssl.trustStoreType`    
  |
+| `ssl.trust.store.provider` | Trust store provider. Example: _IBMJCEFIPS_.    
                                                                                
                                                                                
                                                                                
  | `MqttConnectOptions.setSSLProperties`, `com.ibm.ssl.trustStoreProvider`  |
+| `ssl.ciphers`              | List of enabled cipher suites. Example: 
_SSL_RSA_WITH_AES_128_CBC_SHA_.                                                 
                                                                                
                                                                                
          | `MqttConnectOptions.setSSLProperties`, 
`com.ibm.ssl.enabledCipherSuites` |
 
 ## Environment variables
 
@@ -110,7 +142,7 @@ Custom environment variables allowing to manage MQTT 
connectivity performed by s
 
 ### Scala API
 
-An example, for scala API to count words from incoming message stream. 
+An example, for scala API to count words from incoming message stream.
 
     // Create DataFrame representing the stream of input lines from connection 
to mqtt server
     val lines = spark.readStream
@@ -136,7 +168,7 @@ Please see `MQTTStreamWordCount.scala` for full example. 
Review `MQTTSinkWordCou
 
 ### Java API
 
-An example, for Java API to count words from incoming message stream. 
+An example, for Java API to count words from incoming message stream.
 
     // Create DataFrame representing the stream of input lines from connection 
to mqtt server.
     Dataset<String> lines = spark
@@ -169,7 +201,7 @@ Please see `JavaMQTTStreamWordCount.java` for full example. 
Review `JavaMQTTSink
 
 ## Best Practices.
 
-1. Turn Mqtt into a more reliable messaging service. 
+1. Turn Mqtt into a more reliable messaging service.
 
 > *MQTT is a machine-to-machine (M2M)/"Internet of Things" connectivity 
 > protocol. It was designed as an extremely lightweight publish/subscribe 
 > messaging transport.*
 
@@ -215,4 +247,3 @@ The design of Mqtt and the purpose it serves goes well 
together, but often in an
 Generally, one would create a lot of streaming pipelines to solve this 
problem. This would either require a very sophisticated scheduling setup or 
will waste a lot of resources, as it is not certain which stream is using more 
amount of data.
 
 The general solution is both less optimum and is more cumbersome to operate, 
with multiple moving parts incurs a high maintenance overall. As an 
alternative, in this situation, one can setup a single topic kafka-spark 
stream, where message from each of the varied stream contains a unique tag 
separating one from other streams. This way at the processing end, one can 
distinguish the message from one another and apply the right kind of decoding 
and processing. Similarly while storing, each  [...]
-
diff --git a/site/docs/spark/current/spark-streaming-akka.md 
b/site/docs/spark/current/spark-streaming-akka.md
index 0ede902..edeea88 100644
--- a/site/docs/spark/current/spark-streaming-akka.md
+++ b/site/docs/spark/current/spark-streaming-akka.md
@@ -24,8 +24,27 @@ limitations under the License.
 -->
 
 {% include JB/setup %}
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark Streaming Akka Connector
 
-A library for reading data from Akka Actors using Spark Streaming. 
+A library for reading data from Akka Actors using Spark Streaming.
 
 ## Linking
 
@@ -49,7 +68,7 @@ For example, to include it when starting the spark shell:
 Unlike using `--jars`, using `--packages` ensures that this library and its 
dependencies will be added to the classpath.
 The `--packages` argument can also be used with `bin/spark-submit`.
 
-This library is cross-published for Scala 2.10 and Scala 2.11, so users should 
replace the proper Scala version (2.10 or 2.11) in the commands listed above.
+This library is cross-published for Scala 2.11 and Scala 2.12, so users should 
replace the proper Scala version in the commands listed above.
 
 ## Examples
 
diff --git a/site/docs/spark/current/spark-streaming-mqtt.md 
b/site/docs/spark/current/spark-streaming-mqtt.md
index 7166f02..9910bfa 100644
--- a/site/docs/spark/current/spark-streaming-mqtt.md
+++ b/site/docs/spark/current/spark-streaming-mqtt.md
@@ -25,8 +25,27 @@ limitations under the License.
 
 {% include JB/setup %}
 
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
 
-[MQTT](http://mqtt.org/) is MQTT is a machine-to-machine (M2M)/"Internet of 
Things" connectivity protocol. It was designed as an extremely lightweight 
publish/subscribe messaging transport. It is useful for connections with remote 
locations where a small code footprint is required and/or network bandwidth is 
at a premium. 
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark Streaming MQTT Connector
+
+[MQTT](http://mqtt.org/) is MQTT is a machine-to-machine (M2M)/"Internet of 
Things" connectivity protocol. It was designed as an extremely lightweight 
publish/subscribe messaging transport. It is useful for connections with remote 
locations where a small code footprint is required and/or network bandwidth is 
at a premium.
 
 ## Linking
 
@@ -50,7 +69,7 @@ For example, to include it when starting the spark shell:
 Unlike using `--jars`, using `--packages` ensures that this library and its 
dependencies will be added to the classpath.
 The `--packages` argument can also be used with `bin/spark-submit`.
 
-This library is cross-published for Scala 2.10 and Scala 2.11, so users should 
replace the proper Scala version (2.10 or 2.11) in the commands listed above.
+This library is cross-published for Scala 2.11 and Scala 2.12, so users should 
replace the proper Scala version in the commands listed above.
 
 ## Configuration options.
 
@@ -113,4 +132,4 @@ Create a DStream from a list of topics.
 
 ```Python
        MQTTUtils.createPairedStream(ssc, broker_url, topics)
-```
\ No newline at end of file
+```
diff --git a/site/docs/spark/current/spark-streaming-pubnub.md 
b/site/docs/spark/current/spark-streaming-pubnub.md
index 84f7fe8..6cc5dda 100644
--- a/site/docs/spark/current/spark-streaming-pubnub.md
+++ b/site/docs/spark/current/spark-streaming-pubnub.md
@@ -24,6 +24,24 @@ limitations under the License.
 -->
 
 {% include JB/setup %}
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
 # Spark Streaming PubNub Connector
 
 Library for reading data from real-time messaging infrastructure 
[PubNub](https://www.pubnub.com/) using Spark Streaming.
@@ -89,9 +107,9 @@ For complete code examples, please review _examples_ 
directory.
     config.setSubscribeKey(subscribeKey)
     config.setSecure(true)
     config.setReconnectionPolicy(PNReconnectionPolicy.LINEAR)
-    Set<String> channels = new HashSet<String>() {
+    Set<String> channels = new HashSet<String>() {{
         add("my-channel");
-    };
+    }};
 
     ReceiverInputDStream<SparkPubNubMessage> pubNubStream = 
PubNubUtils.createStream(
       ssc, config, channels, Collections.EMPTY_SET, null,
@@ -100,4 +118,9 @@ For complete code examples, please review _examples_ 
directory.
 
 ## Unit Test
 
-Unit tests take advantage of publicly available _demo_ subscription and and 
publish key, which has limited request rate.
+Unit tests take advantage of publicly available _demo_ subscription and 
publish key, which have limited request rate.
+Anyone playing with PubNub _demo_ credentials may interrupt the tests, 
therefore execution of integration tests
+has to be explicitly enabled by setting environment variable 
_ENABLE_PUBNUB_TESTS_ to _1_.
+
+    cd streaming-pubnub
+    ENABLE_PUBNUB_TESTS=1 mvn clean test
diff --git a/site/docs/spark/current/spark-streaming-pubsub.md 
b/site/docs/spark/current/spark-streaming-pubsub.md
index 4736aca..f50e33b 100644
--- a/site/docs/spark/current/spark-streaming-pubsub.md
+++ b/site/docs/spark/current/spark-streaming-pubsub.md
@@ -24,16 +24,36 @@ limitations under the License.
 -->
 
 {% include JB/setup %}
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark Streaming Google Cloud Pub/Sub Connector
+
 A library for reading data from [Google Cloud 
Pub/Sub](https://cloud.google.com/pubsub/) using Spark Streaming.
 
 ## Linking
 
 Using SBT:
-    
+
     libraryDependencies += "org.apache.bahir" %% "spark-streaming-pubsub" % 
"{{site.SPARK_VERSION}}"
-    
+
 Using Maven:
-    
+
     <dependency>
         <groupId>org.apache.bahir</groupId>
         
<artifactId>spark-streaming-pubsub_{{site.SCALA_BINARY_VERSION}}</artifactId>
@@ -53,20 +73,22 @@ The `--packages` argument can also be used with 
`bin/spark-submit`.
 First you need to create credential by SparkGCPCredentials, it support four 
type of credentials
 * application default
     `SparkGCPCredentials.builder.build()`
-* json type service account
+* JSON type service account (based on file or its binary content)
     `SparkGCPCredentials.builder.jsonServiceAccount(PATH_TO_JSON_KEY).build()`
-* p12 type service account
+    `SparkGCPCredentials.builder.jsonServiceAccount(JSON_KEY_BYTES).build()`
+* P12 type service account
     `SparkGCPCredentials.builder.p12ServiceAccount(PATH_TO_P12_KEY, 
EMAIL_ACCOUNT).build()`
-* metadata service account(running on dataproc)
+    `SparkGCPCredentials.builder.p12ServiceAccount(P12_KEY_BYTES, 
EMAIL_ACCOUNT).build()`
+* Metadata service account (running on dataproc)
     `SparkGCPCredentials.builder.metadataServiceAccount().build()`
 
 ### Scala API
-    
+
     val lines = PubsubUtils.createStream(ssc, projectId, subscriptionName, 
credential, ..)
-    
+
 ### Java API
-    
-    JavaDStream<SparkPubsubMessage> lines = PubsubUtils.createStream(jssc, 
projectId, subscriptionName, credential...) 
+
+    JavaDStream<SparkPubsubMessage> lines = PubsubUtils.createStream(jssc, 
projectId, subscriptionName, credential...)
 
 See end-to-end examples at [Google Cloud Pubsub 
Examples](streaming-pubsub/examples)
 
diff --git a/site/docs/spark/current/spark-streaming-twitter.md 
b/site/docs/spark/current/spark-streaming-twitter.md
index 3efc7f5..8abf744 100644
--- a/site/docs/spark/current/spark-streaming-twitter.md
+++ b/site/docs/spark/current/spark-streaming-twitter.md
@@ -24,8 +24,27 @@ limitations under the License.
 -->
 
 {% include JB/setup %}
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
 
-A library for reading social data from [twitter](http://twitter.com/) using 
Spark Streaming. 
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Spark Streaming Twitter Connector
+
+A library for reading social data from [twitter](http://twitter.com/) using 
Spark Streaming.
 
 ## Linking
 
@@ -49,7 +68,7 @@ For example, to include it when starting the spark shell:
 Unlike using `--jars`, using `--packages` ensures that this library and its 
dependencies will be added to the classpath.
 The `--packages` argument can also be used with `bin/spark-submit`.
 
-This library is cross-published for Scala 2.10 and Scala 2.11, so users should 
replace the proper Scala version (2.10 or 2.11) in the commands listed above.
+This library is cross-published for Scala 2.11 and Scala 2.12, so users should 
replace the proper Scala version in the commands listed above.
 
 
 ## Examples
@@ -70,5 +89,19 @@ can be provided by any of the 
[methods](http://twitter4j.org/en/configuration.ht
     TwitterUtils.createStream(jssc);
 
 
-You can also either get the public stream, or get the filtered stream based on 
keywords. 
-See end-to-end examples at [Twitter 
Examples](https://github.com/apache/bahir/tree/master/streaming-twitter/examples)
\ No newline at end of file
+You can also either get the public stream, or get the filtered stream based on 
keywords.
+See end-to-end examples at [Twitter 
Examples](https://github.com/apache/bahir/tree/master/streaming-twitter/examples).
+
+## Unit Test
+
+Executing integration tests requires users to register custom application at
+[Twitter Developer Portal](https://developer.twitter.com) and obtain private 
OAuth credentials.
+Below listing present how to run complete test suite on local workstation.
+
+    cd streaming-twitter
+    env ENABLE_TWITTER_TESTS=1 \
+        twitter4j.oauth.consumerKey=${customer key} \
+        twitter4j.oauth.consumerSecret=${customer secret} \
+        twitter4j.oauth.accessToken=${access token} \
+        twitter4j.oauth.accessTokenSecret=${access token secret} \
+        mvn clean test
diff --git a/site/docs/spark/current/spark-streaming-zeromq.md 
b/site/docs/spark/current/spark-streaming-zeromq.md
index 034380a..f715e77 100644
--- a/site/docs/spark/current/spark-streaming-zeromq.md
+++ b/site/docs/spark/current/spark-streaming-zeromq.md
@@ -24,9 +24,27 @@ limitations under the License.
 -->
 
 {% include JB/setup %}
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
 # Spark Streaming ZeroMQ Connector
 
-A library for reading data from [ZeroMQ](http://zeromq.org/) using Spark 
Streaming. 
+A library for reading data from [ZeroMQ](http://zeromq.org/) using Spark 
Streaming.
 
 ## Linking
 
@@ -50,7 +68,7 @@ For example, to include it when starting the spark shell:
 Unlike using `--jars`, using `--packages` ensures that this library and its 
dependencies will be added to the classpath.
 The `--packages` argument can also be used with `bin/spark-submit`.
 
-This library is cross-published for Scala 2.10 and Scala 2.11, so users should 
replace the proper Scala version (2.10 or 2.11) in the commands listed above.
+This library is cross-published for Scala 2.11 and Scala 2.12, so users should 
replace the proper Scala version in the commands listed above.
 
 ## Examples
 
@@ -73,4 +91,4 @@ Review end-to-end examples at [ZeroMQ 
Examples](https://github.com/apache/bahir/
     JavaReceiverInputDStream<String> test1 = ZeroMQUtils.createJavaStream(
         ssc, "tcp://server:5555", true, Arrays.asList("my-topic.getBytes()),
         StorageLevel.MEMORY_AND_DISK_SER_2()
-    );
\ No newline at end of file
+    );

[bahir-website] 05/07: Update documentation for Spark extensions

Reply via email to