Repository: bahir-website
Updated Branches:
  refs/heads/master 1a80d15f0 -> c705ec885


Update spark current documentation


Project: http://git-wip-us.apache.org/repos/asf/bahir-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/bahir-website/commit/0eb334b8
Tree: http://git-wip-us.apache.org/repos/asf/bahir-website/tree/0eb334b8
Diff: http://git-wip-us.apache.org/repos/asf/bahir-website/diff/0eb334b8

Branch: refs/heads/master
Commit: 0eb334b8520f0ff4e61f16665876f5f3dab68504
Parents: 1a80d15
Author: Luciano Resende <[email protected]>
Authored: Thu Jun 7 10:41:52 2018 +0200
Committer: Luciano Resende <[email protected]>
Committed: Thu Jun 7 10:41:52 2018 +0200

----------------------------------------------------------------------
 site/docs/spark/current/spark-sql-cloudant.md   | 69 ++++++++++++++------
 .../spark/current/spark-sql-streaming-mqtt.md   |  7 +-
 site/docs/spark/current/spark-streaming-akka.md |  2 +-
 site/docs/spark/current/spark-streaming-mqtt.md | 20 +++++-
 .../spark/current/spark-streaming-pubsub.md     | 25 +++++++
 .../spark/current/spark-streaming-twitter.md    |  6 +-
 .../spark/current/spark-streaming-zeromq.md     |  4 +-
 7 files changed, 102 insertions(+), 31 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/bahir-website/blob/0eb334b8/site/docs/spark/current/spark-sql-cloudant.md
----------------------------------------------------------------------
diff --git a/site/docs/spark/current/spark-sql-cloudant.md 
b/site/docs/spark/current/spark-sql-cloudant.md
index b0f9c76..355f10c 100644
--- a/site/docs/spark/current/spark-sql-cloudant.md
+++ b/site/docs/spark/current/spark-sql-cloudant.md
@@ -57,15 +57,14 @@ The `--packages` argument can also be used with 
`bin/spark-submit`.
 
 Submit a job in Python:
     
-    spark-submit  --master local[4] --jars <path to cloudant-spark.jar>  <path 
to python script> 
+    spark-submit  --master local[4] --packages 
org.apache.bahir:spark-sql-cloudant_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION}}
  <path to python script>
     
 Submit a job in Scala:
 
-       spark-submit --class "<your class>" --master local[4] --jars <path to 
cloudant-spark.jar> <path to your app jar>
+       spark-submit --class "<your class>" --master local[4] --packages 
org.apache.bahir:spark-sql-cloudant_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION}}
 <path to spark-sql-cloudant jar>
 
 This library is compiled for Scala 2.11 only, and intends to support Spark 2.0 
onwards.
 
-
 ## Configuration options       
 The configuration is obtained in the following sequence:
 
@@ -78,25 +77,53 @@ Here each subsequent configuration overrides the previous 
one. Thus, configurati
 
 
 ### Configuration in application.conf
-Default values are defined in 
[here](cloudant-spark-sql/src/main/resources/application.conf).
+Default values are defined in [here](src/main/resources/application.conf).
 
 ### Configuration on SparkConf
 
 Name | Default | Meaning
 --- |:---:| ---
+cloudant.batchInterval|8|number of seconds to set for streaming all documents 
from `_changes` endpoint into Spark dataframe.  See [Setting the right batch 
interval](https://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval)
 for tuning this value.
+cloudant.endpoint|`_all_docs`|endpoint for RelationProvider when loading data 
from Cloudant to DataFrames or SQL temporary tables. Select between the 
Cloudant `_all_docs` or `_changes` API endpoint.  See **Note** below for 
differences between endpoints.
 cloudant.protocol|https|protocol to use to transfer data: http or https
-cloudant.host||cloudant host url
-cloudant.username||cloudant userid
-cloudant.password||cloudant password
-cloudant.useQuery|false|By default, _all_docs endpoint is used if 
configuration 'view' and 'index' (see below) are not set. When useQuery is 
enabled, _find endpoint will be used in place of _all_docs when query condition 
is not on primary key field (_id), so that query predicates may be driven into 
datastore. 
-cloudant.queryLimit|25|The maximum number of results returned when querying 
the _find endpoint.
-jsonstore.rdd.partitions|10|the number of partitions intent used to drive 
JsonStoreRDD loading query result in parallel. The actual number is calculated 
based on total rows returned and satisfying maxInPartition and minInPartition
+cloudant.host| |cloudant host url
+cloudant.username| |cloudant userid
+cloudant.password| |cloudant password
+cloudant.numberOfRetries|3| number of times to replay a request that received 
a 429 `Too Many Requests` response
+cloudant.useQuery|false|by default, `_all_docs` endpoint is used if 
configuration 'view' and 'index' (see below) are not set. When useQuery is 
enabled, `_find` endpoint will be used in place of `_all_docs` when query 
condition is not on primary key field (_id), so that query predicates may be 
driven into datastore. 
+cloudant.queryLimit|25|the maximum number of results returned when querying 
the `_find` endpoint.
+cloudant.storageLevel|MEMORY_ONLY|the storage level for persisting Spark RDDs 
during load when `cloudant.endpoint` is set to `_changes`.  See [RDD 
Persistence 
section](https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)
 in Spark's Programming Guide for all available storage level options.
+cloudant.timeout|60000|stop the response after waiting the defined number of 
milliseconds for data.  Only supported with `_changes` endpoint.
+jsonstore.rdd.partitions|10|the number of partitions used to drive JsonStoreRDD loading query results in parallel. The actual number is calculated based on total rows returned and satisfying maxInPartition and minInPartition. Only supported with `_all_docs` endpoint.
 jsonstore.rdd.maxInPartition|-1|the max rows in a partition. -1 means unlimited
 jsonstore.rdd.minInPartition|10|the min rows in a partition.
-jsonstore.rdd.requestTimeout|900000| the request timeout in milliseconds
-bulkSize|200| the bulk save size
-schemaSampleSize| "-1" | the sample size for RDD schema discovery. 1 means we 
are using only first document for schema discovery; -1 means all documents; 0 
will be treated as 1; any number N means min(N, total) docs 
-createDBOnSave|"false"| whether to create a new database during save 
operation. If false, a database should already exist. If true, a new database 
will be created. If true, and a database with a provided name already exists, 
an error will be raised. 
+jsonstore.rdd.requestTimeout|900000|the request timeout in milliseconds
+bulkSize|200|the bulk save size
+schemaSampleSize|-1|the sample size for RDD schema discovery. 1 means we are 
using only the first document for schema discovery; -1 means all documents; 0 
will be treated as 1; any number N means min(N, total) docs. Only supported 
with `_all_docs` endpoint.
+createDBOnSave|false|whether to create a new database during save operation. 
If false, a database should already exist. If true, a new database will be 
created. If true, and a database with a provided name already exists, an error 
will be raised. 
+
+The `cloudant.endpoint` option sets the `_changes` or `_all_docs` API endpoint to be called while loading Cloudant data into Spark DataFrames or SQL tables.
+
+**Note:** When using `_changes` API, please consider:
+1. Results are partially ordered and may not be presented in the order in
+which documents were updated.
+2. In case of shards' unavailability, you may see duplicate results (changes
+that have already been seen).
+3. The `selector` option can be used to filter Cloudant docs during load.
+4. Provides a real snapshot of the database, representing it at a single
+point in time.
+5. Only supports a single partition.
+
+
+When using `_all_docs` API:
+1. Supports parallel reads (using offset and range) and partitioning.
+2. Using partitions may not represent a true snapshot of the database. Some
+   docs may be added or deleted in the database between loading data into
+   different Spark partitions.
+
+If loading Cloudant docs from a database greater than 100 MB, set 
`cloudant.endpoint` to `_changes` and `spark.streaming.unpersist` to `false`.
+This will enable RDD persistence during load against `_changes` endpoint and 
allow the persisted RDDs to be accessible after streaming completes.  
+ 
+See [CloudantChangesDFSuite](src/test/scala/org/apache/bahir/cloudant/CloudantChangesDFSuite.scala)
+for examples of loading data into a Spark DataFrame with `_changes` API.
 
 ### Configuration on Spark SQL Temporary Table or DataFrame
 
@@ -104,13 +131,14 @@ Besides all the configurations passed to a temporary 
table or dataframe through
 
 Name | Default | Meaning
 --- |:---:| ---
-database||cloudant database name
-view||cloudant view w/o the database name. only used for load.
-index||cloudant search index w/o the database name. only used for load data 
with less than or equal to 200 results.
-path||cloudant: as database name if database is not present
-schemaSampleSize|"-1"| the sample size used to discover the schema for this 
temp table. -1 scans all documents
 bulkSize|200| the bulk save size
-createDBOnSave|"false"| whether to create a new database during save 
operation. If false, a database should already exist. If true, a new database 
will be created. If true, and a database with a provided name already exists, 
an error will be raised. 
+createDBOnSave|false| whether to create a new database during save operation. 
If false, a database should already exist. If true, a new database will be 
created. If true, and a database with a provided name already exists, an error 
will be raised. 
+database| | Cloudant database name
+index| | Cloudant Search index without the database name. Search index queries are limited to returning 200 results, so an index can only be used to load data with at most 200 results.
+path| | Cloudant: as database name if database is not present
+schemaSampleSize|-1| the sample size used to discover the schema for this temp 
table. -1 scans all documents
+selector|all documents| a selector written in Cloudant Query syntax, 
specifying conditions for selecting documents when the `cloudant.endpoint` 
option is set to `_changes`. Only documents satisfying the selector's 
conditions will be retrieved from Cloudant and loaded into Spark.
+view| | Cloudant view w/o the database name. Only used for load.
 
 For fast loading, views are loaded without include_docs. Thus, a derived schema will always be: `{id, key, value}`, where `value` can be a compound field. An example of loading data from a view:
 
@@ -129,7 +157,6 @@ cloudant.password||cloudant password
 database||cloudant database name
 selector| all documents| a selector written in Cloudant Query syntax, 
specifying conditions for selecting documents. Only documents satisfying the 
selector's conditions will be retrieved from Cloudant and loaded into Spark.
 
-
 ### Configuration in spark-submit using --conf option
 
 The above stated configuration keys can also be set using the `spark-submit --conf` option. When passing configuration in spark-submit, make sure to add "spark." as a prefix to the keys.
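Two of the documented behaviors above lend themselves to a compact sketch: the `schemaSampleSize` rules (1 means first document only, -1 means all documents, 0 is treated as 1, N means min(N, total)) and the configuration precedence, where each subsequent source overrides the previous one and `spark-submit --conf` keys carry a "spark." prefix. The following plain-Python helpers are illustrative only and are not part of the connector; all names are hypothetical:

```python
def effective_sample_size(schema_sample_size, total_docs):
    """Number of documents scanned for schema discovery.

    Mirrors the documented rules: -1 scans all documents, 0 is
    treated as 1, and any positive N scans min(N, total_docs).
    """
    if schema_sample_size == -1:
        return total_docs
    if schema_sample_size == 0:
        return min(1, total_docs)
    return min(schema_sample_size, total_docs)


def resolve_config(defaults, spark_conf, table_options):
    """Merge configuration sources in the documented order:
    application.conf defaults, then SparkConf (where keys set via
    spark-submit --conf carry a "spark." prefix that is stripped),
    then per-table/DataFrame options. Later sources win."""
    merged = dict(defaults)
    for key, value in spark_conf.items():
        merged[key.removeprefix("spark.")] = value
    merged.update(table_options)
    return merged
```

For example, a `spark.cloudant.protocol` key passed on the command line would override the `cloudant.protocol` default from application.conf.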

http://git-wip-us.apache.org/repos/asf/bahir-website/blob/0eb334b8/site/docs/spark/current/spark-sql-streaming-mqtt.md
----------------------------------------------------------------------
diff --git a/site/docs/spark/current/spark-sql-streaming-mqtt.md 
b/site/docs/spark/current/spark-sql-streaming-mqtt.md
index fa533d1..98632df 100644
--- a/site/docs/spark/current/spark-sql-streaming-mqtt.md
+++ b/site/docs/spark/current/spark-sql-streaming-mqtt.md
@@ -25,7 +25,7 @@ limitations under the License.
 
 {% include JB/setup %}
 
-A library for reading data from MQTT Servers using Spark SQL Streaming ( or 
Structured streaming.).
+A library for reading data from MQTT servers using Spark SQL Streaming (or Structured Streaming).
 
 ## Linking
 
@@ -89,7 +89,7 @@ This source uses [Eclipse Paho Java 
Client](https://eclipse.org/paho/clients/jav
 
 ### Scala API
 
-An example, for scala API to count words from incoming message stream.
+An example, using the Scala API, to count words from an incoming message stream.
 
     // Create DataFrame representing the stream of input lines from connection 
to mqtt server
     val lines = spark.readStream
@@ -115,7 +115,7 @@ Please see `MQTTStreamWordCount.scala` for full example.
 
 ### Java API
 
-An example, for Java API to count words from incoming message stream.
+An example, using the Java API, to count words from an incoming message stream.
 
     // Create DataFrame representing the stream of input lines from connection 
to mqtt server.
     Dataset<String> lines = spark
@@ -144,3 +144,4 @@ An example, for Java API to count words from incoming 
message stream.
     query.awaitTermination();
 
 Please see `JavaMQTTStreamWordCount.java` for full example.
+
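The Scala and Java word-count examples in this file apply the same transformation: split each incoming message line into words and tally the occurrences. As a plain-Python sketch of that logic, outside Spark and purely for illustration:

```python
from collections import Counter


def count_words(lines):
    """Split each message line into words and tally them,
    mirroring the split/groupBy/count in the streaming examples."""
    return Counter(word for line in lines for word in line.split())
```

Feeding it two sample messages such as `["hello mqtt", "hello spark"]` tallies `hello` twice and the other words once, which is the aggregation the streaming query maintains incrementally.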

http://git-wip-us.apache.org/repos/asf/bahir-website/blob/0eb334b8/site/docs/spark/current/spark-streaming-akka.md
----------------------------------------------------------------------
diff --git a/site/docs/spark/current/spark-streaming-akka.md 
b/site/docs/spark/current/spark-streaming-akka.md
index 41cc57e..0ede902 100644
--- a/site/docs/spark/current/spark-streaming-akka.md
+++ b/site/docs/spark/current/spark-streaming-akka.md
@@ -25,7 +25,7 @@ limitations under the License.
 
 {% include JB/setup %}
 
-A library for reading data from Akka Actors using Spark Streaming.
+A library for reading data from Akka Actors using Spark Streaming. 
 
 ## Linking
 

http://git-wip-us.apache.org/repos/asf/bahir-website/blob/0eb334b8/site/docs/spark/current/spark-streaming-mqtt.md
----------------------------------------------------------------------
diff --git a/site/docs/spark/current/spark-streaming-mqtt.md 
b/site/docs/spark/current/spark-streaming-mqtt.md
index 42b8704..7166f02 100644
--- a/site/docs/spark/current/spark-streaming-mqtt.md
+++ b/site/docs/spark/current/spark-streaming-mqtt.md
@@ -26,7 +26,7 @@ limitations under the License.
 {% include JB/setup %}
 
 
-[MQTT](http://mqtt.org/) is MQTT is a machine-to-machine (M2M)/"Internet of 
Things" connectivity protocol. It was designed as an extremely lightweight 
publish/subscribe messaging transport. It is useful for connections with remote 
locations where a small code footprint is required and/or network bandwidth is 
at a premium.
+[MQTT](http://mqtt.org/) is a machine-to-machine (M2M)/"Internet of Things" connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport. It is useful for connections with remote locations where a small code footprint is required and/or network bandwidth is at a premium.
 
 ## Linking
 
@@ -79,12 +79,14 @@ this actor can be configured to handle failures, etc.
 
     val lines = MQTTUtils.createStream(ssc, brokerUrl, topic)
     val lines = MQTTUtils.createPairedStream(ssc, brokerUrl, topic)
+    val lines = MQTTUtils.createPairedByteArrayStream(ssc, brokerUrl, topic)
 
 Additional mqtt connection options can be provided:
 
 ```Scala
val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, storageLevel, clientId, username, password, cleanSession, qos, connectionTimeout, keepAliveInterval, mqttVersion)
val lines = MQTTUtils.createPairedStream(ssc, brokerUrl, topics, storageLevel, clientId, username, password, cleanSession, qos, connectionTimeout, keepAliveInterval, mqttVersion)
+val lines = MQTTUtils.createPairedByteArrayStream(ssc, brokerUrl, topics, storageLevel, clientId, username, password, cleanSession, qos, connectionTimeout, keepAliveInterval, mqttVersion)
 ```
 
 ### Java API
@@ -94,5 +96,21 @@ this actor can be configured to handle failures, etc.
 
     JavaDStream<String> lines = MQTTUtils.createStream(jssc, brokerUrl, topic);
    JavaReceiverInputDStream<Tuple2<String, String>> lines = MQTTUtils.createPairedStream(jssc, brokerUrl, topics);
+    JavaReceiverInputDStream<Tuple2<String, String>> lines = MQTTUtils.createPairedByteArrayStream(jssc, brokerUrl, topics);
 
 See end-to-end examples at [MQTT 
Examples](https://github.com/apache/bahir/tree/master/streaming-mqtt/examples)
+
+
+### Python API
+
+Create a DStream from a single topic.
+
+```Python
+       MQTTUtils.createStream(ssc, broker_url, topic)
+```
+
+Create a DStream from a list of topics.
+
+```Python
+       MQTTUtils.createPairedStream(ssc, broker_url, topics)
+```
\ No newline at end of file
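The difference between the single-topic and paired variants above is the record shape: `createStream` yields message payloads only, while `createPairedStream` yields (topic, message) pairs so records from several topics stay distinguishable. A plain-Python sketch of that difference, for illustration only (`records` stands in for messages arriving from the broker; these helpers are hypothetical, not part of the library):

```python
def single_topic_stream(records, topic):
    """Payloads only, as with createStream on a single topic."""
    return [payload for t, payload in records if t == topic]


def paired_stream(records, topics):
    """(topic, payload) pairs, as with createPairedStream on a
    list of topics."""
    wanted = set(topics)
    return [(t, payload) for t, payload in records if t in wanted]
```

With records from topics `a` and `b`, the single-topic form drops the topic names, while the paired form keeps them so downstream code can branch per topic.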

http://git-wip-us.apache.org/repos/asf/bahir-website/blob/0eb334b8/site/docs/spark/current/spark-streaming-pubsub.md
----------------------------------------------------------------------
diff --git a/site/docs/spark/current/spark-streaming-pubsub.md 
b/site/docs/spark/current/spark-streaming-pubsub.md
index a448c94..83f2532 100644
--- a/site/docs/spark/current/spark-streaming-pubsub.md
+++ b/site/docs/spark/current/spark-streaming-pubsub.md
@@ -69,3 +69,28 @@ First you need to create credential by SparkGCPCredentials, 
it support four type
     JavaDStream<SparkPubsubMessage> lines = PubsubUtils.createStream(jssc, 
projectId, subscriptionName, credential...) 
 
 See end-to-end examples at [Google Cloud Pubsub 
Examples](streaming-pubsub/examples)
+
+### Unit Test
+
+To run the PubSub test cases, you need to generate **Google API service account key files** and set the corresponding environment variables to enable the tests.
+
+#### To generate a service account key file with PubSub permission
+
+1. Go to [Google API Console](https://console.cloud.google.com)
+2. Choose the `Credentials` tab > `Create credentials` button > `Service account key`
+3. Fill in the account name, assign `Role > Pub/Sub > Pub/Sub Editor`, and check the option `Furnish a private key` to create one. You need to create one key file in JSON format and another in P12 format.
+4. The account email is the `Service account ID`
+
+#### Set the environment variables and run the tests
+
+```
+mvn clean package -DskipTests -pl streaming-pubsub
+
+export ENABLE_PUBSUB_TESTS=1
+export GCP_TEST_ACCOUNT="THE_P12_SERVICE_ACCOUNT_ID_MENTIONED_ABOVE"
+export GCP_TEST_PROJECT_ID="YOUR_GCP_PROJECT_ID"
+export GCP_TEST_JSON_KEY_PATH=/path/to/pubsub/credential/files/Apache-Bahir-PubSub-1234abcd.json
+export GCP_TEST_P12_KEY_PATH=/path/to/pubsub/credential/files/Apache-Bahir-PubSub-5678efgh.p12
+
+mvn test -pl streaming-pubsub
+```
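Since the tests only run when every variable listed above is set, a small pre-flight check can save a confusing `mvn test` run. A minimal sketch (a hypothetical helper, not part of the project):

```python
import os

# Environment variables required by the PubSub test setup above.
REQUIRED_VARS = [
    "ENABLE_PUBSUB_TESTS",
    "GCP_TEST_ACCOUNT",
    "GCP_TEST_PROJECT_ID",
    "GCP_TEST_JSON_KEY_PATH",
    "GCP_TEST_P12_KEY_PATH",
]


def missing_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Running `missing_vars()` before `mvn test -pl streaming-pubsub` reports exactly which exports are still needed.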

http://git-wip-us.apache.org/repos/asf/bahir-website/blob/0eb334b8/site/docs/spark/current/spark-streaming-twitter.md
----------------------------------------------------------------------
diff --git a/site/docs/spark/current/spark-streaming-twitter.md 
b/site/docs/spark/current/spark-streaming-twitter.md
index 616fd2a..3efc7f5 100644
--- a/site/docs/spark/current/spark-streaming-twitter.md
+++ b/site/docs/spark/current/spark-streaming-twitter.md
@@ -25,7 +25,7 @@ limitations under the License.
 
 {% include JB/setup %}
 
-A library for reading social data from [twitter](http://twitter.com/) using 
Spark Streaming.
+A library for reading social data from [twitter](http://twitter.com/) using 
Spark Streaming. 
 
 ## Linking
 
@@ -70,5 +70,5 @@ can be provided by any of the 
[methods](http://twitter4j.org/en/configuration.ht
     TwitterUtils.createStream(jssc);
 
 
-You can also either get the public stream, or get the filtered stream based on 
keywords.
-See end-to-end examples at [Twitter 
Examples](https://github.com/apache/bahir/tree/master/streaming-twitter/examples)
+You can also either get the public stream, or get the filtered stream based on 
keywords. 
+See end-to-end examples at [Twitter 
Examples](https://github.com/apache/bahir/tree/master/streaming-twitter/examples)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/bahir-website/blob/0eb334b8/site/docs/spark/current/spark-streaming-zeromq.md
----------------------------------------------------------------------
diff --git a/site/docs/spark/current/spark-streaming-zeromq.md 
b/site/docs/spark/current/spark-streaming-zeromq.md
index 4543d90..e826b5a 100644
--- a/site/docs/spark/current/spark-streaming-zeromq.md
+++ b/site/docs/spark/current/spark-streaming-zeromq.md
@@ -25,7 +25,7 @@ limitations under the License.
 
 {% include JB/setup %}
 
-A library for reading data from [ZeroMQ](http://zeromq.org/) using Spark 
Streaming.
+A library for reading data from [ZeroMQ](http://zeromq.org/) using Spark 
Streaming. 
 
 ## Linking
 
@@ -62,4 +62,4 @@ This library is cross-published for Scala 2.10 and Scala 
2.11, so users should r
 
     JavaDStream<String> lines = ZeroMQUtils.createStream(jssc, ...);
 
-See end-to-end examples at [ZeroMQ 
Examples](https://github.com/apache/bahir/tree/master/streaming-zeromq/examples)
+See end-to-end examples at [ZeroMQ 
Examples](https://github.com/apache/bahir/tree/master/streaming-zeromq/examples)
\ No newline at end of file
