Repository: bahir-website Updated Branches: refs/heads/master 7b626696c -> 87eff6192
http://git-wip-us.apache.org/repos/asf/bahir-website/blob/87eff619/site/docs/spark/2.3.2/spark-streaming-mqtt.md
----------------------------------------------------------------------
diff --git a/site/docs/spark/2.3.2/spark-streaming-mqtt.md b/site/docs/spark/2.3.2/spark-streaming-mqtt.md
new file mode 100644
index 0000000..4cd49fd
--- /dev/null
+++ b/site/docs/spark/2.3.2/spark-streaming-mqtt.md
@@ -0,0 +1,116 @@
+---
+layout: page
+title: Spark Streaming MQTT
+description: Spark Streaming MQTT
+group: nav-right
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+{% include JB/setup %}
+
+[MQTT](http://mqtt.org/) is a machine-to-machine (M2M)/"Internet of Things" connectivity protocol. It was designed as an extremely lightweight publish/subscribe messaging transport. It is useful for connections with remote locations where a small code footprint is required and/or network bandwidth is at a premium.
+
+## Linking
+
+Using SBT:
+
+    libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.3.2"
+
+Using Maven:
+
+    <dependency>
+        <groupId>org.apache.bahir</groupId>
+        <artifactId>spark-streaming-mqtt_2.11</artifactId>
+        <version>2.3.2</version>
+    </dependency>
+
+This library can also be added to Spark jobs launched through `spark-shell` or `spark-submit` by using the `--packages` command line option.
+For example, to include it when starting the spark shell:
+
+    $ bin/spark-shell --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.3.2
+
+Unlike using `--jars`, using `--packages` ensures that this library and its dependencies will be added to the classpath.
+The `--packages` argument can also be used with `bin/spark-submit`.
+
+This library is cross-published for Scala 2.10 and Scala 2.11, so users should substitute the proper Scala version (2.10 or 2.11) in the commands listed above.
+
+## Configuration options
+
+This source uses the [Eclipse Paho Java Client](https://eclipse.org/paho/clients/java/). Client API documentation is located [here](http://www.eclipse.org/paho/files/javadoc/index.html).
+
+ * `brokerUrl` The URL the MqttClient connects to. Set this to the URL of the MQTT server, e.g. `tcp://localhost:1883`.
+ * `storageLevel` Storage level for incoming messages. By default messages are stored on disk.
+ * `topic` Topic the MqttClient subscribes to.
+ * `topics` List of topics the MqttClient subscribes to.
+ * `clientId` The client ID this client is associated with. Provide the same value to recover a stopped client.
+ * `QoS` The maximum quality of service to subscribe each topic at. Messages published at a lower quality of service will be received at the published QoS. Messages published at a higher quality of service will be received using the QoS specified on the subscribe.
+ * `username` Sets the user name for the connection to the MQTT server. Do not set it if the server does not require it; setting it to an empty value leads to errors.
+ * `password` Sets the password to use for the connection.
+ * `cleanSession` Setting it to true starts a clean session, removing all messages checkpointed by a previous run of this source. This is set to false by default.
+ * `connectionTimeout` Sets the connection timeout; a value of 0 is interpreted as wait until the client connects. See `MqttConnectOptions.setConnectionTimeout` for more information.
+ * `keepAlive` Same as `MqttConnectOptions.setKeepAliveInterval`.
+ * `mqttVersion` Same as `MqttConnectOptions.setMqttVersion`.
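+
+As a quick orientation, the defaults above are enough for a minimal word-count job. The following is a sketch only: the broker URL, topic, and application name are placeholder values.
+
+```Scala
+import org.apache.spark.SparkConf
+import org.apache.spark.streaming.{Seconds, StreamingContext}
+import org.apache.spark.streaming.mqtt.MQTTUtils
+
+// Count words arriving on a single MQTT topic in 2-second batches.
+val sparkConf = new SparkConf().setAppName("MQTTWordCount")
+val ssc = new StreamingContext(sparkConf, Seconds(2))
+val lines = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "sensor/readings")
+lines.flatMap(_.split(" ")).countByValue().print()
+ssc.start()
+ssc.awaitTermination()
+```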
+
+## Examples
+
+### Scala API
+
+You can import the `MQTTUtils` class and create a DStream as follows:
+
+    val lines = MQTTUtils.createStream(ssc, brokerUrl, topic)
+    val lines = MQTTUtils.createPairedStream(ssc, brokerUrl, topics)
+    val lines = MQTTUtils.createPairedByteArrayStream(ssc, brokerUrl, topics)
+
+Additional MQTT connection options can be provided:
+
+```Scala
+val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, storageLevel, clientId, username, password, cleanSession, qos, connectionTimeout, keepAliveInterval, mqttVersion)
+val lines = MQTTUtils.createPairedStream(ssc, brokerUrl, topics, storageLevel, clientId, username, password, cleanSession, qos, connectionTimeout, keepAliveInterval, mqttVersion)
+val lines = MQTTUtils.createPairedByteArrayStream(ssc, brokerUrl, topics, storageLevel, clientId, username, password, cleanSession, qos, connectionTimeout, keepAliveInterval, mqttVersion)
+```
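+
+The paired variants yield `(topic, message)` tuples, which is convenient when subscribing to several topics at once. A short sketch, reusing the placeholder names above and assuming each element is a `(topic, payload)` pair as the signatures suggest:
+
+```Scala
+// Count messages per topic in each batch.
+val pairs = MQTTUtils.createPairedStream(ssc, brokerUrl, topics)
+pairs.map { case (topic, _) => topic }.countByValue().print()
+```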
+
+### Java API
+
+You can import the `MQTTUtils` class and create a DStream as follows:
+
+    JavaDStream<String> lines = MQTTUtils.createStream(jssc, brokerUrl, topic);
+    JavaReceiverInputDStream<Tuple2<String, String>> lines = MQTTUtils.createPairedStream(jssc, brokerUrl, topics);
+    JavaReceiverInputDStream<Tuple2<String, String>> lines = MQTTUtils.createPairedByteArrayStream(jssc, brokerUrl, topics);
+
+See end-to-end examples at [MQTT Examples](https://github.com/apache/bahir/tree/master/streaming-mqtt/examples).
+
+### Python API
+
+Create a DStream from a single topic.
+
+```Python
+MQTTUtils.createStream(ssc, broker_url, topic)
+```
+
+Create a DStream from a list of topics.
+
+```Python
+MQTTUtils.createPairedStream(ssc, broker_url, topics)
+```
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/bahir-website/blob/87eff619/site/docs/spark/2.3.2/spark-streaming-pubnub.md
----------------------------------------------------------------------
diff --git a/site/docs/spark/2.3.2/spark-streaming-pubnub.md b/site/docs/spark/2.3.2/spark-streaming-pubnub.md
new file mode 100644
index 0000000..7901044
--- /dev/null
+++ b/site/docs/spark/2.3.2/spark-streaming-pubnub.md
@@ -0,0 +1,103 @@
+---
+layout: page
+title: Spark Streaming PubNub
+description: Spark Streaming PubNub
+group: nav-right
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+{% include JB/setup %}
+# Spark Streaming PubNub Connector
+
+A library for reading data from the real-time messaging infrastructure [PubNub](https://www.pubnub.com/) using Spark Streaming.
+
+## Linking
+
+Using SBT:
+
+    libraryDependencies += "org.apache.bahir" %% "spark-streaming-pubnub" % "2.3.2"
+
+Using Maven:
+
+    <dependency>
+        <groupId>org.apache.bahir</groupId>
+        <artifactId>spark-streaming-pubnub_2.11</artifactId>
+        <version>2.3.2</version>
+    </dependency>
+
+This library can also be added to Spark jobs launched through `spark-shell` or `spark-submit` by using the `--packages` command line option.
+For example, to include it when starting the spark shell:
+
+    $ bin/spark-shell --packages org.apache.bahir:spark-streaming-pubnub_2.11:2.3.2
+
+Unlike using `--jars`, using `--packages` ensures that this library and its dependencies will be added to the classpath.
+The `--packages` argument can also be used with `bin/spark-submit`.
+
+## Examples
+
+The connector leverages the official Java client for the PubNub cloud infrastructure. You can import the `PubNubUtils`
+class and create an input stream by calling `PubNubUtils.createStream()` as shown below. Security and performance related
+features are configured on the standard `PNConfiguration` object. We advise configuring a reconnection policy so that
+temporary network outages do not interrupt the processing job. Users may subscribe to multiple channels and channel groups,
+as well as specify a time token to start receiving messages from a given point in time.
+
+For complete code examples, please review the _examples_ directory.
+
+### Scala API
+
+    import com.pubnub.api.PNConfiguration
+    import com.pubnub.api.enums.PNReconnectionPolicy
+
+    import org.apache.spark.storage.StorageLevel
+    import org.apache.spark.streaming.dstream.ReceiverInputDStream
+    import org.apache.spark.streaming.pubnub.{PubNubUtils, SparkPubNubMessage}
+
+    val config = new PNConfiguration
+    config.setSubscribeKey(subscribeKey)
+    config.setSecure(true)
+    config.setReconnectionPolicy(PNReconnectionPolicy.LINEAR)
+    val channel = "my-channel"
+
+    val pubNubStream: ReceiverInputDStream[SparkPubNubMessage] = PubNubUtils.createStream(
+      ssc, config, Seq(channel), Seq(), None, StorageLevel.MEMORY_AND_DISK_SER_2
+    )
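+
+Once the stream is created, regular DStream operations apply. A brief sketch; note that the `getPayload` accessor on `SparkPubNubMessage` is an assumption here, so check the connector's API documentation:
+
+    // Print incoming message payloads (getPayload is assumed, not confirmed).
+    pubNubStream.map(message => message.getPayload).print()
+    ssc.start()
+    ssc.awaitTermination()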
+
+### Java API
+
+    import java.util.Collections;
+    import java.util.HashSet;
+    import java.util.Set;
+
+    import com.pubnub.api.PNConfiguration;
+    import com.pubnub.api.enums.PNReconnectionPolicy;
+
+    import org.apache.spark.storage.StorageLevel;
+    import org.apache.spark.streaming.dstream.ReceiverInputDStream;
+    import org.apache.spark.streaming.pubnub.PubNubUtils;
+    import org.apache.spark.streaming.pubnub.SparkPubNubMessage;
+
+    PNConfiguration config = new PNConfiguration();
+    config.setSubscribeKey(subscribeKey);
+    config.setSecure(true);
+    config.setReconnectionPolicy(PNReconnectionPolicy.LINEAR);
+    Set<String> channels = new HashSet<String>() {{
+        add("my-channel");
+    }};
+
+    ReceiverInputDStream<SparkPubNubMessage> pubNubStream = PubNubUtils.createStream(
+        ssc, config, channels, Collections.EMPTY_SET, null,
+        StorageLevel.MEMORY_AND_DISK_SER_2()
+    );
+
+## Unit Test
+
+Unit tests take advantage of the publicly available _demo_ subscription and publish key, which has a limited request rate.
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/bahir-website/blob/87eff619/site/docs/spark/2.3.2/spark-streaming-pubsub.md
----------------------------------------------------------------------
diff --git a/site/docs/spark/2.3.2/spark-streaming-pubsub.md b/site/docs/spark/2.3.2/spark-streaming-pubsub.md
new file mode 100644
index 0000000..3b8343f
--- /dev/null
+++ b/site/docs/spark/2.3.2/spark-streaming-pubsub.md
@@ -0,0 +1,96 @@
+---
+layout: page
+title: Spark Streaming Google Pub/Sub
+description: Spark Streaming Google Pub/Sub
+group: nav-right
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+{% include JB/setup %}
+A library for reading data from [Google Cloud Pub/Sub](https://cloud.google.com/pubsub/) using Spark Streaming.
+
+## Linking
+
+Using SBT:
+
+    libraryDependencies += "org.apache.bahir" %% "spark-streaming-pubsub" % "2.3.2"
+
+Using Maven:
+
+    <dependency>
+        <groupId>org.apache.bahir</groupId>
+        <artifactId>spark-streaming-pubsub_2.11</artifactId>
+        <version>2.3.2</version>
+    </dependency>
+
+This library can also be added to Spark jobs launched through `spark-shell` or `spark-submit` by using the `--packages` command line option.
+For example, to include it when starting the spark shell:
+
+    $ bin/spark-shell --packages org.apache.bahir:spark-streaming-pubsub_2.11:2.3.2
+
+Unlike using `--jars`, using `--packages` ensures that this library and its dependencies will be added to the classpath.
+The `--packages` argument can also be used with `bin/spark-submit`.
+
+## Examples
+
+First you need to create credentials using `SparkGCPCredentials`. Four types of credentials are supported (a combined sketch follows the API snippets below):
+
+* application default
+  `SparkGCPCredentials.builder.build()`
+* JSON service account
+  `SparkGCPCredentials.builder.jsonServiceAccount(PATH_TO_JSON_KEY).build()`
+* P12 service account
+  `SparkGCPCredentials.builder.p12ServiceAccount(PATH_TO_P12_KEY, EMAIL_ACCOUNT).build()`
+* metadata service account (when running on Dataproc)
+  `SparkGCPCredentials.builder.metadataServiceAccount().build()`
+
+### Scala API
+
+    val lines = PubsubUtils.createStream(ssc, projectId, subscriptionName, credential, ..)
+
+### Java API
+
+    JavaDStream<SparkPubsubMessage> lines = PubsubUtils.createStream(jssc, projectId, subscriptionName, credential...)
+
+See end-to-end examples at [Google Cloud Pubsub Examples](https://github.com/apache/bahir/tree/master/streaming-pubsub/examples).
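+
+As referenced above, here is how the pieces can fit together in Scala. This is a sketch under assumptions: it uses the JSON service account credential type from the list above, guesses a trailing storage-level argument in place of the elided `..`, and uses placeholder project, subscription, and key-path values.
+
+    import org.apache.spark.storage.StorageLevel
+    import org.apache.spark.streaming.pubsub.{PubsubUtils, SparkGCPCredentials}
+
+    // Build credentials from a JSON service account key (path is a placeholder).
+    val credential = SparkGCPCredentials.builder
+      .jsonServiceAccount("/path/to/key.json")
+      .build()
+
+    val lines = PubsubUtils.createStream(
+      ssc, "my-gcp-project", "my-subscription", credential,
+      StorageLevel.MEMORY_AND_DISK_SER_2)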
+
+### Unit Test
+
+To run the PubSub test cases, you need to generate **Google API service account key files** and set the corresponding environment variables to enable the tests.
+
+#### To generate a service account key file with PubSub permission
+
+1. Go to the [Google API Console](https://console.cloud.google.com)
+2. Choose the `Credentials` tab > `Create credentials` button > `Service account key`
+3. Fill in the account name, assign the role `Pub/Sub > Pub/Sub Editor`, and check the option `Furnish a private key` to create one. You need to create one key in JSON format and another in P12 format.
+4. The account email is the `Service account ID`.
+
+#### Setting the environment variables and running the tests
+
+```
+mvn clean package -DskipTests -pl streaming-pubsub
+
+export ENABLE_PUBSUB_TESTS=1
+export GCP_TEST_ACCOUNT="THE_P12_SERVICE_ACCOUNT_ID_MENTIONED_ABOVE"
+export GCP_TEST_PROJECT_ID="YOUR_GCP_PROJECT_ID"
+export GCP_TEST_JSON_KEY_PATH=/path/to/pubsub/credential/files/Apache-Bahir-PubSub-1234abcd.json
+export GCP_TEST_P12_KEY_PATH=/path/to/pubsub/credential/files/Apache-Bahir-PubSub-5678efgh.p12
+
+mvn test -pl streaming-pubsub
+```
http://git-wip-us.apache.org/repos/asf/bahir-website/blob/87eff619/site/docs/spark/2.3.2/spark-streaming-twitter.md
----------------------------------------------------------------------
diff --git a/site/docs/spark/2.3.2/spark-streaming-twitter.md b/site/docs/spark/2.3.2/spark-streaming-twitter.md
new file mode 100644
index 0000000..348d55e
--- /dev/null
+++ b/site/docs/spark/2.3.2/spark-streaming-twitter.md
@@ -0,0 +1,74 @@
+---
+layout: page
+title: Spark Streaming Twitter
+description: Spark Streaming Twitter
+group: nav-right
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+{% include JB/setup %}
+
+A library for reading social data from [Twitter](http://twitter.com/) using Spark Streaming.
+
+## Linking
+
+Using SBT:
+
+    libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.3.2"
+
+Using Maven:
+
+    <dependency>
+        <groupId>org.apache.bahir</groupId>
+        <artifactId>spark-streaming-twitter_2.11</artifactId>
+        <version>2.3.2</version>
+    </dependency>
+
+This library can also be added to Spark jobs launched through `spark-shell` or `spark-submit` by using the `--packages` command line option.
+For example, to include it when starting the spark shell:
+
+    $ bin/spark-shell --packages org.apache.bahir:spark-streaming-twitter_2.11:2.3.2
+
+Unlike using `--jars`, using `--packages` ensures that this library and its dependencies will be added to the classpath.
+The `--packages` argument can also be used with `bin/spark-submit`.
+
+This library is cross-published for Scala 2.10 and Scala 2.11, so users should substitute the proper Scala version (2.10 or 2.11) in the commands listed above.
+
+## Examples
+
+`TwitterUtils` uses Twitter4j to get the public stream of tweets using [Twitter's Streaming API](https://dev.twitter.com/docs/streaming-apis). Authentication information
+can be provided by any of the [methods](http://twitter4j.org/en/configuration.html) supported by the Twitter4J library. You can import the `TwitterUtils` class and create a DStream with `TwitterUtils.createStream` as shown below.
+
+### Scala API
+
+    import org.apache.spark.streaming.twitter._
+
+    TwitterUtils.createStream(ssc, None)
+
+### Java API
+
+    import org.apache.spark.streaming.twitter.*;
+
+    TwitterUtils.createStream(jssc);
+
+You can get either the full public stream or a stream filtered by keywords, as sketched below.
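+
+A short sketch of the filtered variant; the keywords are placeholders, and Twitter4J credentials are assumed to be configured through one of the methods linked above:
+
+    import org.apache.spark.streaming.twitter._
+
+    // Filter the public stream by keywords and count hashtags per batch.
+    val tweets = TwitterUtils.createStream(ssc, None, Seq("spark", "bahir"))
+    val hashTags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
+    hashTags.countByValue().print()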
+
+See end-to-end examples at [Twitter Examples](https://github.com/apache/bahir/tree/master/streaming-twitter/examples).
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/bahir-website/blob/87eff619/site/docs/spark/2.3.2/spark-streaming-zeromq.md
----------------------------------------------------------------------
diff --git a/site/docs/spark/2.3.2/spark-streaming-zeromq.md b/site/docs/spark/2.3.2/spark-streaming-zeromq.md
new file mode 100644
index 0000000..530a54c
--- /dev/null
+++ b/site/docs/spark/2.3.2/spark-streaming-zeromq.md
@@ -0,0 +1,76 @@
+---
+layout: page
+title: Spark Streaming ZeroMQ
+description: Spark Streaming ZeroMQ
+group: nav-right
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+{% include JB/setup %}
+# Spark Streaming ZeroMQ Connector
+
+A library for reading data from [ZeroMQ](http://zeromq.org/) using Spark Streaming.
+
+## Linking
+
+Using SBT:
+
+    libraryDependencies += "org.apache.bahir" %% "spark-streaming-zeromq" % "2.3.2"
+
+Using Maven:
+
+    <dependency>
+        <groupId>org.apache.bahir</groupId>
+        <artifactId>spark-streaming-zeromq_2.11</artifactId>
+        <version>2.3.2</version>
+    </dependency>
+
+This library can also be added to Spark jobs launched through `spark-shell` or `spark-submit` by using the `--packages` command line option.
+For example, to include it when starting the spark shell:
+
+    $ bin/spark-shell --packages org.apache.bahir:spark-streaming-zeromq_2.11:2.3.2
+
+Unlike using `--jars`, using `--packages` ensures that this library and its dependencies will be added to the classpath.
+The `--packages` argument can also be used with `bin/spark-submit`.
+
+This library is cross-published for Scala 2.10 and Scala 2.11, so users should substitute the proper Scala version (2.10 or 2.11) in the commands listed above.
+
+## Examples
+
+Review end-to-end examples at [ZeroMQ Examples](https://github.com/apache/bahir/tree/master/streaming-zeromq/examples).
+
+### Scala API
+
+    import org.apache.spark.streaming.zeromq.ZeroMQUtils
+
+    val lines = ZeroMQUtils.createTextStream(
+      ssc, "tcp://server:5555", true, Seq("my-topic".getBytes)
+    )
+
+### Java API
+
+    import java.util.Arrays;
+
+    import org.apache.spark.storage.StorageLevel;
+    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
+    import org.apache.spark.streaming.zeromq.ZeroMQUtils;
+
+    JavaReceiverInputDStream<String> test1 = ZeroMQUtils.createJavaStream(
+        ssc, "tcp://server:5555", true, Arrays.asList("my-topic".getBytes()),
+        StorageLevel.MEMORY_AND_DISK_SER_2()
+    );
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/bahir-website/blob/87eff619/site/docs/spark/overview.md
----------------------------------------------------------------------
diff --git a/site/docs/spark/overview.md b/site/docs/spark/overview.md
index 4a2b6a6..23d95f3 100644
--- a/site/docs/spark/overview.md
+++ b/site/docs/spark/overview.md
@@ -28,6 +28,9 @@ limitations under the License.
 ### Apache Bahir Extensions for Apache Spark
 
 - [Current - 2.3.0-SNAPSHOT](/docs/spark/current/documentation)
+- [2.3.2](/docs/spark/2.3.2/documentation)
+- [2.3.1](/docs/spark/2.3.1/documentation)
+- [2.3.0](/docs/spark/2.3.0/documentation)
 - [2.2.2](/docs/spark/2.2.2/documentation)
 - [2.2.1](/docs/spark/2.2.1/documentation)
 - [2.2.0](/docs/spark/2.2.0/documentation)
