Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/273#discussion_r170082068

--- Diff: extras/rya.manual/src/site/markdown/rya-streams.md ---
@@ -0,0 +1,385 @@
+<!--
+[comment]: # Licensed to the Apache Software Foundation (ASF) under one
+[comment]: # or more contributor license agreements. See the NOTICE file
+[comment]: # distributed with this work for additional information
+[comment]: # regarding copyright ownership. The ASF licenses this file
+[comment]: # to you under the Apache License, Version 2.0 (the
+[comment]: # "License"); you may not use this file except in compliance
+[comment]: # with the License. You may obtain a copy of the License at
+[comment]: #
+[comment]: # http://www.apache.org/licenses/LICENSE-2.0
+[comment]: #
+[comment]: # Unless required by applicable law or agreed to in writing,
+[comment]: # software distributed under the License is distributed on an
+[comment]: # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+[comment]: # KIND, either express or implied. See the License for the
+[comment]: # specific language governing permissions and limitations
+[comment]: # under the License.
+-->
+
+# Rya Streams
+
+Introduced in 3.2.12
+
+## Disclaimer
+This is a Beta feature. We do not guarantee newer versions of Rya Streams
+will be compatible with this version. You may need to remove all of your
+queries and their associated data from your Rya Streams system and then
+reprocess them using the upgraded system.
+
+# Table of Contents
+- [Introduction](#introduction)
+- [Architecture](#architecture)
+- [Quick Start](#quick-start)
+- [Use Cases](#use-cases)
+- [Future Work](#future-work)
+
+<div id='introduction'/>
+
+## Introduction
+Rya Streams is a system that processes SPARQL queries over streams of RDF
+Statements that may have Visibilities attached to them. It does this by
+utilizing Kafka as a data processing platform.
+
+There are three basic building blocks that the system depends on:
+
+* **Streams Query** - This is a SPARQL query that is registered wth Rya Streams.
+  It is associated with a specific Rya instance because that Rya instance
+  determines which Statements the query will evaluate. It has an ID that
+  uniquely identifies it across all of the queries that are managed by the
+  system, whether or not the system should be processing it, and whether or not
+  the results of the query needs to be inserted back into the Rya instance the
+  source statements come from.
+
+* **Query Change Log** - A list of changes that have been performed to the
+  Streams Queries of a specific Rya Instance. This log contains the absolute
+  truth about what queries are registered, which are running, and which generate
+  new statements that need to be inserted back into Rya.
+
+* **Query Manager** - A daemon application that reacts to new/deleted Query
+  Change Logs as well as new entries within those logs. It starts and stops
+  Kafka Streams processing jobs for the active Streams Queries.
+
+The Quick Start section explains how the Rya Streams Client is used to interact
+with the system using a simple SPARQL query and some sample Statements.
+
+<div id='architecture'/>
+
+## Architecture ##
+
+The following image is a high level view of how Rya Streams interacts with
+Rya to process queries.
+
+![alt text](../images/rya-streams-high-level.png)
+
+1. The Rya Streams client is used to register/update/delete queries.
+2. Rya Streams notices the change starts/stops processing a query based on
+   what the change was.
+3. Statements are discovered for ingest by whatever application is loading data
+   into Rya.
+4. Those Statements are written to Rya Streams so that the Streams Queries may
+   process them and produce results.
+5. Those Statements are also written to the Rya instance for ad-hoc querying.
+6. Rya Streams produces Visibility Binding Sets and/or Visibility Statements
+   that are written back to Rya.
+7. Those same Visibility Binding Sets and/or Visibility Statements are made
+   available to other systems as well.
+
+<div id='quick-start'/>
+
+## Quick Start ##
+This tutorial demonstrates how to install and start the Rya Streams system. It
+must be configured and running on its own before any Rya instances may use it.
+After performing the steps of this quick start, you will have installed the
+system, demonstrated that it is functioning properly, and then may use the Rya
+Shell to configure Rya instances to use it.
+
+This tutorial assumes you are starting fresh and have no existing Kafka, or
+Zookeeper data. The two services must already be installed and running.
+
+### Step 1: Download the applications ###
+
+You can fetch the artifacts you need to follow this Quick Start from our
+[downloads page](http://rya.apache.org/download/). Click on the release of
+interest and follow the "Central repository for Maven and other dependency
+managers" URL.
+
+Fetch the following two artifacts:
+
+Artifact Id | Type
+--- | ---
+rya.streams.client | shaded jar
+rya.streams.query-manager | rpm
+
+### Step 2: Install the Query Manager ###
+
+Copy the RPM to the CentOS 7 machine the Query Manager will be installed on.
+Install it using the following command:
+
+```
+yum install -y rya.streams.query-manager-3.2.12-incubating.noarch.rpm
+```
+
+It will install the program to **/opt/rya-streams-query-manager-3.2.12**. Follow
+the directions that are in the README.txt file within that directory to finish
+configuration.
+
+### Step 3: Register a Streams Query with Rya Streams ###
+
+Use the Rya Streams Client to register the following Streams Query:
+
+```
+SELECT *
+WHERE {
+  ?person <urn:talksTo> ?employee .
+  ?employee <urn:worksAt> ?business
+}
+```
+We assume Kafka is running on the local machine using the standard Kafka port.
+Issue the following command:
+
+```
+java -jar rya.streams.client-3.2.12-incubating-SNAPSHOT-shaded.jar add-query \
+    -i localhost -p 9092 -r rya-streams-quick-start -a true \
+    -q "SELECT * WHERE { ?person <urn:talksTo> ?employee .?employee <urn:worksAt> ?business }"
+```
+The Query Manager should eventually see that this query was registered and
+start a Rya Streams job that will begin processing it using any Visibility
+Statements that have been loaded into Rya Streams. If no results are observed
+in step 6 , then verify the Query Manager is working properly. It's logs can be
+found in **/opt/rya-streams-query-manager-3.2.12/logs**.
+
+### Step 4: Start watching for results ###
+
+We need to fetch the Query ID of the query we want to watch. This can be looked
+up by issuing the following command:
+
+```
+java -jar rya.streams.client-3.2.12-incubating-SNAPSHOT-shaded.jar list-queries \
+    -i localhost -p 9092 -r rya-streams-quick-start"
+```
+
+The client will print something that looks like this:
+
+```
+Queries in Rya Streams:
+---------------------------------------------------------
+ID: 8dd689ee-9d16-4aa7-91c0-667cdb3ed81a Is Active: true Query: SELECT * WHERE { ?person <urn:talksTo> ?employee .?employee <urn:worksAt> ?business }
+```
+
+Now we know that if we want to reference the query we registered is step 3, we
+need to use the Query ID **8dd689ee-9d16-4aa7-91c0-667cdb3ed81a**. Start
+watching the Stream Query's output using the following command:
+
+```
+java -jar rya.streams.client-3.2.12-incubating-SNAPSHOT-shaded.jar stream-results \
+    -i localhost -p 9092 -r rya-streams-quick-start" -q 8dd689ee-9d16-4aa7-91c0-667cdb3ed81a
+```
+
+The command will not finish. The console will print results once they are produced
+by the Rya Streams job that is processing the query. Results will appear once
+Statements have been loaded.
+
+### Step 5: Load data into the input topic ###
+
+In a new terminal, create a file named **quick-start-data.nt** that contains
+the following text:
+
+```
+<urn:Bob> <urn:worksAt> <urn:TacoPlace> .
+<urn:Charlie> <urn:worksAt> <urn:BurngerJoint> .
--- End diff --

Typo: `BurngerJoint` should be `BurgerJoint`.
---