Github user kchilton2 commented on a diff in the pull request: https://github.com/apache/incubator-rya/pull/273#discussion_r170710673 --- Diff: extras/rya.manual/src/site/markdown/rya-streams.md --- @@ -0,0 +1,385 @@ +<!-- +[comment]: # Licensed to the Apache Software Foundation (ASF) under one +[comment]: # or more contributor license agreements. See the NOTICE file +[comment]: # distributed with this work for additional information +[comment]: # regarding copyright ownership. The ASF licenses this file +[comment]: # to you under the Apache License, Version 2.0 (the +[comment]: # "License"); you may not use this file except in compliance +[comment]: # with the License. You may obtain a copy of the License at +[comment]: # +[comment]: # http://www.apache.org/licenses/LICENSE-2.0 +[comment]: # +[comment]: # Unless required by applicable law or agreed to in writing, +[comment]: # software distributed under the License is distributed on an +[comment]: # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +[comment]: # KIND, either express or implied. See the License for the +[comment]: # specific language governing permissions and limitations +[comment]: # under the License. +--> + +# Rya Streams + +Introduced in 3.2.12 + +## Disclaimer +This is a Beta feature. We do not guarantee newer versions of Rya Streams +will be compatible with this version. You may need to remove all of your +queries and their associated data from your Rya Streams system and then +reprocess them using the upgraded system. + +# Table of Contents +- [Introduction](#introduction) +- [Architecture](#architecture) +- [Quick Start](#quick-start) +- [Use Cases](#use-cases) +- [Future Work](#future-work) + +<div id='introduction'/> + +## Introduction +Rya Streams is a system that processes SPARQL queries over streams of RDF +Statements that may have Visibilities attached to them. It does this by +utilizing Kafka as a data processing platform. + +There are three basic building blocks that the system depends on: + +* **Streams Query** - This is a SPARQL query that is registered wth Rya Streams. + It is associated with a specific Rya instance because that Rya instance + determines which Statements the query will evaluate. It has an ID that + uniquely identifies it across all of the queries that are managed by the + system, whether or not the system should be processing it, and whether or not + the results of the query needs to be inserted back into the Rya instance the + source statements come from. + +* **Query Change Log** - A list of changes that have been performed to the + Streams Queries of a specific Rya Instance. This log contains the absolute + truth about what queries are registered, which are running, and which generate + new statements that need to be inserted back into Rya. + +* **Query Manager** - A daemon application that reacts to new/deleted Query + Change Logs as well as new entries within those logs. It starts and stops + Kafka Streams processing jobs for the active Streams Queries. + +The Quick Start section explains how the Rya Streams Client is used to interact +with the system using a simple SPARQL query and some sample Statements. + +<div id='architecture'/> + +## Architecture ## + +The following image is a high level view of how Rya Streams interacts with +Rya to process queries. + +![alt text](../images/rya-streams-high-level.png) + +1. The Rya Streams client is used to register/update/delete queries. +2. Rya Streams notices the change starts/stops processing a query based on + what the change was. +3. Statements are discovered for ingest by whatever application is loading data + into Rya. +4. Those Statements are written to Rya Streams so that the Streams Queries may + process them and produce results. +5. Those Statements are also written to the Rya instance for ad-hoc querying. +6. Rya Streams produces Visibility Binding Sets and/or Visibility Statements + that are written back to Rya. +7. Those same Visibility Binding Sets and/or Visibility Statements are made + available to other systems as well. + +<div id='quick-start'/> + +## Quick Start ## +This tutorial demonstrates how to install and start the Rya Streams system. It +must be configured and running on its own before any Rya instances may use it. +After performing the steps of this quick start, you will have installed the +system, demonstrated that it is functioning properly, and then may use the Rya +Shell to configure Rya instances to use it. + +This tutorial assumes you are starting fresh and have no existing Kafka, or +Zookeeper data. The two services must already be installed and running. + +### Step 1: Download the applications ### + +You can fetch the artifacts you need to follow this Quick Start from our +[downloads page](http://rya.apache.org/download/). Click on the release of +interest and follow the "Central repository for Maven and other dependency +managers" URL. + +Fetch the following two artifacts: + +Artifact Id | Type +--- | --- +rya.streams.client | shaded jar +rya.streams.query-manager | rpm + +### Step 2: Install the Query Manager ### + +Copy the RPM to the CentOS 7 machine the Query Manager will be installed on. +Install it using the following command: + +``` +yum install -y rya.streams.query-manager-3.2.12-incubating.noarch.rpm +``` + +It will install the program to **/opt/rya-streams-query-manager-3.2.12**. Follow +the directions that are in the README.txt file within that directory to finish +configuration. + +### Step 3: Register a Streams Query with Rya Streams ### + +Use the Rya Streams Client to register the following Streams Query: + +``` +SELECT * +WHERE { + ?person <urn:talksTo> ?employee . + ?employee <urn:worksAt> ?business +} +``` +We assume Kafka is running on the local machine using the standard Kafka port. +Issue the following command: + +``` +java -jar rya.streams.client-3.2.12-incubating-SNAPSHOT-shaded.jar add-query \ + -i localhost -p 9092 -r rya-streams-quick-start -a true \ + -q "SELECT * WHERE { ?person <urn:talksTo> ?employee .?employee <urn:worksAt> ?business }" +``` +The Query Manager should eventually see that this query was registered and +start a Rya Streams job that will begin processing it using any Visibility +Statements that have been loaded into Rya Streams. If no results are observed +in step 6 , then verify the Query Manager is working properly. It's logs can be +found in **/opt/rya-streams-query-manager-3.2.12/logs**. + +### Step 4: Start watching for results ### + +We need to fetch the Query ID of the query we want to watch. This can be looked +up by issuing the following command: + +``` +java -jar rya.streams.client-3.2.12-incubating-SNAPSHOT-shaded.jar list-queries \ + -i localhost -p 9092 -r rya-streams-quick-start" +``` + +The client will print something that looks like this: + +``` +Queries in Rya Streams: +--------------------------------------------------------- +ID: 8dd689ee-9d16-4aa7-91c0-667cdb3ed81a Is Active: true Query: SELECT * WHERE { ?person <urn:talksTo> ?employee .?employee <urn:worksAt> ?business } +``` + +Now we know that if we want to reference the query we registered is step 3, we +need to use the Query ID **8dd689ee-9d16-4aa7-91c0-667cdb3ed81a**. Start +watching the Stream Query's output using the following command: + +``` +java -jar rya.streams.client-3.2.12-incubating-SNAPSHOT-shaded.jar stream-results \ + -i localhost -p 9092 -r rya-streams-quick-start" -q 8dd689ee-9d16-4aa7-91c0-667cdb3ed81a +``` + +The command will not finish. The console will print results once they are produced +by the Rya Streams job that is processing the query. Results will appear once +Statements have been loaded. + +### Step 5: Load data into the input topic ### + +In a new terminal, create a file named **quick-start-data.nt** that contains +the following text: + +``` +<urn:Bob> <urn:worksAt> <urn:TacoPlace> . +<urn:Charlie> <urn:worksAt> <urn:BurngerJoint> . +<urn:Eve> <urn:worksAt> <urn:CoffeeShop> . +<urn:Bob> <urn:worksAt> <urn:BurgerJoint> . +<urn:Alice> <urn:talksTo> <urn:Bob> . +``` +For the sake of simplicity within this quick start, we aren't going to use +visibilities when we load the statements. Load the file using the following +command: + +``` +java -jar rya.streams.client-3.2.12-incubating-SNAPSHOT-shaded.jar load-statements \ + -i localhost -p 9092 -r rya-streams-quick-start" -v "" -f ./quick-start-data.nt +``` + +### Step 6: Observe the results that appear ### + +Go back to the terminal that was listening for results. The following results +will have appeared: + +``` + names: + [name]: person --- [value]: urn:Alice + [name]: employee --- [value]: urn:Bob + [name]: business --- [value]: urn:BurgerJoint + Visibility: + + names: + [name]: person --- [value]: urn:Alice + [name]: employee --- [value]: urn:Bob + [name]: business --- [value]: urn:TacoPlace + Visibility: +``` + +<div id='use-cases'/> + +## Use Cases ## +### Alerting ### +An alerting system's job is to notify people, or another application, when +something of interest has been observed. Alert latency is the amount of time +it takes for a system to issue the alert after the observations required to do +so have been made. How the alerts are used determines how much latency is able +to be tolerated. For example, if the alerting system is used to discover when a +fire has started in a building, then that latency period needs to be short in +order to save the building. However, some systems do not require low latency. +Status updates typically aren't time critical. An alert that you package has +been shipped doens't need to be timely since your package still hasn't arrived. + +Rya Streams is able to be used to build alerting systems. The conditions that +generate alerts are encapsulated within a SPARQL query, the observations that +are watched are the RDF Statements, and the alert is a Binding Set or +constructed Statement that is output by the query. It being low/high latency +will depend on how much data needs to be processed and how performant the +alerting queries are. + +### Inference Engine - Forward Chaining ### + +An inference engine that uses forward chaining starts with a set of inference +rules and some data. It uses the rules to generate new data until it reasons +its way to some end goal/answer. For example, suppose that the goal is to +conclude the color of a pet named Fritz, given that he croaks and eats flies, +and that the rule base contains the following four rules: + + 1. If X croaks and X eat flies - Then X is a frog + 2. If X chirps and X sings - Then X is a canary + 3. If X is a frog - Then X is green + 4. if X is a canary - Then X is yellow + +If the following two facts are evaluated using this rule set: +* Fritz croaks +* Fritz eats flies + +Then the following facts are inferred about Fritz: +* Fritz is a frog +* Fritz is green + +Rya Streams is able to be used to build an inference engine that executes that +rule set. SPARQL queries are used to generate new Statements from the original +set of Statements. For example: + +``` +INSERT { + ?x <urn:isA> <urn:frog> +} WHERE { + ?x <urn:makesSound> <urn:croaks> . + ?x <urn:eats> <urn:flies> . +} + +INSERT { + ?x <urn:hasColor> <urn:green> +} WHERE { + ?x <urn:isA> <urn:frog> +} +``` +As long as the first query's results are ingested into the streams application, +the second query will infer that ?x is green based on the previous query's +inferred statement. + +### PCJ Maintenance ### + +Pre-Computed Joins are a feature within Rya where a query's results are +maintained and used as an index to speed up query evaluation. See the PCJ +section of the manual for more information about the specifics of how that +index works. + +Rya Streams is able to create the Visibility Binding Sets that need to be loaded +into the PCJ index. It just needs to see the same Statements that the core Rya +instance has been loaded with. + +<div id='future-work'/> + +## Future Work ## +Each of the following subsections cover an improvement that could be made to +the Rya Streams system. These improvements would push this feature towards a +production ready state. + +### Historic Statement Processing ### +As currently implemented, a program needs to write new Statements to the core +Rya data store as well as the input topic for Rya Streams. Every time a query +is added to Rya Streams for processing, it will only see newly added Statements. +We could write a Kafka Connect Rya source that import historic data into the +statements topic so that it, too, will be processed. + +### PCJ Maintenance ### +The Rya Streams system is capable of creating PCJ results over streamed +statements, but there isn't anything written to import those results from the +QueryResult topic into the source Rya's PCJ index. We could use Kakfa Connect +to write those binidng sets into the index. --- End diff -- Done.
---