Repository: incubator-s2graph
Updated Branches:
  refs/heads/master 33e3d267e -> f0f1081b1
add REAME.md to explaining movielens example.

Project: http://git-wip-us.apache.org/repos/asf/incubator-s2graph/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-s2graph/commit/d587538d
Tree: http://git-wip-us.apache.org/repos/asf/incubator-s2graph/tree/d587538d
Diff: http://git-wip-us.apache.org/repos/asf/incubator-s2graph/diff/d587538d

Branch: refs/heads/master
Commit: d587538d61c2add23c241f832e39d6ca739ed979
Parents: 33e3d26
Author: DO YUNG YOON <steams...@apache.org>
Authored: Tue May 15 16:07:07 2018 +0900
Committer: DO YUNG YOON <steams...@apache.org>
Committed: Tue May 15 19:06:01 2018 +0900

----------------------------------------------------------------------
 example/movielens/README.md | 453 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 453 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-s2graph/blob/d587538d/example/movielens/README.md
----------------------------------------------------------------------
diff --git a/example/movielens/README.md b/example/movielens/README.md
new file mode 100644
index 0000000..04bfaf7
--- /dev/null
+++ b/example/movielens/README.md
@@ -0,0 +1,453 @@
+<!---
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+--->
+
+# Movie Recommendation with Apache S2Graph (incubating) and Spark MLlib
+
+We will briefly walk through an example of building a movie recommendation service using the public MovieLens dataset.
+
+There is plenty of material on collaborative filtering and on building recommendation datasets,
+so we will focus on how to integrate a trained machine learning model with the property graph model.
+
+---------
+
+## The technologies we'll use
+### [Apache S2Graph](https://s2graph.apache.org/)
+
+The graph database that stores the entire MovieLens dataset. S2Graph also provides S2GraphQL, a unified REST interface for both graph queries and serving trained models.
+
+### [Apache Spark](https://spark.apache.org/)
+We process the MovieLens dataset with Apache Spark and, most importantly, use Spark's MLlib to train the model on the MovieLens data.
+
+### [Annoy4s](https://github.com/annoy4s/annoy4s)
+
+After Spark builds the model by running the ALS algorithm, we use annoy4s to build an index for finding approximate nearest neighbors.
+
+## The architecture
+
+![screen shot 2018-05-15 at 2 05 25 pm](https://user-images.githubusercontent.com/1264825/40040654-1389e7ba-5856-11e8-8823-5ab982a30ffc.png)
+
+
+This example sets up a local HBase, a local Spark, and a local S2GraphQL server as the environment, and uses [graphiql](https://github.com/graphql/graphiql) as the client.
+
+## The abstraction
+
+The following is the representation of the MovieLens dataset as a property graph model.
+
+### 1. Service
+
+A Service represents a namespace or database. In this example, we use `movielens` as the service, and all schemas and data live under this namespace.
+
+```graphql
+mutation{
+  Management{
+    createService(
+      name:"movielens"
+    ){
+      isSuccess
+      message
+      object{
+        id
+        name
+      }
+    }
+  }
+}
+```
+
+### 2. Vertex Schema
+
+A vertex schema represents a node type in the MovieLens dataset. Each node can store multiple properties if they are configured on the vertex schema.
+Schemas must be registered under the service before actual vertices and edges can be mutated or queried in S2Graph.
+
+#### 2.1. Movie
+
+The data is in the `movies.csv` file; the following is a sample.
+
+```
+movieId,title,genres
+1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
+2,Jumanji (1995),Adventure|Children|Fantasy
+3,Grumpier Old Men (1995),Comedy|Romance
+...
+```
+
+The following mutation, written in S2GraphQL, creates the schema.
+
+```graphql
+mutation{
+  Management{
+    createServiceColumn(
+      serviceName:movielens
+      columnName:"Movie"
+      columnType: long
+      props: [
+        {
+          name: "title"
+          dataType: string
+          defaultValue: ""
+          storeInGlobalIndex: true
+        },
+        {
+          name: "genres"
+          dataType: string
+          defaultValue: ""
+          storeInGlobalIndex: true
+        }
+      ]
+    ){
+      isSuccess
+      message
+      object{
+        id
+        name
+      }
+    }
+  }
+}
+```
+
+Note that S2Graph uses a **user-provided id**, which is usually the primary key in an RDBMS, as the vertexId.
+S2Graph guarantees the uniqueness of a vertex by using the composite of (service, serviceColumn, vertexId).
+
+Also note "storeInGlobalIndex", which lets S2Graph build a global index on the "title" property.
+When the user does not know the vertexId in advance but still wants to start a graph query from vertices that meet certain search criteria, this global index is helpful.
+
+#### 2.2 User
+
+In the real world, a User vertex can have various properties, such as age, gender, occupation, and location, but in the MovieLens dataset only userId is available.
+
+```graphql
+mutation{
+  Management{
+    createServiceColumn(
+      serviceName:movielens
+      columnName:"User"
+      columnType: long
+    ){
+      isSuccess
+      message
+      object{
+        id
+        name
+      }
+    }
+  }
+}
+```
+
+### 3. Edge
+
+Once the vertex schemas for Movie and User are created, it is time to create edge schemas that model the relations between User and Movie.
+
+#### 3.1. rated
+
+The data is in the `ratings.csv` file and represents which user rated which movie.
+
+```
+userId,movieId,rating,timestamp
+1,31,2.5,1260759144
+1,1029,3.0,1260759179
+1,1061,3.0,1260759182
+...
+```
+
+```graphql
+mutation{
+  Management{
+    createLabel(
+      name:"rated"
+      sourceService: {
+        movielens: {
+          columnName: User
+        }
+      }
+      targetService: {
+        movielens: {
+          columnName: Movie
+        }
+      }
+      serviceName: movielens
+      consistencyLevel: strong
+      props:[
+        {
+          name: "score"
+          dataType: double
+          defaultValue: "0.0"
+          storeInGlobalIndex: true
+        }
+      ]
+      indices:{
+        name:"_PK"
+        propNames:["score"]
+      }
+    ) {
+      isSuccess
+      message
+      object{
+        id
+        name
+        props{
+          name
+        }
+      }
+    }
+  }
+}
+```
+
+Since S2Graph supports vertex-centric indices, which are local to each vertex, we create the primary vertex-centric index "_PK" so that each vertex's edges are sorted by score.
+
+#### 3.2. tagged
+
+The `tags.csv` file contains the following data.
+
+```
+userId,movieId,tag,timestamp
+15,339,sandra 'boring' bullock,1138537770
+15,1955,dentist,1193435061
+...
+```
+
+```graphql
+mutation{
+  Management{
+    createLabel(
+      name:"tagged"
+      sourceService: {
+        movielens: {
+          columnName: User
+        }
+      }
+      targetService: {
+        movielens: {
+          columnName: Movie
+        }
+      }
+      serviceName: movielens
+      consistencyLevel: weak
+      props:[
+        {
+          name: "tag"
+          dataType: string
+          defaultValue: ""
+          storeInGlobalIndex: true
+        }
+      ]
+    ) {
+      isSuccess
+      message
+      object{
+        id
+        name
+        props{
+          name
+        }
+      }
+    }
+  }
+}
+```
+
+#### 3.3. similar_movie
+This represents the similar-movie relation, which is not actually stored in S2Graph but obtained by asking the ALS model.
+Since S2Graph provides a pluggable interface for fetching from and mutating storage, it is possible to plug in a custom model implementation.
+The [S2GRAPH-206](https://issues.apache.org/jira/projects/S2GRAPH/issues/S2GRAPH-206?filter=allopenissues) issue contains a few popular implementations of this interface, such as Annoy, FastText, and TensorFlow.
+
+```graphql
+mutation{
+  Management{
+    createLabel(
+      name:"similar_movie"
+      sourceService: {
+        movielens: {
+          columnName: Movie
+        }
+      }
+      targetService: {
+        movielens: {
+          columnName: Movie
+        }
+      }
+      serviceName: movielens
+      consistencyLevel: strong
+      props:[
+        {
+          name: "score"
+          dataType: double
+          defaultValue: "0.0"
+          storeInGlobalIndex: false
+        }
+      ]
+      indices:{
+        name:"_PK"
+        propNames:["score"]
+      }
+    ) {
+      isSuccess
+      message
+      object{
+        id
+        name
+        props{
+          name
+        }
+      }
+    }
+  }
+}
+```
+
+Note that no actual edges exist in the S2Graph system, but S2Graph knows which model to ask when a user queries "similar_movie" edges.
+Also note that instead of scanning the entire ALS model, we use Annoy for approximate k-nearest-neighbor search to make prediction fast.
+
+### Schema Summary
+
+![graphql-erd](https://user-images.githubusercontent.com/1264825/40039268-b8dcb9b4-5850-11e8-8c41-7ea651b25e02.png)
+
+--------------
+
+## Running this example
+
+### Setup
+
+1. Check out [Apache S2Graph master](https://github.com/apache/incubator-s2graph) locally.
+2. Install [Apache Spark](https://spark.apache.org/downloads.html) (>= v2.2.0) locally.
+3. Export **SPARK_HOME** pointing to the installed Spark.
+4. `cd example; sh run.sh`
+
+### Description
+
+#### 1. Prepare
+
+Prepare all prerequisites to run this example.
+
+- S2GraphQL server start
+  - package S2Graph.
+  - start standalone hbase.
+  - start s2graphql server on localhost port 8000.
+  - conf located under `target/apache-s2graph-*-incubating-bin/conf/`
+  - s2graphql log located under `target/apache-s2graph-*-incubating-bin/log/`
+
+- S2Jobs jar build
+  - create fat jar using `sbt project/s2jobs assembly`
+  - fat jar located under `s2jobs/target/scala-2.11/`
+
+- check that SPARK_HOME is set up correctly
+
+
+#### 2. Create Schema
+
+- download the MovieLens dataset (ml-latest-small.zip) under `example/movielens/input/`
+- create all the schemas explained above by sending mutation requests to the s2graphql server.
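Editor's note on the "Create Schema" step: a minimal sketch of how a mutation could be sent to the server from a script, using only the Python standard library. It assumes the server accepts the common GraphQL-over-HTTP convention (a JSON body with a `query` field sent by POST). The `/graphql` path, `build_request`, and `send` are hypothetical names for illustration; the README only states that the server listens on localhost port 8000.

```python
import json
import urllib.request

# Assumption: the endpoint path is hypothetical. The README only says the
# s2graphql server listens on localhost port 8000.
GRAPHQL_URL = "http://localhost:8000/graphql"

# The createService mutation from the README, trimmed to two result fields.
CREATE_SERVICE = """
mutation {
  Management {
    createService(name: "movielens") {
      isSuccess
      message
    }
  }
}
"""


def build_request(query: str, url: str = GRAPHQL_URL) -> urllib.request.Request:
    """Wrap a GraphQL document in a JSON body and prepare a POST request."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(query: str, url: str = GRAPHQL_URL) -> dict:
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(build_request(query, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, `send(CREATE_SERVICE)` would return the decoded response; the other mutations can be sent the same way, one request per schema.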
+
+#### 3. Import Data
+
+- load the MovieLens data as vertices and edges into S2Graph.
+- train the ALS model on `ratings.csv`.
+- build the annoy index from the dense matrix `itemFactors` in the trained ALS model.
+
+#### 4. Post Process
+
+- bind the annoy index built in step 3 to the "similar_movie" edge schema by updating "similar_movie".
+
+#### 5. Have fun with GraphQL
+- go to [graphiql](http://localhost:8000) and start traversing the movielens graph.
+
+We provide a few example queries that show how to traverse graph data and also serve the trained model.
+
+##### 5.1. Item Based Recommendation.
+
+This is the most basic kind of item-based collaborative filtering recommendation.
+Recommendations are **movies similar to the movies that each user rated**.
+
+Note that we ask the trained model for the k nearest neighbors to get similar_movie.
+
+```graphql
+query {
+  movielens {
+    User(id: 1) {
+      rated {
+        Movie {
+          title
+          similar_movie(limit: 5) {
+            Movie {
+              title
+            }
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+##### 5.2. Vertex Property Search.
+This shows S2Graph's global index feature, which answers **"movies that contain Toy in their title".**
+Note how intuitively the GraphQL syntax represents graph traversal.
+
+```graphql
+query {
+  movielens {
+    Movie(search: "title: *Toy*", limit: 5) {
+      title
+      tagged(limit: 10) {
+        User {
+          id
+          rated(limit: 5) {
+            Movie {
+              title
+            }
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+We can mix **model serving and graph traversal** as follows.
+
+```graphql
+query {
+  movielens {
+    Movie(search: "genres: *Comedy* AND title: *1995*", limit: 5) {
+      title
+      genres
+      similar_movie(limit: 5) {
+        Movie {
+          title
+          genres
+        }
+      }
+    }
+  }
+}
+```
+
+Note that we only need the trained model to traverse the "similar_movie" relation.
+
+## Summary
+
+We showed how to serve not only graph data that is actually stored in the graph database but also data obtained from a pre-trained model.
+
+In general, S2Graph abstracts **the pre-trained model as an immutable graph that can produce vertices/edges** for an input vertex.
+
+With this abstraction, there is no distinction between model serving and graph data from the client side.
\ No newline at end of file
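Editor's note on the summary: the idea of a model acting as an immutable graph can be illustrated with a small sketch. This is not S2Graph's interface. It assumes a toy table of item factors (made-up values standing in for the ALS model's `itemFactors`) and uses exact brute-force cosine similarity in place of the approximate search that Annoy performs. `ITEM_FACTORS`, `cosine`, and `similar_movie` are hypothetical names.

```python
import math

# Toy item factors standing in for the ALS model's `itemFactors`.
# The movie ids and vectors are made-up values for illustration only.
ITEM_FACTORS = {
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [0.0, 1.0],
    4: [-1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_movie(movie_id, limit=5):
    """Produce (target_id, score) "edges" for a source vertex on demand.

    Nothing is stored: every call recomputes the edges from the factors,
    so the model behaves like a read-only graph keyed by the input vertex.
    """
    source = ITEM_FACTORS[movie_id]
    scored = [
        (other, cosine(source, vec))
        for other, vec in ITEM_FACTORS.items()
        if other != movie_id
    ]
    # Highest score first, mirroring an index sorted by "score".
    scored.sort(key=lambda edge: edge[1], reverse=True)
    return scored[:limit]
```

From the caller's point of view, `similar_movie(1, limit=2)` looks like reading two outgoing edges of vertex 1, which is why a client cannot tell whether the edges came from storage or from a model.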