intro-cooccurrence-spark.mdtext

pat Fri, 10 Jul 2015 10:50:59 -0700

Author: pat
Date: Fri Jul 10 17:50:35 2015
New Revision: 1690298

URL: http://svn.apache.org/r1690298
Log:
CMS commit to mahout by pat


Modified:
    mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext

Modified: mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext?rev=1690298&r1=1690297&r2=1690298&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext Fri Jul 10 17:50:35 2015
@@ -5,14 +5,6 @@ be used to create "other people also lik
 personalize recommendations for individual users. *spark-rowsimilarity* can provide non-personalized content based
 recommendations and when paired with a search engine can be used to personalize content based recommendations.
 
-![image](http://s6.postimg.org/r0m8bpjw1/recommender_architecture.png)
-
-This is a simplified Lambda architecture with Mahout's *spark-itemsimilarity* playing the batch model building role and a search engine playing the realtime serving role.
-
-You will create two collections, one for user history and one for item "indicators". Indicators are user interactions that lead to the wished for interaction. So for example if you wish a user to purchase something and you collect all users purchase interactions *spark-itemsimilarity* will create a purchase indicator from them. But you can also use other user interactions in a cross-cooccurrence calculation, to create purchase indicators.
-
-User history is used as a query on the item collection with its cooccurrence and cross-cooccurrence indicators (there may be several indicators). The primary interaction or action is picked to be the thing you want to recommend, other actions are believed to be corelated but may not indicate exactly the same user intent. For instance in an ecom recommender a purchase is a very good primary action, but you may also know product detail-views, or additions-to-wishlists. These can be considered secondary actions which may all be used to calculate cross-cooccurrence indicators. The user history that forms the recommendations query will contain recorded primary and secondary actions all targetted towards the correct indicator fields.
-
 ##References
 
 1. A free ebook, which talks about the general idea: [Practical Machine Learning](https://www.mapr.com/practical-machine-learning)
@@ -30,7 +22,7 @@ Mahout's mapreduce version of itemsimila
 Mahout's ID requirements--they are non-negative integers that can be viewed as row and column numbers in a matrix.
 
 *spark-itemsimilarity* also extends the notion of cooccurrence to cross-cooccurrence, in other words the Spark version will
-account for multi-modal interactions and create cross-cooccurrence indicator matrices allowing the use of much more data in
+account for multi-modal interactions and create indicator matrices allowing the use of much more data in
 creating recommendations or similar item lists. People try to do this by mixing different actions and giving them weights.
 For instance they might say an item-view is 0.2 of an item purchase. In practice this is often not helpful. Spark-itemsimilarity's
 cross-cooccurrence is a more principled way to handle this case. In effect it scrubs secondary actions with the action you want
@@ -185,7 +177,7 @@ cross-cooccurrence so a primary indicato
       |-- similarity-matrix - TDF part files
       \-- cross-similarity-matrix - TDF part-files
 
-The similarity-matrix will contain the lines:
+The indicator matrix will contain the lines:
 
     galaxy\tnexus:1.7260924347106847
     ipad\tiphone:1.7260924347106847
@@ -193,7 +185,7 @@ The similarity-matrix will contain the l
     iphone\tipad:1.7260924347106847
     surface
 
-The cross-similarity-matrix will contain:
+The cross-cooccurrence indicator matrix will contain:
 
     iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847
     ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897
@@ -313,15 +305,15 @@ the only similarity method supported thi
 LLR is used more as a quality filter than as a similarity measure. However *spark-rowsimilarity* will produce
 lists of similar docs for every doc if input is docs with lists of terms. The Apache [Lucene](http://lucene.apache.org) project provides several methods of [analyzing and tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description) documents.
 
-#<a name="unified-recommender">4. Creating a Multimodal Recommender</a>
+#<a name="unified-recommender">4. Creating a Unified Recommender</a>
 
-Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can build a miltimodal cooccurrence and content based
+Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can build a unified cooccurrence and content based
  recommender that can be used in both or either mode depending on indicators available and the history available at
-runtime for a user. Some slide describing this method can be found [here](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
+runtime for a user.
 
 ##Requirements
 
-1. Mahout SNAPSHOT-1.0 or later
+1. Mahout 0.10.0 or later
 2. Hadoop
 3. Spark, the correct version for your version of Mahout and Hadoop
 4. A search engine like Solr or Elasticsearch
@@ -359,7 +351,7 @@ We will need to create 1 cooccurrence in
 from the secondary action (view) 
 and 1 content indicator (tags). We'll have to run *spark-itemsimilarity* once and *spark-rowsimilarity* once.
 
-We have described how to create the collaborative filtering indicators for purchase and view (the [How to use Multiple User
+We have described how to create the collaborative filtering indicator and cross-cooccurrence indicator for purchase and view (the [How to use Multiple User
 Actions](#multiple-actions) section) but tags will be a slightly different process. We want to use the fact that
 certain items have tags similar to the ones associated with a user's purchases. This is not a collaborative filtering indicator
 but rather a "content" or "metadata" type indicator since you are not using other users' history, only the
@@ -374,8 +366,6 @@ items with the most similar tags. Notice
 content or metadata indicator. They are used when you want to find items that are similar to other items by using their
 content or metadata, not by which users interacted with them.
 
-**Note**: It may be advisable to treat tags as cross-cooccurrence indicators but for the sake of an example they are treated here as content only.
-
 For this we need input of the form:
 
     itemID<tab>list-of-tags
@@ -388,7 +378,7 @@ The full collection will look like the t
     ...
 
 We'll use *spark-rowimilairity* because we are looking for similar rows, which encode items in this case. As with the
-collaborative filtering indicators we use the --omitStrength option. The strengths created are
+collaborative filtering indicator and cross-cooccurrence indicator we use the --omitStrength option. The strengths created are
 probabilistic log-likelihood ratios and so are used to filter unimportant similarities. Once the filtering or downsampling
 is finished we no longer need the strengths. We will get an indicator matrix of the form:
 
@@ -403,9 +393,10 @@ This is a content indicator since it has
     
 We now have three indicators, two collaborative filtering type and one content type.
 
-##Multimodal Recommender Query
+##Unified Recommender Query
 
-The actual form of the query for recommendations will vary depending on your search engine but the intent is the same. For a given user, map their history of an action or content to the correct indicator field and perform an OR'd query.
+The actual form of the query for recommendations will vary depending on your search engine but the intent is the same.
+For a given user, map their history of an action or content to the correct indicator field and perform an OR'd query.
 
 We have 3 indicators, these are indexed by the search engine into 3 fields, we'll call them "purchase", "view", and "tags".
 We take the user's history that corresponds to each indicator and create a query of the form:
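
The OR'd query described in the last hunk could be assembled roughly as follows. This is a minimal sketch, assuming an Elasticsearch-style JSON query body; the function name `build_unified_query`, the `size` parameter, and the sample history are hypothetical and are not part of Mahout or of this commit. Each indicator field ("purchase", "view", "tags") gets one `match` clause over the user's history for that field, and the clauses are combined under `bool`/`should` so that any of them may match.

```python
import json


def build_unified_query(history, size=10):
    """Build an OR'd query body from a user's per-indicator history.

    `history` maps an indicator field name to the list of item IDs or tags
    recorded for the user under that field (hypothetical input shape).
    """
    should = [
        # A match query ORs the whitespace-separated terms by default.
        {"match": {field: " ".join(values)}}
        for field, values in history.items()
        if values  # skip indicators with no history for this user
    ]
    # bool/should combines the per-field clauses as alternatives (OR).
    return {"size": size, "query": {"bool": {"should": should}}}


# Hypothetical user history, one entry per indicator field.
user_history = {
    "purchase": ["iphone", "ipad"],
    "view": ["nexus", "galaxy", "iphone"],
    "tags": [],
}

query = build_unified_query(user_history)
print(json.dumps(query, indent=2))
```

A user with no history in one field simply contributes no clause for it, so the same query shape would serve both the collaborative filtering and the content based modes. With Solr the equivalent would be an OR of per-field clauses in its own query syntax.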

