intro-cooccurrence-spark.mdtext

pat Sun, 08 Mar 2015 17:12:06 -0700

Author: pat
Date: Mon Mar  9 00:11:38 2015
New Revision: 1665098

URL: http://svn.apache.org/r1665098
Log:
better intro


Modified:
    
mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext

Modified: 
mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext
URL: 
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext?rev=1665098&r1=1665097&r2=1665098&view=diff
==============================================================================
--- 
mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext
 (original)
+++ 
mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext
 Mon Mar  9 00:11:38 2015
@@ -11,6 +11,14 @@ be used to create "other people also lik
 personalize multimodal recommendations for individual users. 
*spark-rowsimilarity* can provide non-personalized content based 
 recommendations and when paired with a search engine can be used to 
personalize content based recommendations.
 
+![image](http://s6.postimg.org/r0m8bpjw1/recommender_architecture.png)
+
+This is a simplified Lambda architecture with Mahout's *spark-itemsimilarity* 
playing the batch model building role and a search engine playing the realtime 
serving role.
+
+You will create two collections, one for user history and one for item 
"indicators". Indicators are user interactions that lead to the desired 
interaction. For example, if you want a user to purchase something and you 
collect all users' purchase interactions, *spark-itemsimilarity* will create a 
purchase indicator from them. You can also use other user interactions in a 
cross-cooccurrence calculation to create purchase indicators. 
+
+User history is used as a query on the item collection with its cooccurrence 
and cross-cooccurrence indicators (there may be several indicators). The 
primary interaction or action is chosen to be the thing you want to recommend; 
other actions are believed to be correlated but may not indicate exactly the 
same user intent. For instance, in an ecom recommender a purchase is a very good 
primary action, but you may also know product detail-views or 
additions-to-wishlists. These can be considered secondary actions, all of which 
may be used to calculate cross-cooccurrence indicators. The user history that 
forms the recommendations query will contain recorded primary and secondary 
actions, each targeted at its corresponding indicator field.
+
 ##References
 
 1. A free ebook, which talks about the general idea: [Practical Machine 
Learning](https://www.mapr.com/practical-machine-learning)
@@ -29,7 +37,7 @@ Mahout's mapreduce version of itemsimila
 Mahout's ID requirements--they are non-negative integers that can be viewed as 
row and column numbers in a matrix.
 
 *spark-itemsimilarity* also extends the notion of cooccurrence to 
cross-cooccurrence, in other words the Spark version will 
-account for multi-modal interactions and create cross-indicator matrices 
allowing the use of much more data in 
+account for multi-modal interactions and create cross-cooccurrence indicator 
matrices allowing the use of much more data in 
 creating recommendations or similar item lists. People try to do this by 
mixing different actions and giving them weights. 
 For instance they might say an item-view is 0.2 of an item purchase. In 
practice this is often not helpful. Spark-itemsimilarity's
 cross-cooccurrence is a more principled way to handle this case. In effect it 
scrubs secondary actions with the action you want
@@ -132,7 +140,7 @@ For instance a view of an item does not
 might even make worse recommendations. It is tempting though since there are 
so many more views than purchases. With *spark-itemsimilarity*
 we can now use both actions. Mahout will use cross-action cooccurrence 
analysis to limit the views to ones that do predict purchases.
 We do this by treating the primary action (purchase) as data for the indicator 
matrix and use the secondary action (view) 
-to calculate the cross-indicator matrix.  
+to calculate the cross-cooccurrence indicator matrix.  
 
 *spark-itemsimilarity* can read separate actions from separate files or from a 
mixed action log by filtering certain lines. For a mixed 
 action log of the form:
@@ -178,13 +186,13 @@ Use the following options:
 ###Output
 
 The output of the job will be the standard text version of two Mahout DRMs. 
This is a case where we are calculating 
-cross-cooccurrence so a primary indicator matrix and cross-indicator matrix 
will be created
+cross-cooccurrence so a primary indicator matrix and cross-cooccurrence 
indicator matrix will be created
 
     out-path
-      |-- indicator-matrix - TDF part files
-      \-- cross-indicator-matrix - TDF part-files
+      |-- similarity-matrix - TDF part files
+      \-- cross-similarity-matrix - TDF part-files
 
-The indicator matrix will contain the lines:
+The similarity-matrix will contain the lines:
 
     galaxy\tnexus:1.7260924347106847
     ipad\tiphone:1.7260924347106847
@@ -192,7 +200,7 @@ The indicator matrix will contain the li
     iphone\tipad:1.7260924347106847
     surface
 
-The cross-indicator matrix will contain:
+The cross-similarity-matrix will contain:
 
     iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 
ipad:1.7260924347106847 galaxy:1.7260924347106847
     ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 
ipad:0.6795961471815897 galaxy:0.6795961471815897
@@ -201,7 +209,7 @@ The cross-indicator matrix will contain:
     surface\tsurface:4.498681156950466 nexus:0.6795961471815897
 
 **Note:** You can run this multiple times to use more than two actions or you 
can use the underlying 
-SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any 
number of cross-indicators.
+SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any 
number of cross-cooccurrence indicators.
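
The text output lines shown in this hunk can be read back with a few lines of code before indexing them. The following is a minimal sketch, not part of Mahout: it assumes the delimiters visible in the sample output (a tab after the row ID, spaces between entries, a colon before the strength), and the function name `parse_indicator_line` is hypothetical. It also tolerates entries without a strength, as produced when --omitStrength is used.

```python
def parse_indicator_line(line):
    """Split one text-DRM line into (item_id, {similar_item_id: strength}).

    Assumed layout, taken from the sample output only:
    <item_id> TAB <id>:<strength> SPACE <id>:<strength> ...
    """
    item_id, _, rest = line.rstrip("\n").partition("\t")
    indicators = {}
    for entry in rest.split():
        if ":" in entry:
            # Split on the last colon so the strength is always the suffix.
            other_id, _, strength = entry.rpartition(":")
            indicators[other_id] = float(strength)
        else:
            # No strength present (for example with --omitStrength).
            indicators[entry] = None
    return item_id, indicators


sample = [
    "galaxy\tnexus:1.7260924347106847",
    "surface",
    "surface\tsurface:4.498681156950466 nexus:0.6795961471815897",
]
parsed = [parse_indicator_line(line) for line in sample]
```

A row with no similar items, such as `surface` in the similarity-matrix, comes back with an empty dictionary.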
 
 ###Log File Input
 
@@ -358,7 +366,7 @@ We will need to create 1 cooccurrence in
 from the secondary action (view) 
 and 1 content indicator (tags). We'll have to run *spark-itemsimilarity* once 
and *spark-rowsimilarity* once.
 
-We have described how to create the collaborative filtering indicator and 
cross-indicator for purchase and view (the [How to use Multiple User 
+We have described how to create the collaborative filtering indicators for 
purchase and view (the [How to use Multiple User 
 Actions](#multiple-actions) section) but tags will be a slightly different 
process. We want to use the fact that 
 certain items have tags similar to the ones associated with a user's 
purchases. This is not a collaborative filtering indicator 
 but rather a "content" or "metadata" type indicator since you are not using 
other users' history, only the 
@@ -385,7 +393,7 @@ The full collection will look like the t
     ...
 
 We'll use *spark-rowsimilarity* because we are looking for similar rows, which 
encode items in this case. As with the 
-collaborative filtering indicator and cross-indicator we use the 
--omitStrength option. The strengths created are 
+collaborative filtering indicators we use the --omitStrength option. The 
strengths created are 
 probabilistic log-likelihood ratios and so are used to filter unimportant 
similarities. Once the filtering or downsampling 
 is finished we no longer need the strengths. We will get an indicator matrix 
of the form:
 
@@ -431,7 +439,7 @@ on the popularity field. If we use the e
 This will return recommendations favoring ones that have the intrinsic 
indicator "hot".
 
 ##Notes
-1. Use as much user action history as you can gather. Choose a primary action 
that is closest to what you want to recommend and the others will be used to 
create cross-indicators. Using more data in this fashion will almost always 
produce better recommendations.
+1. Use as much user action history as you can gather. Choose a primary action 
that is closest to what you want to recommend and the others will be used to 
create cross-cooccurrence indicators. Using more data in this fashion will 
almost always produce better recommendations.
 2. Content can be used where there is no recorded user behavior or when items 
change too quickly to get much interaction history. They can be used alone or 
mixed with other indicators.
 3. Most search engines support "boost" factors so you can favor one or more 
indicators. In the example query, if you want tags to only have a small effect 
you could boost the CF indicators.
 4. In the examples we have used space delimited strings for lists of IDs in 
indicators and in queries. It may be better to use arrays of strings if your 
storage system and search engine support them. For instance Solr allows 
multi-valued fields, which correspond to arrays.
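
Notes 3 and 4 can be illustrated with a small sketch of how a user's history might be turned into a search query. This is an assumption-laden example, not Mahout or Solr code: the field names (`purchase`, `view`, `tags`), the helper `build_query`, and the Lucene-style `field:(id id)^boost` syntax are all illustrative, and item IDs are assumed to need no escaping.

```python
def build_query(history, boosts=None):
    """Build a Lucene-style query string from per-action user history.

    history: maps an indicator field name to the item IDs the user
             interacted with through that action (hypothetical names).
    boosts:  optional per-field boost factors.
    """
    boosts = boosts or {}
    clauses = []
    for field, item_ids in history.items():
        if not item_ids:
            continue  # nothing recorded for this action, skip the field
        clause = f"{field}:({' '.join(item_ids)})"
        boost = boosts.get(field)
        if boost is not None:
            clause += f"^{boost}"
        clauses.append(clause)
    return " ".join(clauses)


# Primary action (purchase) and secondary action (view) each target
# their own indicator field; the CF indicator is boosted over the rest.
history = {"purchase": ["iphone", "ipad"], "view": ["nexus"], "tags": []}
query = build_query(history, boosts={"purchase": 2.0})
print(query)
```

If the search engine supports multi-valued fields, the same `history` dictionary can be passed as arrays rather than being joined into space-delimited strings.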

svn commit: r1665098 - /mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext
