Repository: mahout
Updated Branches:
  refs/heads/master 5a1d85f59 -> 0a51eb0f5


fixed markdown and typos in Intro to Correlated Cross-Occurrence Recommenders 
with Spark


Project: http://git-wip-us.apache.org/repos/asf/mahout/repo
Commit: http://git-wip-us.apache.org/repos/asf/mahout/commit/0a51eb0f
Tree: http://git-wip-us.apache.org/repos/asf/mahout/tree/0a51eb0f
Diff: http://git-wip-us.apache.org/repos/asf/mahout/diff/0a51eb0f

Branch: refs/heads/master
Commit: 0a51eb0f5bf3e70575b8b5527de23d05650d8bdd
Parents: 5a1d85f
Author: pferrel <[email protected]>
Authored: Tue Jun 26 13:01:23 2018 -0700
Committer: pferrel <[email protected]>
Committed: Tue Jun 26 13:01:23 2018 -0700

----------------------------------------------------------------------
 .../algorithms/intro-cooccurrence-spark.md      | 178 +++++++++----------
 1 file changed, 86 insertions(+), 92 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mahout/blob/0a51eb0f/website/users/algorithms/intro-cooccurrence-spark.md
----------------------------------------------------------------------
diff --git a/website/users/algorithms/intro-cooccurrence-spark.md 
b/website/users/algorithms/intro-cooccurrence-spark.md
index 72eeecf..a9ee71d 100644
--- a/website/users/algorithms/intro-cooccurrence-spark.md
+++ b/website/users/algorithms/intro-cooccurrence-spark.md
@@ -5,12 +5,11 @@ title: Intro to Cooccurrence Recommenders with Spark
     
 ---
 
-#Intro to Cooccurrence Recommenders with Spark
+# Intro to Correlated Cross-Occurrence Recommenders with Spark
 
-Mahout provides several important building blocks for creating recommendations 
using Spark. *spark-itemsimilarity* can 
-be used to create "other people also liked these things" type recommendations 
and paired with a search engine can 
-personalize recommendations for individual users. *spark-rowsimilarity* can 
provide non-personalized content based 
-recommendations and when paired with a search engine can be used to 
personalize content based recommendations.
+Mahout's CCO algorithm is one of a new breed of "Multimodal" recommenders that 
can use input of many types in very flexible ways. 
+
+Mahout provides several important building blocks for creating recommendations using Spark. *spark-itemsimilarity* can be used to create "other people also liked these things" type recommendations and, paired with a search engine, can personalize recommendations for individual users. *spark-rowsimilarity* can provide non-personalized content-based recommendations and, when paired with a search engine, can be used to personalize content-based recommendations.
 
 ![image](http://s6.postimg.org/r0m8bpjw1/recommender_architecture.png)
 
@@ -18,19 +17,19 @@ This is a simplified Lambda architecture with Mahout's 
*spark-itemsimilarity* pl
 
 You will create two collections, one for user history and one for item "indicators". Indicators are user interactions that lead to the wished-for interaction. For example, if you wish a user to purchase something and you collect all users' purchase interactions, *spark-itemsimilarity* will create a purchase indicator from them. But you can also use other user interactions in a cross-cooccurrence calculation to create purchase indicators. 
 
-User history is used as a query on the item collection with its cooccurrence 
and cross-cooccurrence indicators (there may be several indicators). The 
primary interaction or action is picked to be the thing you want to recommend, 
other actions are believed to be corelated but may not indicate exactly the 
same user intent. For instance in an ecom recommender a purchase is a very good 
primary action, but you may also know product detail-views, or 
additions-to-wishlists. These can be considered secondary actions which may all 
be used to calculate cross-cooccurrence indicators. The user history that forms 
the recommendations query will contain recorded primary and secondary actions 
all targetted towards the correct indicator fields.
+User history is used as a query on the item collection with its cooccurrence and cross-cooccurrence indicators (there may be several indicators). The primary interaction or indicator is picked to be the thing you want to recommend; other actions / indicators are believed to be correlated but may not indicate exactly the same user intent. For instance in an ecom recommender a purchase is a very good primary action / indicator, but you may also know about product detail-views or additions-to-wishlists. These can be considered secondary actions / indicators, which may all be used to calculate cross-cooccurrence indicators. The user history that forms the recommendations query will contain recorded primary and secondary indicators, all targeted towards the correct indicator fields.
 
-##References
+## References
 
 1. A free ebook, which talks about the general idea: [Practical Machine 
Learning](https://www.mapr.com/practical-machine-learning)
-2. A slide deck, which talks about mixing actions or other indicators: 
[Creating a Unified 
Recommender](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
+2. A slide deck, which talks about mixing several types of indicators: [Creating a Unified Recommender](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
 3. Two blog posts: [What's New in Recommenders: part 
#1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/)
 and  [What's New in Recommenders: part 
#2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/)
 4. A post describing the log-likelihood ratio: [Surprise and Coincidence](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html). LLR is used to reduce noise in the data while keeping the calculations at O(n) complexity.
 
 Below are the command line jobs, but the drivers and associated code can also be customized and accessed from the Scala APIs.
 
-##1. spark-itemsimilarity
+## 1. spark-itemsimilarity
 *spark-itemsimilarity* is the Spark counterpart of the Mahout mapreduce job called *itemsimilarity*. It takes in elements of interactions, which have userID, itemID, and optionally a value. It will produce one or more indicator matrices created by comparing every user's interactions with every other user's. The indicator matrix is an item x item matrix where the values are log-likelihood ratio strengths. For the legacy mapreduce version, there were several possible similarity measures, but these are being deprecated in favor of LLR because in practice it performs the best.
 
 Mahout's mapreduce version of itemsimilarity takes a text file that is 
expected to have user and item IDs that conform to 
@@ -38,9 +37,9 @@ Mahout's ID requirements--they are non-negative integers that 
can be viewed as r
 
 *spark-itemsimilarity* also extends the notion of cooccurrence to cross-cooccurrence; in other words, the Spark version will account for multi-modal interactions and create cross-cooccurrence indicator matrices, allowing the use of much more data in 
-creating recommendations or similar item lists. People try to do this by 
mixing different actions and giving them weights. 
+creating recommendations or similar item lists. People try to do this by 
mixing different indicators and giving them weights. 
 For instance they might say an item-view is 0.2 of an item purchase. In 
practice this is often not helpful. Spark-itemsimilarity's
-cross-cooccurrence is a more principled way to handle this case. In effect it 
scrubs secondary actions with the action you want
+cross-cooccurrence is a more principled way to handle this case. In effect it 
scrubs secondary indicators with the indicator you want
 to recommend.   
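
In matrix terms (a standard way of describing CCO in Mahout materials, summarized here rather than quoted from this page): let P be the sparse user-by-item matrix of the primary interaction (purchase) and V the matrix of the secondary interaction (view). The cooccurrence indicator is an LLR-filtered [P'P] and the cross-cooccurrence indicator an LLR-filtered [P'V], and a user with purchase history h_p and view history h_v is scored roughly as:

    r = [P'P] * h_p + [P'V] * h_v

The LLR test keeps only the entries of [P'V] where views genuinely correlate with purchases, which is the "scrubbing" described above.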
 
 
@@ -111,9 +110,9 @@ to recommend.
 
 This looks daunting, but it defaults to fairly sane values, takes exactly the same input as the legacy code, and is pretty flexible. It allows the user to point to a single text file, a directory full of files, or a tree of directories to be traversed recursively. The files included can be specified with either a regex-style pattern or a filename. The schema for the file is defined by column numbers, which map to the important bits of data including IDs and values. The files can even contain filters, which allow unneeded rows to be discarded or used for cross-cooccurrence calculations.
 
-See ItemSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. 
+See `ItemSimilarityDriver.scala` in Mahout's spark module if you want to 
customize the code. 
 
-###Defaults in the _**spark-itemsimilarity**_ CLI
+### Defaults in the _**spark-itemsimilarity**_ CLI
 
 If all defaults are used the input can be as simple as:
 
@@ -131,19 +130,16 @@ This will use the "local" Spark context and will output 
the standard text versio
 
     itemID1<tab>itemID2:value2<space>itemID10:value10...
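
As an aside, this text format is easy to consume downstream. A minimal Scala sketch (a hypothetical helper, assuming the default delimiters shown above and no ":" characters inside item IDs) that parses one output line:

    def parseLine(line: String): (String, Map[String, Double]) = {
      val parts = line.split("\t")
      val similars =
        if (parts.length < 2) Map.empty[String, Double] // item with no similar items
        else parts(1).split(" ").map { pair =>
          val Array(id, strength) = pair.split(":")     // "itemID:strength"
          id -> strength.toDouble
        }.toMap
      parts(0) -> similars
    }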
 
-###<a name="multiple-actions">How To Use Multiple User Actions</a>
+### <a name="multiple-actions">How To Use Multiple User Indicators</a>
 
-Often we record various actions the user takes for later analytics. These can 
now be used to make recommendations. 
-The idea of a recommender is to recommend the action you want the user to 
make. For an ecom app this might be 
-a purchase action. It is usually not a good idea to just treat other actions 
the same as the action you want to recommend. 
-For instance a view of an item does not indicate the same intent as a purchase 
and if you just mixed the two together you 
-might even make worse recommendations. It is tempting though since there are 
so many more views than purchases. With *spark-itemsimilarity*
-we can now use both actions. Mahout will use cross-action cooccurrence 
analysis to limit the views to ones that do predict purchases.
-We do this by treating the primary action (purchase) as data for the indicator 
matrix and use the secondary action (view) 
+Often we record several types of user interactions for later analytics. These can now be used as indicators to make recommendations. 
+The idea of a recommender is to recommend the action you want the user to take. For an ecom app this might be a purchase action recorded in a "purchase" indicator. It is usually not a good idea to treat other indicators the same as the one you want to recommend. For example, if you have user purchase and view data, never treat a view as a purchase; it will never increase the quality of recommendations. Instead, use the view data as a secondary indicator so the CCO algorithm can find meaningful correlated cross-occurrences. Without this the views will be so noisy they will almost surely reduce the performance of the recommender. Too many people have made this mistake. With *spark-itemsimilarity*
+we can now use both indicators. Mahout will use cross-occurrence analysis to limit the views to ones that do predict purchases.
+We do this by treating the primary indicator (purchase) as data for the indicator matrix and using the secondary indicator (view) 
 to calculate the cross-cooccurrence indicator matrix.  
 
-*spark-itemsimilarity* can read separate actions from separate files or from a 
mixed action log by filtering certain lines. For a mixed 
-action log of the form:
+*spark-itemsimilarity* can read separate indicators from separate files or 
from a mixed indicator log by filtering certain lines. For a mixed 
+indicator log of the form:
 
     u1,purchase,iphone
     u1,purchase,ipad
@@ -166,7 +162,7 @@ action log of the form:
     u4,view,ipad
     u4,view,galaxy
 
-###Command Line
+### Command Line
 
 
 Use the following options:
@@ -175,15 +171,15 @@ Use the following options:
        --input in-file \     # where to look for data
         --output out-path \   # root dir for output
         --master masterUrl \  # URL of the Spark master server
-        --filter1 purchase \  # word that flags input for the primary action
-        --filter2 view \      # word that flags input for the secondary action
+        --filter1 purchase \  # word that flags input for the primary indicator
+        --filter2 view \      # word that flags input for the secondary 
indicator
         --itemIDPosition 2 \  # column that has the item ID
         --rowIDPosition 0 \   # column that has the user ID
         --filterPosition 1    # column that has the filter word
 
 
 
-###Output
+### Output
 
 The output of the job will be the standard text version of two Mahout DRMs. 
This is a case where we are calculating 
 cross-cooccurrence so a primary indicator matrix and cross-cooccurrence 
indicator matrix will be created
@@ -194,49 +190,48 @@ cross-cooccurrence so a primary indicator matrix and 
cross-cooccurrence indicato
 
 The similarity-matrix will contain the lines:
 
-    galaxy\tnexus:1.7260924347106847
-    ipad\tiphone:1.7260924347106847
-    nexus\tgalaxy:1.7260924347106847
-    iphone\tipad:1.7260924347106847
+    galaxy<tab>nexus:1.7260924347106847
+    ipad<tab>iphone:1.7260924347106847
+    nexus<tab>galaxy:1.7260924347106847
+    iphone<tab>ipad:1.7260924347106847
     surface
 
 The cross-similarity-matrix will contain:
 
-    iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 
ipad:1.7260924347106847 galaxy:1.7260924347106847
-    ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 
ipad:0.6795961471815897 galaxy:0.6795961471815897
-    nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 
ipad:0.6795961471815897 galaxy:0.6795961471815897
-    galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 
ipad:1.7260924347106847 galaxy:1.7260924347106847
-    surface\tsurface:4.498681156950466 nexus:0.6795961471815897
+    iphone<tab>nexus:1.7260924347106847 iphone:1.7260924347106847 
ipad:1.7260924347106847 galaxy:1.7260924347106847
+    ipad<tab>nexus:0.6795961471815897 iphone:0.6795961471815897 
ipad:0.6795961471815897 galaxy:0.6795961471815897
+    nexus<tab>nexus:0.6795961471815897 iphone:0.6795961471815897 
ipad:0.6795961471815897 galaxy:0.6795961471815897
+    galaxy<tab>nexus:1.7260924347106847 iphone:1.7260924347106847 
ipad:1.7260924347106847 galaxy:1.7260924347106847
+    surface<tab>surface:4.498681156950466 nexus:0.6795961471815897
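
Where do strengths like 1.7260924347106847 come from? They are Dunning's log-likelihood ratio (G2) scores over 2x2 contingency tables of user counts. A minimal Scala sketch (a hypothetical helper following the standard formula, not code lifted from Mahout's source) that reproduces the galaxy/nexus strength from the purchase data in this example:

    def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
      def xLogX(x: Long): Double = if (x == 0L) 0.0 else x * math.log(x.toDouble)
      // unnormalized entropy of a set of counts, as used in Dunning's G^2
      def entropy(ks: Long*): Double = xLogX(ks.sum) - ks.map(xLogX).sum
      val rowEntropy    = entropy(k11 + k12, k21 + k22)
      val colEntropy    = entropy(k11 + k21, k12 + k22)
      val matrixEntropy = entropy(k11, k12, k21, k22)
      2.0 * (rowEntropy + colEntropy - matrixEntropy)
    }
    // Of the 4 users' purchases: u2 bought both galaxy and nexus, u4 bought
    // galaxy but not nexus, nobody bought nexus without galaxy, and u1 and u3
    // bought neither.
    println(llr(1, 1, 0, 2)) // 1.7260924347106847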
 
-**Note:** You can run this multiple times to use more than two actions or you can use the underlying 
-SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any number of cross-cooccurrence indicators.
+**Note:** You can run this multiple times to use more than two indicators, or you can use the underlying `SimilarityAnalysis.cooccurrence` API in your own application as a library, which will more efficiently calculate any number of cross-cooccurrence indicators.
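
A minimal sketch of the library route (hedged: `cooccurrencesIDSs` and its parameter names are taken from the `SimilarityAnalysis` object in Mahout's spark module, but check the Scala API docs for the exact signature in your release):

    import org.apache.mahout.math.cf.SimilarityAnalysis
    import org.apache.mahout.math.indexeddataset.IndexedDataset

    // `purchases` and `views` are assumed to be IndexedDatasets already read
    // from your interaction logs; the first element is the primary indicator.
    def indicators(purchases: IndexedDataset, views: IndexedDataset): List[IndexedDataset] =
      SimilarityAnalysis.cooccurrencesIDSs(
        Array(purchases, views),
        maxInterestingItemsPerThing = 50, // keep the top 50 similar items per item
        maxNumInteractions = 500)         // downsample hyperactive users

    // The head of the returned list is the cooccurrence indicator (purchase);
    // each remaining element is a cross-cooccurrence indicator (here, view).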
 
-###Log File Input
+### Log File Input
 
 A common method of storing data is in log files. If they are written using 
some delimiter they can be consumed directly by spark-itemsimilarity. For 
instance input of the form:
 
-    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad
-    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus
-    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy
-    2014-06-23 14:46:53.115\tu3\tpurchase\trandom text\tsurface
-    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tgalaxy
-    2014-06-23 14:46:53.115\tu1\tview\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad
-    2014-06-23 14:46:53.115\tu1\tview\trandom text\tnexus
-    2014-06-23 14:46:53.115\tu1\tview\trandom text\tgalaxy
-    2014-06-23 14:46:53.115\tu2\tview\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu2\tview\trandom text\tipad
-    2014-06-23 14:46:53.115\tu2\tview\trandom text\tnexus
-    2014-06-23 14:46:53.115\tu2\tview\trandom text\tgalaxy
-    2014-06-23 14:46:53.115\tu3\tview\trandom text\tsurface
-    2014-06-23 14:46:53.115\tu3\tview\trandom text\tnexus
-    2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad
-    2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy    
-
-Can be parsed with the following CLI and run on the cluster producing the same 
output as the above example.
+    2014-06-23 14:46:53.115<tab>u1<tab>purchase<tab>random text<tab>iphone
+    2014-06-23 14:46:53.115<tab>u1<tab>purchase<tab>random text<tab>ipad
+    2014-06-23 14:46:53.115<tab>u2<tab>purchase<tab>random text<tab>nexus
+    2014-06-23 14:46:53.115<tab>u2<tab>purchase<tab>random text<tab>galaxy
+    2014-06-23 14:46:53.115<tab>u3<tab>purchase<tab>random text<tab>surface
+    2014-06-23 14:46:53.115<tab>u4<tab>purchase<tab>random text<tab>iphone
+    2014-06-23 14:46:53.115<tab>u4<tab>purchase<tab>random text<tab>galaxy
+    2014-06-23 14:46:53.115<tab>u1<tab>view<tab>random text<tab>iphone
+    2014-06-23 14:46:53.115<tab>u1<tab>view<tab>random text<tab>ipad
+    2014-06-23 14:46:53.115<tab>u1<tab>view<tab>random text<tab>nexus
+    2014-06-23 14:46:53.115<tab>u1<tab>view<tab>random text<tab>galaxy
+    2014-06-23 14:46:53.115<tab>u2<tab>view<tab>random text<tab>iphone
+    2014-06-23 14:46:53.115<tab>u2<tab>view<tab>random text<tab>ipad
+    2014-06-23 14:46:53.115<tab>u2<tab>view<tab>random text<tab>nexus
+    2014-06-23 14:46:53.115<tab>u2<tab>view<tab>random text<tab>galaxy
+    2014-06-23 14:46:53.115<tab>u3<tab>view<tab>random text<tab>surface
+    2014-06-23 14:46:53.115<tab>u3<tab>view<tab>random text<tab>nexus
+    2014-06-23 14:46:53.115<tab>u4<tab>view<tab>random text<tab>iphone
+    2014-06-23 14:46:53.115<tab>u4<tab>view<tab>random text<tab>ipad
+    2014-06-23 14:46:53.115<tab>u4<tab>view<tab>random text<tab>galaxy    
+
+Can be parsed with the following CLI and run on the cluster, producing the same output as the above example. The important bits of information in the example tab-delimited file are the user-id, indicator-name, and item-id. The rest is ignored.
 
     bash$ mahout spark-itemsimilarity \
         --input in-file \
@@ -249,20 +244,20 @@ Can be parsed with the following CLI and run on the 
cluster producing the same o
         --rowIDPosition 1 \
         --filterPosition 2
 
-##2. spark-rowsimilarity
+## 2. spark-rowsimilarity
 
 *spark-rowsimilarity* is the companion to *spark-itemsimilarity*; the primary difference is that it takes a text file version of 
 a matrix of sparse vectors with optional application-specific IDs, and it finds similar rows rather than items (columns). Its use is
 not limited to collaborative filtering. The input is in text-delimited form where there are three delimiters used. By 
-default it reads 
(rowID&lt;tab>columnID1:strength1&lt;space>columnID2:strength2...) Since this 
job only supports LLR similarity,
+default it reads 
`(rowID<tab>columnID1:strength1<space>columnID2:strength2...)` Since this job 
only supports LLR similarity,
  which does not use the input strengths, so they may be omitted in the input. It writes 
-(rowID&lt;tab>rowID1:strength1&lt;space>rowID2:strength2...) 
+`(rowID<tab>rowID1:strength1<space>rowID2:strength2...)` 
 The output is sorted by strength descending. The output can be interpreted as 
a row ID from the primary input followed 
 by a list of the most similar rows.
 
 The command line interface is:
 
-    spark-rowsimilarity Mahout 1.0
+    spark-rowsimilarity Mahout 0.x
     Usage: spark-rowsimilarity [options]
     
     Input, output options
@@ -314,67 +309,65 @@ See RowSimilarityDriver.scala in Mahout's spark module if 
you want to customize
 # 3. Using *spark-rowsimilarity* with Text Data
 
 Another use case for *spark-rowsimilarity* is in finding similar textual 
content. For instance given the tags associated with 
-a blog post,
- which other posts have similar tags. In this case the columns are tags and 
the rows are posts. Since LLR is 
+a blog post, find which other posts have similar tags. In this case the columns are tags and the rows are posts. Since LLR is 
 the only similarity method supported, this is not the optimal way to determine general "bag-of-words" document similarity. 
 LLR is used more as a quality filter than as a similarity measure. However, *spark-rowsimilarity* will produce 
-lists of similar docs for every doc if input is docs with lists of terms. The 
Apache [Lucene](http://lucene.apache.org) project provides several methods of 
[analyzing and 
tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description)
 documents.
+lists of similar docs for every doc if input is docs with lists of terms. The 
Apache [Lucene](http://lucene.apache.org) project provides several methods of 
analyzing and tokenizing documents.
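
For instance, input rows like these (hypothetical post IDs and tags; strengths are omitted since LLR ignores them) would yield, for each post, the posts whose tag sets overlap significantly:

    post1<tab>big-data machine-learning mahout
    post2<tab>cooking recipes baking
    post3<tab>big-data hadoop mahout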
 
-#<a name="unified-recommender">4. Creating a Multimodal Recommender</a>
+# <a name="unified-recommender">4. Creating a Multimodal Recommender</a>
 
 Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can build a multimodal cooccurrence and content-based recommender that can be used in both or either mode, depending on the indicators available and the history available at runtime for a user. Some slides describing this method can be found [here](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
 
-##Requirements
+## Requirements
 
-1. Mahout SNAPSHOT-1.0 or later
+1. Mahout 0.13.0 or later
 2. Hadoop
 3. Spark, the correct version for your version of Mahout and Hadoop
 4. A search engine like Solr or Elasticsearch
 
-##Indicators
+## Indicators
 
 Indicators come in 3 types
 
-1. **Cooccurrence**: calculated with *spark-itemsimilarity* from user actions
+1. **Correlated Cross-Occurrence**: calculated with *spark-itemsimilarity* 
from user indicators
 2. **Content**: calculated from item metadata or content using 
*spark-rowsimilarity*
-3. **Intrinsic**: assigned to items as metadata. Can be anything that 
describes the item.
+3. **Intrinsic**: assigned to items as metadata. Can be anything that 
describes the item. These will be used in search engine queries to implement 
business rules.
 
 The query for recommendations will be a mix of values meant to match one of 
your indicators. The query can be constructed 
-from user history and values derived from context (category being viewed for 
instance) or special precalculated data 
+from user history and values derived from context (category being viewed for 
instance) or special pre-calculated data 
 (popularity rank for instance). This blending of indicators allows for creating many flavors of recommendations to fit 
 a very wide variety of circumstances.
 
 With the right mix of indicators developers can construct a single query that 
works for completely new items and new users 
-while working well for items with lots of interactions and users with many 
recorded actions. In other words by adding in content and intrinsic 
-indicators developers can create a solution for the "cold-start" problem that 
gracefully improves with more user history
+while working well for items with lots of interactions and users with many recorded indicators. In other words, by adding in content and intrinsic indicators, developers can create a solution for the "cold-start" problem that gracefully improves with more user history
 and as items have more interactions. It is also possible to create a 
completely content-based recommender that personalizes 
 recommendations.
 
-##Example with 3 Indicators
+## Example with 3 Indicators
 
-You will need to decide how you store user action data so they can be 
processed by the item and row similarity jobs and 
+You will need to decide how you store user indicator data so it can be processed by the item and row similarity jobs, and 
 this is most easily done by using text files as described above. The data that 
is processed by these jobs is considered the 
 training data. You will need some amount of user history in your recs query. 
It is typical to use the most recent user history 
 but it need not be exactly what is in the training set, which may include a greater volume of historical data. Keeping the user 
 history for query purposes could be done with a database by storing it in a 
users table. In the example above the two 
-collaborative filtering actions are "purchase" and "view", but let's also add 
tags (taken from catalog categories or other 
+collaborative filtering indicators are "purchase" and "view", but let's also 
add tags (taken from catalog categories or other 
 descriptive metadata). 
 
-We will need to create 1 cooccurrence indicator from the primary action 
(purchase) 1 cross-action cooccurrence indicator 
-from the secondary action (view) 
+We will need to create 1 cooccurrence indicator from the primary indicator (purchase), 1 cross-occurrence indicator from the secondary indicator (view), 
 and 1 content indicator (tags). We'll have to run *spark-itemsimilarity* once 
and *spark-rowsimilarity* once.
 
 We have described how to create the collaborative filtering indicators for 
purchase and view (the [How to use Multiple User 
-Actions](#multiple-actions) section) but tags will be a slightly different 
process. We want to use the fact that 
+Indicators](#multiple-actions) section) but tags will be a slightly different 
process. We want to use the fact that 
 certain items have tags similar to the ones associated with a user's 
purchases. This is not a collaborative filtering indicator 
 but rather a "content" or "metadata" type indicator since you are not using other users' history, only that of the individual you are making recs for. This means that this method will make recommendations for items that have 
 no collaborative filtering data, as happens with new items in a catalog. New 
items may have tags assigned but no one
  has purchased or viewed them yet. In the final query we will mix all 3 
indicators.
 
-##Content Indicator
+## Content Indicator
 
 To create a content-indicator we'll make use of the fact that the user has purchased items with certain tags. We want to find 
 items with the most similar tags. Notice that other users' behavior is not considered--only other items' tags. This defines a 
@@ -410,9 +403,9 @@ This is a content indicator since it has found other items 
with similar content
     
 We now have three indicators, two collaborative filtering type and one content 
type.
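
Once indexed, each item becomes one search document with a field per indicator. A hypothetical document for the "iphone" item, using the indicator values computed above and made-up tags:

    {
      "id": "iphone",
      "purchase": "ipad",
      "view": "nexus iphone ipad galaxy",
      "tags": "phone apple"
    }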
 
-##Multimodal Recommender Query
+## Multimodal Recommender Query
 
-The actual form of the query for recommendations will vary depending on your 
search engine but the intent is the same. For a given user, map their history 
of an action or content to the correct indicator field and perform an OR'd 
query. 
+The actual form of the query for recommendations will vary depending on your search engine but the intent is the same. For a given user, map their history for each indicator, or their content, to the correct indicator field and perform an OR'd query. 
 
 We have 3 indicators; these are indexed by the search engine into 3 fields, which we'll call "purchase", "view", and "tags". 
 We take the user's history that corresponds to each indicator and create a query of the form:
@@ -425,7 +418,7 @@ We take the user's history that corresponds to each 
indicator and create a query
 The query will result in an ordered list of items recommended for purchase but 
skewed towards items with similar tags to 
 the ones the user has already purchased. 
 
-This is only an example and not necessarily the optimal way to create recs. It 
illustrates how business decisions can be 
+This is only an example and not necessarily the optimal way to create recs. It 
illustrates how business rules can be 
 translated into recommendations. This technique can be used to skew 
recommendations towards intrinsic indicators also. 
 For instance you may want to put personalized popular item recs in a special 
place in the UI. Create a popularity indicator 
 by tagging items with some category of popularity (hot, warm, cold for 
instance) then
@@ -439,8 +432,9 @@ on the popularity field. If we use the ecom example but use 
the query to get "ho
 
 This will return recommendations favoring ones that have the intrinsic 
indicator "hot".
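
In Lucene/Solr query syntax such a boosted query might look like the following (field names from this example; the history values and the boost factor are illustrative):

    purchase:(iphone ipad) view:(iphone ipad nexus) tags:(phone tablet) popularity:hot^10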
 
-##Notes
-1. Use as much user action history as you can gather. Choose a primary action 
that is closest to what you want to recommend and the others will be used to 
create cross-cooccurrence indicators. Using more data in this fashion will 
almost always produce better recommendations.
-2. Content can be used where there is no recorded user behavior or when items 
change too quickly to get much interaction history. They can be used alone or 
mixed with other indicators.
-3. Most search engines support "boost" factors so you can favor one or more 
indicators. In the example query, if you want tags to only have a small effect 
you could boost the CF indicators.
-4. In the examples we have used space delimited strings for lists of IDs in 
indicators and in queries. It may be better to use arrays of strings if your 
storage system and search engine support them. For instance Solr allows 
multi-valued fields, which correspond to arrays.
+## Notes
+
+ 1. Use as much user indicator history as you can gather. Choose a primary 
indicator that is closest to what you want to recommend and the others will be 
used to create cross-cooccurrence indicators. Using more data in this fashion 
will almost always produce better recommendations.
+ 2. Content can be used where there is no recorded user behavior or when items change too quickly to get much interaction history. It can be used alone or mixed with other indicators.
+ 3. Most search engines support "boost" factors so you can favor one or more 
indicators. In the example query, if you want tags to only have a small effect 
you could boost the CF indicators.
+ 4. In the examples we have used space-delimited strings for lists of IDs in indicators and in queries. It may be better to use arrays of strings if your storage system and search engine support them. For instance Solr allows multi-valued fields, which correspond to arrays; see the schema sketch below.
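
A hypothetical Solr schema fragment for those multi-valued indicator fields (field names from this example; choose types to suit your analysis chain):

    <field name="purchase" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="view" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>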
