intro-cooccurrence-spark.html

buildbot Thu, 02 Oct 2014 13:48:07 -0700

Author: buildbot
Date: Thu Oct  2 20:47:33 2014
New Revision: 924450

Log:
Staging update by buildbot for mahout


Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Oct  2 20:47:33 2014
@@ -1 +1 @@
-1628847
+1629066

Modified: 
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
 Thu Oct  2 20:47:33 2014
@@ -525,9 +525,16 @@ by a list of the most similar rows.</p>
 
 <p>See RowSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. </p>
 <h1 id="3-using-spark-rowsimilarity-with-text-data">3. Using 
<em>spark-rowsimilarity</em> with Text Data</h1>
-<p>Another use case for <em>spark-rowsimilarity</em> is in finding similar 
textual content. For instance given the content of a blog post, which other 
posts are similar. In this case the columns are terms and the rows are 
documents. Since LLR is the only similarity method supported this is not the 
optimal way to determine document similarity. LLR is used more as a quality of 
similarity filter than as a similarity measure. However 
<em>spark-rowsimilarity</em> will produce lists of similar docs for every doc. 
The Apache <a href="http://lucene.apache.org">Lucene</a> project provides 
several methods of <a 
href="http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description">analyzing
 and tokenizing</a> documents.</p>
+<p>Another use case for <em>spark-rowsimilarity</em> is finding similar 
textual content. For instance, given the tags associated with 
+a blog post,
+ which other posts have similar tags? In this case the columns are tags and 
the rows are posts. Since LLR is 
+the only similarity method supported, this is not the optimal way to determine 
general "bag-of-words" document similarity. 
+LLR is used more as a quality filter than as a similarity measure. However, 
<em>spark-rowsimilarity</em> will produce 
+lists of similar docs for every doc if the input is docs with lists of terms. The 
Apache <a href="http://lucene.apache.org">Lucene</a> project provides several 
methods of <a 
href="http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description">analyzing
 and tokenizing</a> documents.</p>
 <h1 id="wzxhzdk234-creating-a-unified-recommenderwzxhzdk24"><a 
name="unified-recommender">4. Creating a Unified Recommender</a></h1>
-<p>Using the output of <em>spark-itemsimilarity</em> and 
<em>spark-rowsimilarity</em> you can build a unified cooccurrnce and content 
based recommender that can be used in both or either mode depending on 
indicators available and the history available at runtime for a user.</p>
+<p>Using the output of <em>spark-itemsimilarity</em> and 
<em>spark-rowsimilarity</em> you can build a unified cooccurrence and content 
based
+ recommender that can be used in both or either mode depending on indicators 
available and the history available at 
+runtime for a user.</p>
 <h2 id="requirements">Requirements</h2>
 <ol>
 <li>Mahout SNAPSHOT-1.0 or later</li>
@@ -535,6 +542,23 @@ by a list of the most similar rows.</p>
 <li>Spark, the correct version for your version of Mahout and Hadoop</li>
 <li>A search engine like Solr or Elasticsearch</li>
 </ol>
+<h2 id="indicators">Indicators</h2>
+<p>Indicators come in 3 types:</p>
+<ol>
+<li><strong>Cooccurrence</strong>: calculated with 
<em>spark-itemsimilarity</em> from user actions</li>
+<li><strong>Content</strong>: calculated from item metadata or content using 
<em>spark-rowsimilarity</em></li>
+<li><strong>Intrinsic</strong>: assigned to items as metadata. Can be anything 
that describes the item.</li>
+</ol>
+<p>The query for recommendations will be a mix of values, each meant to match one of 
your indicators. The query can be constructed 
+from user history and values derived from context (the category being viewed, for 
instance) or special precalculated data 
+(popularity rank, for instance). This blending of indicators allows for 
creating many flavors of recommendations to fit 
+a very wide variety of circumstances. It allows recommendations to be made for 
items with no usage data and even allows 
+for gracefully degrading recommendations based on how much user history is 
available. </p>
+<p>With the right mix of indicators, developers can construct a single query 
that works for completely new items and new users 
+while working well for items with lots of interactions and users with many 
recorded actions. In other words, adding in content and intrinsic 
+indicators allows developers to create a solution for the "cold-start" problem 
that gracefully improves as user history grows
+and as items gain more interactions. It is also possible to create a 
completely content-based recommender that personalizes 
+recommendations.</p>
 <h2 id="example-with-3-indicators">Example with 3 Indicators</h2>
 <p>You will need to decide how you store user action data so it can be 
processed by the item and row similarity jobs; this is most easily done by 
using text files as described above. The data that is processed by these jobs 
is considered the <strong>training data</strong>. You will need some amount of 
user history in your recs query. It is typical to use the most recent user 
history, but it need not be exactly what is in the training set, which may include 
more historical data. Keeping the user history for query purposes could be done 
with a database by referencing some history from a users table. In the example 
above the two collaborative filtering actions are "purchase" and "view", but 
let's also add tags (taken from catalog categories or other descriptive 
metadata). </p>
 <p>We will need to create 1 indicator from the primary action (purchase), 1 
cross-indicator from the secondary action (view), and 1 content-indicator for 
tags. We'll have to run <em>spark-itemsimilarity</em> once and 
<em>spark-rowsimilarity</em> once.</p>
@@ -576,7 +600,10 @@ no collaborative filtering data, as happ
 
 <p>We now have three indicators: two collaborative filtering indicators and one 
content indicator. Notice that purchase, view, and tags can all be recorded for 
users and so can be used in a recommendations query.</p>
 <h2 id="unified-recommender-query">Unified Recommender Query</h2>
-<p>The actual form of the query for recommendations will vary depending on 
your search engine but the intent is the same. For a given user, map their 
history of an action or content to the correct indicator field and preform and 
OR'd the query. This will allow matches from any indicator where AND queries 
require that an item have some similarity to all indicator fields.</p>
+<p>The actual form of the query for recommendations will vary depending on 
your search engine but the intent is the same. 
+For a given user, map their history of an action or content to the correct 
indicator field and perform an OR'd query. 
+This will allow matches from any indicator, whereas AND queries would require that an 
item have some similarity to all indicator 
+fields.</p>
 <p>We have 3 indicators; these are indexed by the search engine into 3 fields, 
which we'll call "purchase", "view", and "tags". We take the user's history that 
corresponds to each indicator and create a query of the form:</p>
 <div class="codehilite"><pre><span class="n">Query</span><span 
class="o">:</span>
   <span class="n">field</span><span class="o">:</span> <span 
class="n">purchase</span><span class="o">;</span> <span class="n">q</span><span 
class="o">:</span><span class="n">user</span><span 
class="s1">&#39;s-purchase-history</span>
@@ -586,7 +613,17 @@ no collaborative filtering data, as happ
 
 
 <p>The query will result in an ordered list of items recommended for purchase 
but skewed towards items with similar tags to the ones the user has already 
purchased. </p>
-<p>This is only an example and not necessarily the optimal way to create recs. 
It illustrates how business decisions can be translated into recommendations. 
This technique can be used to skew recommendations towards intrinsic indicators 
also. For instance you may want to put personalized popular item recs in a 
special place in the UI. Create a popularity indicator using whatever method 
you want and index that as a new indicator field and include the corresponding 
value in a query on the popularity field. </p>
+<p>This is only an example and not necessarily the optimal way to create recs. 
It illustrates how business decisions can be 
+translated into recommendations. This technique can be used to skew 
recommendations towards intrinsic indicators also. 
+For instance, you may want to put personalized popular item recs in a special 
place in the UI. Create a popularity indicator 
+by tagging items with some category of popularity (hot, warm, or cold, for 
instance), then
+index that as a new indicator field and include the corresponding value in a 
query 
+on the popularity field. If we take the ecom example but use the query to get 
"hot" recommendations, it might look like this:</p>
+<p>Query:
+      field: purchase; q:user's-purchase-history
      field: view; q:user's-view-history
+      field: popularity; q:"hot"</p>
+<p>This will return recommendations favoring items that have the intrinsic 
indicator "hot".</p>
 <h2 id="notes">Notes</h2>
 <ol>
 <li>Use as much user action history as you can gather. Choose a primary action 
that is closest to what you want to recommend and the others will be used to 
create cross-indicators. Using more data in this fashion will almost always 
produce better recommendations.</li>
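
As a rough illustration of the OR'd query described in the "Unified Recommender Query" section of the page changed above, here is a minimal Python sketch. It assumes an Elasticsearch-style request body in which each indicator field gets one `match` clause inside `bool`/`should`. The function name `build_unified_query` and the item IDs are hypothetical and do not come from Mahout or the committed page. Indicators with no history are left out of the query, which is one possible way to get the graceful degradation the page describes for new users.

```python
from typing import Dict, List


def build_unified_query(history: Dict[str, List[str]]) -> dict:
    """Build an OR'd query with one clause per indicator field.

    `history` maps an indicator field name (e.g. "purchase", "view",
    "popularity") to the values to match in that field. Field names and
    values here are illustrative assumptions, not part of any Mahout API.
    """
    should = [
        # One match clause per field; the values are joined into a single
        # space-separated query string.
        {"match": {field: " ".join(values)}}
        for field, values in history.items()
        if values  # skip indicators for which there is nothing to match
    ]
    # Clauses under bool/should are combined with OR semantics, so an item
    # can match on any indicator rather than on all of them.
    return {"query": {"bool": {"should": should}}}


# Hypothetical user: some purchases, no views yet, plus the intrinsic
# "hot" popularity indicator from the example in the page.
query = build_unified_query({
    "purchase": ["iphone", "ipad"],
    "view": [],
    "popularity": ["hot"],
})
print(query)
```

The resulting dictionary could be serialized to JSON and sent as a search request body. With Solr the same idea would be expressed as field-qualified terms joined by OR instead.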

svn commit: r924450 - in /websites/staging/mahout/trunk/content: ./ users/recommender/intro-cooccurrence-spark.html
