intro-cooccurrence-spark.html

buildbot Wed, 01 Oct 2014 09:54:18 -0700

Author: buildbot
Date: Wed Oct  1 16:52:54 2014
New Revision: 924306

Log:
Staging update by buildbot for mahout


Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Oct  1 16:52:54 2014
@@ -1 +1 @@
-1626592
+1628770

Modified: 
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
 Wed Oct  1 16:52:54 2014
@@ -348,7 +348,7 @@ to recommend.   </p>
 </pre></div>
 
 
-<h3 id="how-to-use-multiple-user-actions">How to use Multiple User Actions</h3>
+<h3 id="wzxhzdk17how-to-use-multiple-user-actionswzxhzdk18"><a 
name="multiple-actions">How to use Multiple User Actions</a></h3>
 <p>Often we record various actions the user takes for later analytics. These 
can now be used to make recommendations. 
 The idea of a recommender is to recommend the action you want the user to 
make. For an ecom app this might be 
 a purchase action. It is usually not a good idea to just treat other actions 
the same as the action you want to recommend. 
@@ -524,26 +524,76 @@ by a list of the most similar rows.</p>
 
 
 <p>See RowSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. </p>
-<h1 id="3-creating-a-recommender">3. Creating a Recommender</h1>
-<p>One significant output option for the spark-itemsimilarity job is 
--omitStrength. This creates a tab-delimited file containing a itemID token 
followed by a space delimited string of tokens of the form:</p>
-<div class="codehilite"><pre><span class="n">itemID</span><span 
class="o">&lt;</span><span class="n">tab</span><span class="o">&gt;</span><span 
class="n">itemsIDs</span><span class="o">-</span><span 
class="n">from</span><span class="o">-</span><span class="n">the</span><span 
class="o">-</span><span class="n">indicator</span><span class="o">-</span><span 
class="n">matrix</span>
-</pre></div>
-
-
-<p>To create a cooccurrence type collaborative filtering recommender using a 
search engine simply index this output created with --omitStrength. Then at 
runtime query the indexed data with the current user's history of the primary 
action on the index field that contains the primary indicator tokens. The 
result will be an ordered list of itemIDs as recommendations.</p>
-<p>It is possible to include the indicator strengths by attaching them to the 
tokens before indexing but that is engine specific and beyond this description. 
Using without weights generally provides good results since the indicators have 
been downsampled by strength so the indicator matrix has some degree of quality 
guarantee. </p>
-<h2 id="multi-action-recommendations">Multi-action Recommendations</h2>
-<p>Optionally the query can contain the user's history of a secondary action 
(input with --input2) against the cross-indicator tokens as a second field.</p>
-<p>In this case the indicator-matrix and the cross-indicator-matrix should be 
combined and indexed as two fields. The data will be of the form:</p>
-<div class="codehilite"><pre><span class="n">itemID</span><span 
class="p">,</span> <span class="n">itemIDs</span><span class="o">-</span><span 
class="n">from</span><span class="o">-</span><span 
class="n">indicator</span><span class="o">-</span><span 
class="n">matrix</span><span class="p">,</span> <span 
class="n">itemIDs</span><span class="o">-</span><span 
class="n">from</span><span class="o">-</span><span class="nb">cross</span><span 
class="o">-</span><span class="n">indicator</span><span class="o">-</span><span 
class="n">matrix</span>
-</pre></div>
-
-
-<p>Now the query will have one string of the user's primary action history and 
a second of the user's secondary action history against two fields in the 
index.</p>
-<p>It is probably better to index the two (or more) fields as multi-valued 
fields (arrays) and query them as such but the above works in much the same way 
if the indexed tokens are space delimited as is the query string. </p>
-<p><strong>Note:</strong> Using the underlying code it is possible to use as 
many actions as you have data for to create a multi-action recommender that 
makes the most of available data. The CLI only supports two actions.</p>
-<h1 id="4-using-spark-rowsimilarity-with-text-data">4. Using 
<em>spark-rowsimilarity</em> with Text Data</h1>
-<p>Another use case for these jobs is in finding similar textual content. For 
instance given the content of a blog post, which other posts are similar. In 
this case the columns are tokenized words and the rows are documents. Since LLR 
is being used there is no need to attach TF-IDF weights to the 
tokens&mdash;they will not be used. The Apache <a 
href="http://lucene.apache.org">Lucene</a> project provides several methods of 
<a 
href="http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description">analyzing
 and tokenizing</a> documents.</p>
+<h1 id="3-using-spark-rowsimilarity-with-text-data">3. Using 
<em>spark-rowsimilarity</em> with Text Data</h1>
+<p>Another use case for <em>spark-rowsimilarity</em> is finding similar 
textual content, for instance finding which other blog posts are similar to a 
given post. In this case the columns are terms and the rows are documents. 
Since LLR is the only similarity method supported, this is not the optimal way 
to determine document similarity; LLR is used more as a quality-of-similarity 
filter than as a similarity measure. However, <em>spark-rowsimilarity</em> will 
produce a list of similar docs for every doc. 
The Apache <a href="http://lucene.apache.org">Lucene</a> project provides 
several methods of <a 
href="http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description">analyzing
 and tokenizing</a> documents.</p>
+<h1 id="4-creating-a-unified-recommender">4. Creating a Unified 
Recommender</h1>
+<p>Using the output of <em>spark-itemsimilarity</em> and 
<em>spark-rowsimilarity</em> you can build a unified cooccurrence and 
content-based recommender that can be used in either mode or both, depending on 
the indicators available and the history available at runtime for a user.</p>
+<h2 id="requirements">Requirements</h2>
+<ol>
+<li>Mahout SNAPSHOT-1.0 or later</li>
+<li>Hadoop</li>
+<li>Spark, the correct version for your version of Mahout and Hadoop</li>
+<li>A search engine like Solr or Elasticsearch</li>
+</ol>
+<h2 id="example-with-3-indicators">Example with 3 Indicators</h2>
+<p>You will need to decide how to store user action data so that it can be 
processed by the item and row similarity jobs; this is most easily done with 
text files as described above. The data processed by these jobs is considered 
the <strong>training data</strong>. You will also need some amount of user 
history in your recs query. It is typical to use the most recent user history, 
but it need not be exactly what is in the training set, which may include more 
historical data. The user history for query purposes could be kept in a 
database, for example referenced from a users table. In the example above the 
two collaborative filtering actions are "purchase" and "view", but let's also 
add tags (taken from catalog categories or other descriptive metadata). </p>
+<p>We will need to create 1 indicator from the primary action (purchase), 1 
cross-indicator from the secondary action (view), and 1 content-indicator from 
tags. We'll have to run <em>spark-itemsimilarity</em> once and 
<em>spark-rowsimilarity</em> once.</p>
+<p>We have described how to create the indicator and cross-indicator for 
purchase and view (the <a href="#multiple-actions">How to use Multiple User 
+Actions</a> section) but tags will be a slightly different process. We want to 
use the fact that 
+certain items have tags similar to the ones associated with a user's 
purchases. This is not a collaborative filtering indicator 
+but rather a "content" or "metadata" type indicator since you are not using 
other users' tag viewing history, only that of the 
+individual that you are making recs for. This means that this method will make 
recommendations for items that have 
+no collaborative filtering data, as happens with new items in a catalog. New 
items may have tags assigned but no one
+ has purchased or viewed them yet. </p>
+<p>We could have treated viewing tags as a collaborative filtering 
cross-indicator by recording other users' tag viewing history, and that would 
probably give better results, but here we are trying to illustrate recommending 
without CF data by using content-indicators. In the final query we will mix 
all 3 indicators.</p>
+<h2 id="content-indicator">Content Indicator</h2>
+<p>To create a content-indicator we'll make use of the fact that the user has 
purchased items with certain tags. We want to find items with the most similar 
tags. Notice that other users' behavior is not considered, only other items' 
tags. This defines a content or metadata indicator. Such indicators are used 
when you want to find items that are similar to other items by their content or 
metadata, not by which users interacted with them.</p>
+<p>For this we need input of the form:</p>
+<div class="codehilite"><pre><span class="n">itemID</span><span 
class="o">&lt;</span><span class="n">tab</span><span class="o">&gt;</span><span 
class="n">list</span><span class="o">-</span><span class="n">of</span><span 
class="o">-</span><span class="n">tags</span>
+<span class="p">...</span>
+</pre></div>
+
+
+<p>The full collection will look like the tags column from a catalog DB. For 
our ecom example it might be:</p>
+<div class="codehilite"><pre>3459860<span class="n">b</span><span 
class="o">&lt;</span><span class="n">tab</span><span class="o">&gt;</span><span 
class="n">men</span> <span class="n">long</span><span class="o">-</span><span 
class="n">sleeve</span> <span class="n">chambray</span> <span 
class="n">clothing</span> <span class="n">casual</span>
+9446577<span class="n">d</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span><span class="n">women</span> 
<span class="n">tops</span> <span class="n">chambray</span> <span 
class="n">clothing</span> <span class="n">casual</span>
+<span class="p">...</span>
+</pre></div>
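
+<p>As a minimal sketch, rows in this form could be produced from catalog data with a few lines of Python. The <code>catalog</code> mapping and the function name <code>format_tag_rows</code> are hypothetical and used only for illustration; they are not part of Mahout.</p>

```python
def format_tag_rows(catalog):
    """Turn {itemID: [tag, ...]} into 'itemID<tab>space-delimited-tags' lines.

    `catalog` is a hypothetical in-memory mapping; in practice the tags
    would come from your catalog DB.
    """
    rows = []
    for item_id, tags in catalog.items():
        # One row per item: the ID, a tab, then the tags separated by spaces.
        rows.append(f"{item_id}\t{' '.join(tags)}")
    return rows


if __name__ == "__main__":
    example = {
        "3459860b": ["men", "long-sleeve", "chambray", "clothing", "casual"],
        "9446577d": ["women", "tops", "chambray", "clothing", "casual"],
    }
    print("\n".join(format_tag_rows(example)))
```

+<p>The resulting lines can be written to a text file and passed as input to the row similarity job.</p>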
+
+
+<p>We'll use <em>spark-rowsimilarity</em> because we are looking for similar 
rows, which encode items in this case. As with the indicator and 
cross-indicator we use the --omitStrength option. The strengths created are 
probabilistic log-likelihood ratios and so are used to filter unimportant 
similarities. Once the filtering or downsampling is finished we no longer need 
the strengths. We will get an indicator matrix of the form:</p>
+<div class="codehilite"><pre><span class="n">itemID</span><span 
class="o">&lt;</span><span class="n">tab</span><span class="o">&gt;</span><span 
class="n">list</span><span class="o">-</span><span class="n">of</span><span 
class="o">-</span><span class="n">item</span> <span class="n">IDs</span>
+<span class="p">...</span>
+</pre></div>
+
+
+<p>This is a content indicator since it has found other items with similar 
content or metadata.</p>
+<div class="codehilite"><pre>3459860<span class="n">b</span><span 
class="o">&lt;</span><span class="n">tab</span><span 
class="o">&gt;</span>3459860<span class="n">b</span> 3459860<span 
class="n">b</span> 6749860<span class="n">c</span> 5959860<span 
class="n">a</span> 3434860<span class="n">a</span> 3477860<span 
class="n">a</span>
+9446577<span class="n">d</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span>9446577<span class="n">d</span> 
9496577<span class="n">d</span> 0943577<span class="n">d</span> 8346577<span 
class="n">d</span> 9442277<span class="n">d</span> 9446577<span 
class="n">e</span>
+<span class="p">...</span>
+</pre></div>
+
+
+<p>We now have three indicators, two collaborative filtering type and one 
content type. Notice that purchase, view, and tags can all be recorded for 
users and so can be used in a recommendations query.</p>
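+<p>Before indexing, the three indicator outputs have to be combined into one document per item, with one field per indicator. The following is only a sketch under stated assumptions: it assumes each output file has the tab-delimited <code>--omitStrength</code> form shown above, and the function names and the field names ("purchase", "view", "tags") are illustrative choices, not a Mahout API.</p>

```python
def parse_indicator(lines):
    """Parse 'itemID<tab>space-delimited-tokens' lines into {itemID: [tokens]}."""
    indicator = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        item_id, _, tokens = line.partition("\t")
        indicator[item_id] = tokens.split()
    return indicator


def merge_indicators(purchase, view, tags):
    """Build one document per item with one field per indicator.

    Items missing from an indicator get an empty list for that field,
    e.g. new catalog items that only have a content indicator.
    """
    item_ids = sorted(set(purchase) | set(view) | set(tags))
    return [
        {
            "id": item_id,
            "purchase": purchase.get(item_id, []),
            "view": view.get(item_id, []),
            "tags": tags.get(item_id, []),
        }
        for item_id in item_ids
    ]
```

+<p>Each resulting document can then be sent to the search engine using whatever indexing client it provides.</p>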
+<h2 id="unified-recommender-query">Unified Recommender Query</h2>
+<p>The actual form of the query for recommendations will vary depending on 
your search engine but the intent is the same. For a given user, map their 
history of an action or content to the correct indicator field and perform an 
OR'd query. This will allow matches from any indicator, whereas AND queries 
require that an item have some similarity to all indicator fields.</p>
+<p>We have 3 indicators, which are indexed by the search engine into 3 fields 
that we'll call "purchase", "view", and "tags". We take the user's history that 
corresponds to each indicator and create a query of the form:</p>
+<div class="codehilite"><pre><span class="n">Query</span><span 
class="o">:</span>
+  <span class="n">field</span><span class="o">:</span> <span 
class="n">purchase</span><span class="o">;</span> <span class="n">q</span><span 
class="o">:</span><span class="n">user</span><span 
class="s1">&#39;s-purchase-history</span>
+<span class="s1">  field: view; q:user&#39;</span><span class="n">s</span> 
<span class="n">view</span><span class="o">-</span><span 
class="n">history</span>
+  <span class="n">field</span><span class="o">:</span> <span 
class="n">tags</span><span class="o">;</span> <span class="n">q</span><span 
class="o">:</span><span class="n">user</span><span 
class="err">&#39;</span><span class="n">s</span><span class="o">-</span><span 
class="n">tags</span><span class="o">-</span><span 
class="n">associated</span><span class="o">-</span><span 
class="k">with</span><span class="o">-</span><span class="n">purchases</span>
+</pre></div>
+
+
+<p>The query will result in an ordered list of items recommended for purchase 
but skewed towards items whose tags are similar to those of items the user has 
already purchased. </p>
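+<p>As one possible concrete form, the query can be expressed in the Elasticsearch query DSL as a <code>bool</code> query with one <code>should</code> clause per indicator field, which gives the OR behavior described above. This is a sketch, not the only or the recommended way to do it: the function name and the optional per-field <code>boosts</code> argument are hypothetical, and other engines such as Solr use a different query syntax.</p>

```python
def build_unified_query(purchase_history, view_history, tag_history, boosts=None):
    """Build an OR'd query over the "purchase", "view", and "tags" fields.

    Each history is a list of tokens (item IDs or tags). Fields with no
    history are left out, so the same function covers users who only have
    some kinds of history. `boosts` optionally maps a field name to a
    boost factor; fields without an entry get 1.0.
    """
    boosts = boosts or {}
    histories = {
        "purchase": purchase_history,
        "view": view_history,
        "tags": tag_history,
    }
    should = []
    for field, tokens in histories.items():
        if not tokens:
            continue  # no history for this indicator
        should.append(
            {
                "match": {
                    field: {
                        # Space-delimited tokens, matching the indexed form.
                        "query": " ".join(tokens),
                        "boost": boosts.get(field, 1.0),
                    }
                }
            }
        )
    # "should" clauses are OR'd: an item may match any of the fields.
    return {"query": {"bool": {"should": should}}}
```

+<p>For example, <code>build_unified_query(purchases, views, tags, boosts={"purchase": 2.0, "view": 2.0})</code> would favor the two CF indicators over tags.</p>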
+<p>This is only an example and not necessarily the optimal way to create recs. 
It illustrates how business decisions can be translated into recommendations. 
This technique can also be used to skew recommendations towards intrinsic 
indicators. For instance you may want to put personalized popular item recs in 
a special place in the UI. Create a popularity indicator using whatever method 
you want, index it as a new indicator field, and include the corresponding 
value in a query on the popularity field. </p>
+<h2 id="notes">Notes</h2>
+<ol>
+<li>Use as much user action history as you can gather. Choose a primary action 
that is closest to what you want to recommend and the others will be used to 
create cross-indicators. Using more data in this fashion will almost always 
produce better recommendations.</li>
+<li>Content indicators can be used where there is no recorded user behavior or 
when items change too quickly to get much interaction history. They can be used 
alone or mixed with other indicators.</li>
+<li>Most search engines support "boost" factors so you can favor one or more 
indicators. In the example query, if you want tags to only have a small effect 
you could boost the CF indicators.</li>
+<li>In the examples we have used space delimited strings for lists of IDs in 
indicators and in queries. It may be better to use arrays of strings if your 
storage system and search engine support them. For instance Solr allows 
multi-valued fields, which correspond to arrays.</li>
+</ol>
    </div>
   </div>     
 </div>

svn commit: r924306 - in /websites/staging/mahout/trunk/content: ./ users/recommender/intro-cooccurrence-spark.html
