Author: buildbot
Date: Mon Mar 9 00:11:42 2015
New Revision: 942934
Log:
Staging update by buildbot for mahout
Modified:
websites/staging/mahout/trunk/content/ (props changed)
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Mar 9 00:11:42 2015
@@ -1 +1 @@
-1665091
+1665098
Modified:
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
(original)
+++
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
Mon Mar 9 00:11:42 2015
@@ -255,6 +255,10 @@ the query without recalculating the mode
be used to create "other people also liked these things" type recommendations
and paired with a search engine can
personalize multimodal recommendations for individual users.
<em>spark-rowsimilarity</em> can provide non-personalized content based
recommendations and when paired with a search engine can be used to
personalize content based recommendations.</p>
+<p><img alt="image"
src="http://s6.postimg.org/r0m8bpjw1/recommender_architecture.png" /></p>
+<p>This is a simplified Lambda architecture with Mahout's
<em>spark-itemsimilarity</em> playing the batch model building role and a
search engine playing the realtime serving role.</p>
+<p>You will create two collections, one for user history and one for item
"indicators". Indicators are user interactions that lead to the desired
interaction. For example, if you want a user to purchase something and you
collect all users' purchase interactions, <em>spark-itemsimilarity</em> will
create a purchase indicator from them. You can also use other user
interactions in a cross-cooccurrence calculation to create purchase
indicators.</p>
+<p>User history is used as a query on the item collection with its
cooccurrence and cross-cooccurrence indicators (there may be several
indicators). The primary interaction or action is chosen to be the thing you
want to recommend; other actions are believed to be correlated but may not
indicate exactly the same user intent. For instance, in an e-commerce recommender a
purchase is a very good primary action, but you may also know product
detail-views or additions-to-wishlists. These can be considered secondary
actions, which may all be used to calculate cross-cooccurrence indicators. The
user history that forms the recommendations query will contain recorded primary
and secondary actions, all targeted towards the correct indicator fields.</p>
<h2 id="references">References</h2>
<ol>
<li>A free ebook, which talks about the general idea: <a
href="https://www.mapr.com/practical-machine-learning">Practical Machine
Learning</a></li>
@@ -270,7 +274,7 @@ and <a href="http://occamsmachete.com/m
<p>Mahout's mapreduce version of itemsimilarity takes a text file that is
expected to have user and item IDs that conform to
Mahout's ID requirements--they are non-negative integers that can be viewed as
row and column numbers in a matrix.</p>
<p><em>spark-itemsimilarity</em> also extends the notion of cooccurrence to
cross-cooccurrence, in other words the Spark version will
-account for multi-modal interactions and create cross-indicator matrices
allowing the use of much more data in
+account for multi-modal interactions and create cross-cooccurrence indicator
matrices allowing the use of much more data in
creating recommendations or similar item lists. People try to do this by
mixing different actions and giving them weights.
For instance they might say an item-view is 0.2 of an item purchase. In
practice this is often not helpful. Spark-itemsimilarity's
cross-cooccurrence is a more principled way to handle this case. In effect it
scrubs secondary actions with the action you want
@@ -370,7 +374,7 @@ For instance a view of an item does not
might even make worse recommendations. It is tempting though since there are
so many more views than purchases. With <em>spark-itemsimilarity</em>
we can now use both actions. Mahout will use cross-action cooccurrence
analysis to limit the views to ones that do predict purchases.
We do this by treating the primary action (purchase) as data for the indicator
matrix and use the secondary action (view)
-to calculate the cross-indicator matrix. </p>
+to calculate the cross-cooccurrence indicator matrix. </p>
<p><em>spark-itemsimilarity</em> can read separate actions from separate files
or from a mixed action log by filtering certain lines. For a mixed
action log of the form:</p>
<div class="codehilite"><pre><span class="n">u1</span><span
class="p">,</span><span class="n">purchase</span><span class="p">,</span><span
class="n">iphone</span>
@@ -412,14 +416,14 @@ action log of the form:</p>
<h3 id="output">Output</h3>
<p>The output of the job will be the standard text version of two Mahout DRMs.
This is a case where we are calculating
-cross-cooccurrence so a primary indicator matrix and cross-indicator matrix
will be created</p>
+cross-cooccurrence so a primary indicator matrix and cross-cooccurrence
indicator matrix will be created</p>
<div class="codehilite"><pre><span class="n">out</span><span
class="o">-</span><span class="n">path</span>
- <span class="o">|--</span> <span class="n">indicator</span><span
class="o">-</span><span class="n">matrix</span> <span class="o">-</span> <span
class="n">TDF</span> <span class="n">part</span> <span class="n">files</span>
- <span class="o">\--</span> <span class="nb">cross</span><span
class="o">-</span><span class="n">indicator</span><span class="o">-</span><span
class="n">matrix</span> <span class="o">-</span> <span class="n">TDF</span>
<span class="n">part</span><span class="o">-</span><span class="n">files</span>
+ <span class="o">|--</span> <span class="n">similarity</span><span
class="o">-</span><span class="n">matrix</span> <span class="o">-</span> <span
class="n">TDF</span> <span class="n">part</span> <span class="n">files</span>
+ <span class="o">\--</span> <span class="nb">cross</span><span
class="o">-</span><span class="n">similarity</span><span
class="o">-</span><span class="n">matrix</span> <span class="o">-</span> <span
class="n">TDF</span> <span class="n">part</span><span class="o">-</span><span
class="n">files</span>
</pre></div>
-<p>The indicator matrix will contain the lines:</p>
+<p>The similarity-matrix will contain the lines:</p>
<div class="codehilite"><pre><span class="n">galaxy</span><span
class="o">\</span><span class="n">tnexus</span><span class="p">:</span>1<span
class="p">.</span>7260924347106847
<span class="n">ipad</span><span class="o">\</span><span
class="n">tiphone</span><span class="p">:</span>1<span
class="p">.</span>7260924347106847
<span class="n">nexus</span><span class="o">\</span><span
class="n">tgalaxy</span><span class="p">:</span>1<span
class="p">.</span>7260924347106847
@@ -428,7 +432,7 @@ cross-cooccurrence so a primary indicato
</pre></div>
-<p>The cross-indicator matrix will contain:</p>
+<p>The cross-similarity-matrix will contain:</p>
<div class="codehilite"><pre><span class="n">iphone</span><span
class="o">\</span><span class="n">tnexus</span><span class="p">:</span>1<span
class="p">.</span>7260924347106847 <span class="n">iphone</span><span
class="p">:</span>1<span class="p">.</span>7260924347106847 <span
class="n">ipad</span><span class="p">:</span>1<span
class="p">.</span>7260924347106847 <span class="n">galaxy</span><span
class="p">:</span>1<span class="p">.</span>7260924347106847
<span class="n">ipad</span><span class="o">\</span><span
class="n">tnexus</span><span class="p">:</span>0<span
class="p">.</span>6795961471815897 <span class="n">iphone</span><span
class="p">:</span>0<span class="p">.</span>6795961471815897 <span
class="n">ipad</span><span class="p">:</span>0<span
class="p">.</span>6795961471815897 <span class="n">galaxy</span><span
class="p">:</span>0<span class="p">.</span>6795961471815897
<span class="n">nexus</span><span class="o">\</span><span
class="n">tnexus</span><span class="p">:</span>0<span
class="p">.</span>6795961471815897 <span class="n">iphone</span><span
class="p">:</span>0<span class="p">.</span>6795961471815897 <span
class="n">ipad</span><span class="p">:</span>0<span
class="p">.</span>6795961471815897 <span class="n">galaxy</span><span
class="p">:</span>0<span class="p">.</span>6795961471815897
@@ -438,7 +442,7 @@ cross-cooccurrence so a primary indicato
<p><strong>Note:</strong> You can run this multiple times to use more than two
actions or you can use the underlying
-SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any
number of cross-indicators.</p>
+SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any
number of cross-cooccurrence indicators.</p>
<h3 id="log-file-input">Log File Input</h3>
<p>A common method of storing data is in log files. If they are written using
some delimiter they can be consumed directly by spark-itemsimilarity. For
instance input of the form:</p>
<div class="codehilite"><pre>2014<span class="o">-</span>06<span
class="o">-</span>23 14<span class="p">:</span>46<span
class="p">:</span>53<span class="p">.</span>115<span class="o">\</span><span
class="n">tu1</span><span class="o">\</span><span
class="n">tpurchase</span><span class="o">\</span><span
class="n">trandom</span> <span class="n">text</span><span
class="o">\</span><span class="n">tiphone</span>
@@ -583,7 +587,7 @@ descriptive metadata). </p>
<p>We will need to create 1 cooccurrence indicator from the primary action
(purchase), 1 cross-action cooccurrence indicator
from the secondary action (view),
and 1 content indicator (tags). We'll have to run
<em>spark-itemsimilarity</em> once and <em>spark-rowsimilarity</em> once.</p>
-<p>We have described how to create the collaborative filtering indicator and
cross-indicator for purchase and view (the <a href="#multiple-actions">How to
use Multiple User
+<p>We have described how to create the collaborative filtering indicators for
purchase and view (the <a href="#multiple-actions">How to use Multiple User
Actions</a> section) but tags will be a slightly different process. We want to
use the fact that
certain items have tags similar to the ones associated with a user's
purchases. This is not a collaborative filtering indicator
but rather a "content" or "metadata" type indicator since you are not using
other users' history, only the
@@ -609,7 +613,7 @@ content or metadata, not by which users
<p>We'll use <em>spark-rowsimilarity</em> because we are looking for similar
rows, which encode items in this case. As with the
-collaborative filtering indicator and cross-indicator we use the
--omitStrength option. The strengths created are
+collaborative filtering indicators we use the --omitStrength option. The
strengths created are
probabilistic log-likelihood ratios and so are used to filter unimportant
similarities. Once the filtering or downsampling
is finished we no longer need the strengths. We will get an indicator matrix
of the form:</p>
<div class="codehilite"><pre><span class="n">itemID</span><span
class="o">&lt;</span><span class="n">tab</span><span class="o">&gt;</span><span
class="n">list</span><span class="o">-</span><span class="n">of</span><span
class="o">-</span><span class="n">item</span> <span class="n">IDs</span>
@@ -655,7 +659,7 @@ on the popularity field. If we use the e
<p>This will return recommendations favoring ones that have the intrinsic
indicator "hot".</p>
<h2 id="notes">Notes</h2>
<ol>
-<li>Use as much user action history as you can gather. Choose a primary action
that is closest to what you want to recommend and the others will be used to
create cross-indicators. Using more data in this fashion will almost always
produce better recommendations.</li>
+<li>Use as much user action history as you can gather. Choose a primary action
that is closest to what you want to recommend and the others will be used to
create cross-cooccurrence indicators. Using more data in this fashion will
almost always produce better recommendations.</li>
<li>Content can be used where there is no recorded user behavior or when items
change too quickly to get much interaction history. Content indicators can be used alone or
mixed with other indicators.</li>
<li>Most search engines support "boost" factors so you can favor one or more
indicators. In the example query, if you want tags to only have a small effect
you could boost the CF indicators.</li>
<li>In the examples we have used space delimited strings for lists of IDs in
indicators and in queries. It may be better to use arrays of strings if your
storage system and search engine support them. For instance Solr allows
multi-valued fields, which correspond to arrays.</li>
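
The indicator rows shown in the Output hunk above could be loaded into a search engine collection after parsing them. The following is a minimal sketch, not part of Mahout: the function name parse_indicator_row is hypothetical, and the delimiters (tab after the row item ID, space between elements, colon before the strength) are assumptions read off the sample output in this diff. Rows written with --omitStrength are assumed to carry bare item IDs.

```python
from typing import List, Optional, Tuple


def parse_indicator_row(line: str) -> Tuple[str, List[Tuple[str, Optional[float]]]]:
    """Split one text row into (row item ID, [(similar item ID, strength), ...]).

    Assumed layout, taken from the sample output above:
        itemID<tab>itemA:strength itemB:strength ...
    If a token has no ':' (as assumed for --omitStrength), strength is None.
    """
    row_id, _, rest = line.rstrip("\n").partition("\t")
    pairs: List[Tuple[str, Optional[float]]] = []
    for token in rest.split(" "):
        if not token:
            continue  # skip empty tokens caused by repeated spaces
        item_id, sep, strength = token.rpartition(":")
        if sep:
            pairs.append((item_id, float(strength)))
        else:
            pairs.append((token, None))
    return row_id, pairs


# A row copied from the cross-similarity-matrix sample above.
sample = ("ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 "
          "ipad:0.6795961471815897 galaxy:0.6795961471815897")
row_id, similar = parse_indicator_row(sample)
```

Joining the item IDs in `similar` with spaces gives the kind of space-delimited indicator field described in the Notes; keeping them as a list suits engines with multi-valued fields.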