Author: buildbot
Date: Fri Jul 10 17:50:40 2015
New Revision: 957805
Log:
Staging update by buildbot for mahout
Modified:
websites/staging/mahout/trunk/content/ (props changed)
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Fri Jul 10 17:50:40 2015
@@ -1 +1 @@
-1689946
+1690298
Modified:
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
(original)
+++
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
Fri Jul 10 17:50:40 2015
@@ -263,16 +263,23 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="intro-to-cooccurrence-recommenders-with-spark">Intro to
Cooccurrence Recommenders with Spark</h1>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="intro-to-cooccurrence-recommenders-with-spark">Intro to Cooccurrence
Recommenders with Spark<a class="headerlink"
href="#intro-to-cooccurrence-recommenders-with-spark" title="Permanent
link">¶</a></h1>
<p>Mahout provides several important building blocks for creating
recommendations using Spark. <em>spark-itemsimilarity</em> can
be used to create "other people also liked these things" type recommendations
and paired with a search engine can
personalize recommendations for individual users. <em>spark-rowsimilarity</em>
can provide non-personalized content based
recommendations and when paired with a search engine can be used to
personalize content based recommendations.</p>
-<p><img alt="image"
src="http://s6.postimg.org/r0m8bpjw1/recommender_architecture.png" /></p>
-<p>This is a simplified Lambda architecture with Mahout's
<em>spark-itemsimilarity</em> playing the batch model building role and a
search engine playing the realtime serving role.</p>
-<p>You will create two collections, one for user history and one for item
"indicators". Indicators are user interactions that lead to the wished for
interaction. So for example if you wish a user to purchase something and you
collect all users purchase interactions <em>spark-itemsimilarity</em> will
create a purchase indicator from them. But you can also use other user
interactions in a cross-cooccurrence calculation, to create purchase
indicators. </p>
-<p>User history is used as a query on the item collection with its
cooccurrence and cross-cooccurrence indicators (there may be several
indicators). The primary interaction or action is picked to be the thing you
want to recommend, other actions are believed to be corelated but may not
indicate exactly the same user intent. For instance in an ecom recommender a
purchase is a very good primary action, but you may also know product
detail-views, or additions-to-wishlists. These can be considered secondary
actions which may all be used to calculate cross-cooccurrence indicators. The
user history that forms the recommendations query will contain recorded primary
and secondary actions all targetted towards the correct indicator fields.</p>
-<h2 id="references">References</h2>
+<h2 id="references">References<a class="headerlink" href="#references"
title="Permanent link">¶</a></h2>
<ol>
<li>A free ebook, which talks about the general idea: <a
href="https://www.mapr.com/practical-machine-learning">Practical Machine
Learning</a></li>
<li>A slide deck, which talks about mixing actions or other indicators: <a
href="http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/">Creating
a Unified Recommender</a></li>
@@ -281,12 +288,12 @@ and <a href="http://occamsmachete.com/m
<li>A post describing the loglikelihood ratio: <a
href="http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html">Surprise
and Coincidence</a>. LLR is used to reduce noise in the data while keeping the
calculations O(n) complexity.</li>
</ol>
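<p>For readers who want to see what the log-likelihood ratio computes, here is a minimal sketch in Python of the entropy-based formulation described in the post above. The function and parameter names are illustrative assumptions, not Mahout's API; the four arguments are the cells of a 2x2 contingency table of cooccurrence counts.</p>

```python
import math


def x_log_x(x):
    """Return x * ln(x), using the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a collection of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR score for a 2x2 contingency table of event counts.

    k11: both events occurred together
    k12, k21: one event occurred without the other
    k22: neither event occurred
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))
```

<p>A table whose rows and columns are independent scores close to zero, while strongly associated events score high. This is why LLR works well as a filter for discarding coincidental cooccurrences.</p>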
<p>Below are the command line jobs but the drivers and associated code can
also be customized and accessed from the Scala APIs.</p>
-<h2 id="1-spark-itemsimilarity">1. spark-itemsimilarity</h2>
+<h2 id="1-spark-itemsimilarity">1. spark-itemsimilarity<a class="headerlink"
href="#1-spark-itemsimilarity" title="Permanent link">¶</a></h2>
<p><em>spark-itemsimilarity</em> is the Spark counterpart of the Mahout
mapreduce job called <em>itemsimilarity</em>. It takes in elements of
interactions, which have userID, itemID, and optionally a value. It will
produce one or more indicator matrices created by comparing every user's
interactions with every other user. The indicator matrix is an item x item
matrix where the values are log-likelihood ratio strengths. For the legacy
mapreduce version, there were several possible similarity measures but these
are being deprecated in favor of LLR because in practice it performs the
best.</p>
<p>Mahout's mapreduce version of itemsimilarity takes a text file that is
expected to have user and item IDs that conform to
Mahout's ID requirements--they are non-negative integers that can be viewed as
row and column numbers in a matrix.</p>
<p><em>spark-itemsimilarity</em> also extends the notion of cooccurrence to
cross-cooccurrence, in other words the Spark version will
-account for multi-modal interactions and create cross-cooccurrence indicator
matrices allowing the use of much more data in
+account for multi-modal interactions and create indicator matrices allowing
the use of much more data in
creating recommendations or similar item lists. People try to do this by
mixing different actions and giving them weights.
For instance they might say an item-view is 0.2 of an item purchase. In
practice this is often not helpful. Spark-itemsimilarity's
cross-cooccurrence is a more principled way to handle this case. In effect it
scrubs secondary actions with the action you want
@@ -360,7 +367,7 @@ to recommend. </p>
<p>This looks daunting, but the options default to simple, fairly sane values that accept
exactly the same input as the legacy code, and the job is quite flexible. It allows the
user to point to a single text file, a directory full of files, or a tree of
directories to be traversed recursively. The files included can be specified
with either a regex-style pattern or filename. The schema for the file is
defined by column numbers, which map to the important bits of data including
IDs and values. The files can even contain filters, which allow unneeded rows
to be discarded or used for cross-cooccurrence calculations.</p>
<p>See ItemSimilarityDriver.scala in Mahout's spark module if you want to
customize the code. </p>
-<h3 id="defaults-in-the-spark-itemsimilarity-cli">Defaults in the
<em><strong>spark-itemsimilarity</strong></em> CLI</h3>
+<h3 id="defaults-in-the-spark-itemsimilarity-cli">Defaults in the
<em><strong>spark-itemsimilarity</strong></em> CLI<a class="headerlink"
href="#defaults-in-the-spark-itemsimilarity-cli" title="Permanent
link">¶</a></h3>
<p>If all defaults are used the input can be as simple as:</p>
<div class="codehilite"><pre><span class="n">userID1</span><span
class="p">,</span><span class="n">itemID1</span>
<span class="n">userID2</span><span class="p">,</span><span
class="n">itemID2</span>
@@ -378,7 +385,7 @@ to recommend. </p>
</pre></div>
-<h3 id="wzxhzdk18how-to-use-multiple-user-actionswzxhzdk19"><a
name="multiple-actions">How To Use Multiple User Actions</a></h3>
+<h3 id="how-to-use-multiple-user-actions"><a name="multiple-actions">How To
Use Multiple User Actions</a><a class="headerlink"
href="#how-to-use-multiple-user-actions" title="Permanent link">¶</a></h3>
<p>Often we record various actions the user takes for later analytics. These
can now be used to make recommendations.
The idea of a recommender is to recommend the action you want the user to
make. For an ecom app this might be
a purchase action. It is usually not a good idea to just treat other actions
the same as the action you want to recommend.
@@ -412,7 +419,7 @@ action log of the form:</p>
</pre></div>
-<h3 id="command-line">Command Line</h3>
+<h3 id="command-line">Command Line<a class="headerlink" href="#command-line"
title="Permanent link">¶</a></h3>
<p>Use the following options:</p>
<div class="codehilite"><pre><span class="n">bash</span>$ <span
class="n">mahout</span> <span class="n">spark</span><span
class="o">-</span><span class="n">itemsimilarity</span> <span class="o">\</span>
<span class="o">--</span><span class="n">input</span> <span
class="n">in</span><span class="o">-</span><span class="n">file</span> <span
class="o">\</span> # <span class="n">where</span> <span class="n">to</span>
<span class="n">look</span> <span class="k">for</span> <span
class="n">data</span>
@@ -426,7 +433,7 @@ action log of the form:</p>
</pre></div>
-<h3 id="output">Output</h3>
+<h3 id="output">Output<a class="headerlink" href="#output" title="Permanent
link">¶</a></h3>
<p>The output of the job will be the standard text version of two Mahout DRMs.
This is a case where we are calculating
cross-cooccurrence so a primary indicator matrix and cross-cooccurrence
indicator matrix will be created.</p>
<div class="codehilite"><pre><span class="n">out</span><span
class="o">-</span><span class="n">path</span>
@@ -435,7 +442,7 @@ cross-cooccurrence so a primary indicato
</pre></div>
-<p>The similarity-matrix will contain the lines:</p>
+<p>The indicator matrix will contain the lines:</p>
<div class="codehilite"><pre><span class="n">galaxy</span><span
class="o">\</span><span class="n">tnexus</span><span class="p">:</span>1<span
class="p">.</span>7260924347106847
<span class="n">ipad</span><span class="o">\</span><span
class="n">tiphone</span><span class="p">:</span>1<span
class="p">.</span>7260924347106847
<span class="n">nexus</span><span class="o">\</span><span
class="n">tgalaxy</span><span class="p">:</span>1<span
class="p">.</span>7260924347106847
@@ -444,7 +451,7 @@ cross-cooccurrence so a primary indicato
</pre></div>
-<p>The cross-similarity-matrix will contain:</p>
+<p>The cross-cooccurrence indicator matrix will contain:</p>
<div class="codehilite"><pre><span class="n">iphone</span><span
class="o">\</span><span class="n">tnexus</span><span class="p">:</span>1<span
class="p">.</span>7260924347106847 <span class="n">iphone</span><span
class="p">:</span>1<span class="p">.</span>7260924347106847 <span
class="n">ipad</span><span class="p">:</span>1<span
class="p">.</span>7260924347106847 <span class="n">galaxy</span><span
class="p">:</span>1<span class="p">.</span>7260924347106847
<span class="n">ipad</span><span class="o">\</span><span
class="n">tnexus</span><span class="p">:</span>0<span
class="p">.</span>6795961471815897 <span class="n">iphone</span><span
class="p">:</span>0<span class="p">.</span>6795961471815897 <span
class="n">ipad</span><span class="p">:</span>0<span
class="p">.</span>6795961471815897 <span class="n">galaxy</span><span
class="p">:</span>0<span class="p">.</span>6795961471815897
<span class="n">nexus</span><span class="o">\</span><span
class="n">tnexus</span><span class="p">:</span>0<span
class="p">.</span>6795961471815897 <span class="n">iphone</span><span
class="p">:</span>0<span class="p">.</span>6795961471815897 <span
class="n">ipad</span><span class="p">:</span>0<span
class="p">.</span>6795961471815897 <span class="n">galaxy</span><span
class="p">:</span>0<span class="p">.</span>6795961471815897
@@ -455,7 +462,7 @@ cross-cooccurrence so a primary indicato
<p><strong>Note:</strong> You can run this multiple times to use more than two
actions or you can use the underlying
SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any
number of cross-cooccurrence indicators.</p>
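<p>If you need to post-process the text output, for example to load it into a search engine, the lines shown in the Output section can be read with a few lines of code. The following is a minimal sketch in Python. It assumes the layout visible in the sample output (a tab after the row ID, spaces between elements, and a colon before each strength); the function name and parameters are hypothetical and not part of Mahout.</p>

```python
def parse_indicator_line(line, row_delim="\t", elem_delim=" ", strength_delim=":"):
    """Split one indicator line into (row_id, [(item_id, strength), ...]).

    The strength is None when the line carries no strength values,
    as happens when the job is run with --omitStrength.
    """
    row_id, _, rest = line.rstrip("\n").partition(row_delim)
    similar = []
    for element in rest.split(elem_delim):
        if not element:
            continue  # skip empty pieces from repeated delimiters
        if strength_delim in element:
            item_id, _, strength = element.rpartition(strength_delim)
            similar.append((item_id, float(strength)))
        else:
            similar.append((element, None))
    return row_id, similar


if __name__ == "__main__":
    print(parse_indicator_line("iphone\tnexus:1.7260924347106847 ipad:0.6795961471815897"))
```

<p>Adjust the delimiter arguments if you changed the output delimiters on the command line.</p>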
-<h3 id="log-file-input">Log File Input</h3>
+<h3 id="log-file-input">Log File Input<a class="headerlink"
href="#log-file-input" title="Permanent link">¶</a></h3>
<p>A common method of storing data is in log files. If they are written using
some delimiter they can be consumed directly by spark-itemsimilarity. For
instance input of the form:</p>
<div class="codehilite"><pre>2014<span class="o">-</span>06<span
class="o">-</span>23 14<span class="p">:</span>46<span
class="p">:</span>53<span class="p">.</span>115<span class="o">\</span><span
class="n">tu1</span><span class="o">\</span><span
class="n">tpurchase</span><span class="o">\</span><span
class="n">trandom</span> <span class="n">text</span><span
class="o">\</span><span class="n">tiphone</span>
2014<span class="o">-</span>06<span class="o">-</span>23 14<span
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span
class="o">\</span><span class="n">tu1</span><span class="o">\</span><span
class="n">tpurchase</span><span class="o">\</span><span
class="n">trandom</span> <span class="n">text</span><span
class="o">\</span><span class="n">tipad</span>
@@ -494,7 +501,7 @@ SimilarityAnalysis.cooccurrence API, whi
</pre></div>
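<p>To check which interactions a delimited log will yield before running the job, the column and filter logic can be mimicked in a few lines. This is a sketch in Python under stated assumptions, not the driver's implementation: it assumes tab-delimited lines laid out like the sample in the Log File Input section (timestamp, user ID, action, free text, item ID), and the function name, column defaults, and action names are illustrative.</p>

```python
def extract_interactions(lines, delim="\t", user_col=1, action_col=2, item_col=4,
                         keep_actions=("purchase", "view")):
    """Yield (user_id, action, item_id) for each log line with a wanted action."""
    needed = max(user_col, action_col, item_col)
    for line in lines:
        fields = line.rstrip("\n").split(delim)
        if len(fields) <= needed:
            continue  # malformed or short line: not enough columns
        action = fields[action_col]
        if action in keep_actions:
            yield fields[user_col], action, fields[item_col]


if __name__ == "__main__":
    sample = ["2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone"]
    for interaction in extract_interactions(sample):
        print(interaction)
```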
-<h2 id="2-spark-rowsimilarity">2. spark-rowsimilarity</h2>
+<h2 id="2-spark-rowsimilarity">2. spark-rowsimilarity<a class="headerlink"
href="#2-spark-rowsimilarity" title="Permanent link">¶</a></h2>
<p><em>spark-rowsimilarity</em> is the companion to
<em>spark-itemsimilarity</em>. The primary difference is that it takes a text
file version of
a matrix of sparse vectors with optional application specific IDs and it finds
similar rows rather than items (columns). Its use is
not limited to collaborative filtering. The input is in text-delimited form
where there are three delimiters used. By
@@ -554,25 +561,25 @@ by a list of the most similar rows.</p>
<p>See RowSimilarityDriver.scala in Mahout's spark module if you want to
customize the code. </p>
-<h1 id="3-using-spark-rowsimilarity-with-text-data">3. Using
<em>spark-rowsimilarity</em> with Text Data</h1>
+<h1 id="3-using-spark-rowsimilarity-with-text-data">3. Using
<em>spark-rowsimilarity</em> with Text Data<a class="headerlink"
href="#3-using-spark-rowsimilarity-with-text-data" title="Permanent
link">¶</a></h1>
<p>Another use case for <em>spark-rowsimilarity</em> is in finding similar
textual content. For instance, given the tags associated with
a blog post,
it can find which other posts have similar tags. In this case the columns are tags and
the rows are posts. Since LLR is
the only similarity method supported this is not the optimal way to determine
general "bag-of-words" document similarity.
LLR is used more as a quality filter than as a similarity measure. However
<em>spark-rowsimilarity</em> will produce
lists of similar docs for every doc if input is docs with lists of terms. The
Apache <a href="http://lucene.apache.org">Lucene</a> project provides several
methods of <a
href="http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description">analyzing
and tokenizing</a> documents.</p>
-<h1 id="wzxhzdk244-creating-a-multimodal-recommenderwzxhzdk25"><a
name="unified-recommender">4. Creating a Multimodal Recommender</a></h1>
-<p>Using the output of <em>spark-itemsimilarity</em> and
<em>spark-rowsimilarity</em> you can build a miltimodal cooccurrence and
content based
+<h1 id="4-creating-a-unified-recommender"><a name="unified-recommender">4.
Creating a Unified Recommender</a><a class="headerlink"
href="#4-creating-a-unified-recommender" title="Permanent link">¶</a></h1>
+<p>Using the output of <em>spark-itemsimilarity</em> and
<em>spark-rowsimilarity</em> you can build a unified cooccurrence and content
based
recommender that can be used in both or either mode depending on indicators
available and the history available at
-runtime for a user. Some slide describing this method can be found <a
href="http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/">here</a></p>
-<h2 id="requirements">Requirements</h2>
+runtime for a user.</p>
+<h2 id="requirements">Requirements<a class="headerlink" href="#requirements"
title="Permanent link">¶</a></h2>
<ol>
-<li>Mahout SNAPSHOT-1.0 or later</li>
+<li>Mahout 0.10.0 or later</li>
<li>Hadoop</li>
<li>Spark, the correct version for your version of Mahout and Hadoop</li>
<li>A search engine like Solr or Elasticsearch</li>
</ol>
-<h2 id="indicators">Indicators</h2>
+<h2 id="indicators">Indicators<a class="headerlink" href="#indicators"
title="Permanent link">¶</a></h2>
<p>Indicators come in 3 types:</p>
<ol>
<li><strong>Cooccurrence</strong>: calculated with
<em>spark-itemsimilarity</em> from user actions</li>
@@ -588,7 +595,7 @@ while working well for items with lots o
indicators developers can create a solution for the "cold-start" problem that
gracefully improves with more user history
and as items have more interactions. It is also possible to create a
completely content-based recommender that personalizes
recommendations.</p>
-<h2 id="example-with-3-indicators">Example with 3 Indicators</h2>
+<h2 id="example-with-3-indicators">Example with 3 Indicators<a
class="headerlink" href="#example-with-3-indicators" title="Permanent
link">¶</a></h2>
<p>You will need to decide how you store user action data so they can be
processed by the item and row similarity jobs and
this is most easily done by using text files as described above. The data that
is processed by these jobs is considered the
training data. You will need some amount of user history in your recs query.
It is typical to use the most recent user history
@@ -599,19 +606,18 @@ descriptive metadata). </p>
<p>We will need to create 1 cooccurrence indicator from the primary action
(purchase), 1 cross-action cooccurrence indicator
from the secondary action (view)
and 1 content indicator (tags). We'll have to run
<em>spark-itemsimilarity</em> once and <em>spark-rowsimilarity</em> once.</p>
-<p>We have described how to create the collaborative filtering indicators for
purchase and view (the <a href="#multiple-actions">How to use Multiple User
+<p>We have described how to create the collaborative filtering indicator and
cross-cooccurrence indicator for purchase and view (the <a
href="#multiple-actions">How to use Multiple User
Actions</a> section) but tags will be a slightly different process. We want to
use the fact that
certain items have tags similar to the ones associated with a user's
purchases. This is not a collaborative filtering indicator
but rather a "content" or "metadata" type indicator since you are not using
other users' history, only the
individual that you are making recs for. This means that this method will make
recommendations for items that have
no collaborative filtering data, as happens with new items in a catalog. New
items may have tags assigned but no one
has purchased or viewed them yet. In the final query we will mix all 3
indicators.</p>
-<h2 id="content-indicator">Content Indicator</h2>
+<h2 id="content-indicator">Content Indicator<a class="headerlink"
href="#content-indicator" title="Permanent link">¶</a></h2>
<p>To create a content-indicator we'll make use of the fact that the user has
purchased items with certain tags. We want to find
items with the most similar tags. Notice that other users' behavior is not
considered--only other items' tags. This defines a
content or metadata indicator. Such indicators are used when you want to find items that
are similar to other items by using their
content or metadata, not by which users interacted with them.</p>
-<p><strong>Note</strong>: It may be advisable to treat tags as
cross-cooccurrence indicators but for the sake of an example they are treated
here as content only.</p>
<p>For this we need input of the form:</p>
<div class="codehilite"><pre><span class="n">itemID</span><span
class="o"><</span><span class="n">tab</span><span class="o">></span><span
class="n">list</span><span class="o">-</span><span class="n">of</span><span
class="o">-</span><span class="n">tags</span>
<span class="p">...</span>
@@ -626,7 +632,7 @@ content or metadata, not by which users
<p>We'll use <em>spark-rowsimilarity</em> because we are looking for similar
rows, which encode items in this case. As with the
-collaborative filtering indicators we use the --omitStrength option. The
strengths created are
+collaborative filtering indicator and cross-cooccurrence indicator we use the
--omitStrength option. The strengths created are
probabilistic log-likelihood ratios and so are used to filter unimportant
similarities. Once the filtering or downsampling
is finished we no longer need the strengths. We will get an indicator matrix
of the form:</p>
<div class="codehilite"><pre><span class="n">itemID</span><span
class="o"><</span><span class="n">tab</span><span class="o">></span><span
class="n">list</span><span class="o">-</span><span class="n">of</span><span
class="o">-</span><span class="n">item</span> <span class="n">IDs</span>
@@ -642,8 +648,9 @@ is finished we no longer need the streng
<p>We now have three indicators, two collaborative filtering type and one
content type.</p>
-<h2 id="multimodal-recommender-query">Multimodal Recommender Query</h2>
-<p>The actual form of the query for recommendations will vary depending on
your search engine but the intent is the same. For a given user, map their
history of an action or content to the correct indicator field and perform an
OR'd query. </p>
+<h2 id="unified-recommender-query">Unified Recommender Query<a
class="headerlink" href="#unified-recommender-query" title="Permanent
link">¶</a></h2>
+<p>The actual form of the query for recommendations will vary depending on
your search engine but the intent is the same.
+For a given user, map their history of an action or content to the correct
indicator field and perform an OR'd query. </p>
<p>We have 3 indicators. These are indexed by the search engine into 3 fields,
which we'll call "purchase", "view", and "tags".
We take the user's history that corresponds to each indicator and create a
query of the form:</p>
<div class="codehilite"><pre><span class="n">Query</span><span
class="o">:</span>
@@ -669,7 +676,7 @@ on the popularity field. If we use the e
<p>This will return recommendations favoring ones that have the intrinsic
indicator "hot".</p>
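<p>Mapping user history onto indicator fields and OR-ing the results can be sketched as follows. This Python example builds a Lucene-style query string as an assumption for illustration; the exact syntax, escaping, and boosting depend on your search engine, and the function name and item IDs are hypothetical.</p>

```python
def build_recs_query(history_by_field):
    """Build an OR'd query string from a mapping of field name to item IDs.

    Fields with no history are skipped, so the same code serves users
    with only some kinds of recorded actions.
    """
    clauses = []
    for field, item_ids in history_by_field.items():
        if item_ids:
            clauses.append("{}:({})".format(field, " ".join(item_ids)))
    return " OR ".join(clauses)


if __name__ == "__main__":
    history = {
        "purchase": ["iphone", "ipad"],   # primary action history
        "view": ["nexus", "galaxy"],      # secondary action history
        "tags": ["mobile", "tablet"],     # tags of purchased items
    }
    print(build_recs_query(history))
```

<p>A production version would also need to escape special characters in IDs and may weight the fields differently.</p>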
-<h2 id="notes">Notes</h2>
+<h2 id="notes">Notes<a class="headerlink" href="#notes" title="Permanent
link">¶</a></h2>
<ol>
<li>Use as much user action history as you can gather. Choose a primary action
that is closest to what you want to recommend and the others will be used to
create cross-cooccurrence indicators. Using more data in this fashion will
almost always produce better recommendations.</li>
<li>Content indicators can be used where there is no recorded user behavior or when items
change too quickly to get much interaction history. They can be used alone or
mixed with other indicators.</li>