Added: 
websites/staging/mahout/trunk/content/users/mapreduce/recommender/intro-cooccurrence-spark.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/mapreduce/recommender/intro-cooccurrence-spark.html
 (added)
+++ 
websites/staging/mahout/trunk/content/users/mapreduce/recommender/intro-cooccurrence-spark.html
 Thu Mar 19 21:21:45 2015
@@ -0,0 +1,690 @@
+<!DOCTYPE html>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta 
http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <title>Apache Mahout: Scalable machine learning and data mining</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+  <meta name="Distribution" content="Global">
+  <meta name="Robots" content="index,follow">
+  <meta name="keywords" content="apache, apache hadoop, apache lucene,
+        business data mining, cluster analysis,
+        collaborative filtering, data extraction, data filtering, data 
framework, data integration,
+        data matching, data mining, data mining algorithms, data mining 
analysis, data mining data,
+        data mining introduction, data mining software,
+        data mining techniques, data representation, data set, datamining,
+        feature extraction, fuzzy k means, genetic algorithm, hadoop,
+        hierarchical clustering, high dimensional, introduction to data 
mining, kmeans,
+        knowledge discovery, learning approach, learning approaches, learning 
methods,
+        learning techniques, lucene, machine learning, machine translation, 
mahout apache,
+        mahout taste, map reduce hadoop, mining data, mining methods, naive 
bayes,
+        natural language processing,
+        supervised, text mining, time series data, unsupervised, web data 
mining">
+  <link rel="shortcut icon" type="image/x-icon" 
href="http://mahout.apache.org/images/favicon.ico">
+  <script type="text/javascript" src="/js/prototype.js"></script>
+  <script type="text/javascript" src="/js/effects.js"></script>
+  <script type="text/javascript" src="/js/search.js"></script>
+  <script type="text/javascript" src="/js/slides.js"></script>
+
+  <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen">
+  <link href="/css/bootstrap-responsive.css" rel="stylesheet">
+  <link rel="stylesheet" href="/css/global.css" type="text/css">
+
+  <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown -->
+  <script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    tex2jax: {
+      skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
+    }
+  });
+  MathJax.Hub.Queue(function() {
+    var all = MathJax.Hub.getAllJax(), i;
+    for(i = 0; i < all.length; i += 1) {
+      all[i].SourceElement().parentNode.className += ' has-jax';
+    }
+  });
+  </script>
+  <script type="text/javascript">
+    var mathjax = document.createElement('script'); 
+    mathjax.type = 'text/javascript'; 
+    mathjax.async = true;
+
+    mathjax.src = ('https:' == document.location.protocol) ?
+        
'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'
 : 
+        
'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
+       
+         var s = document.getElementsByTagName('script')[0]; 
+    s.parentNode.insertBefore(mathjax, s);
+  </script>
+</head>
+
+<body id="home" data-twttr-rendered="true">
+  <div id="wrap">
+   <div id="header">
+    <div id="logo"><a href="/overview.html"></a></div>
+  <div id="search">
+    <form id="search-form" action="http://www.google.com/search" method="get" 
class="navbar-search pull-right">    
+      <input value="http://mahout.apache.org" name="sitesearch" type="hidden">
+      <input class="search-query" name="q" id="query" type="text">
+      <input id="submission" type="image" src="/images/mahout-lupe.png" 
alt="Search" />
+    </form>
+  </div>
+
+    <div class="navbar navbar-inverse" 
style="position:absolute;top:133px;padding-right:0px;padding-left:0px;">
+      <div class="navbar-inner" style="border: none; background: #999; border: 
none; border-radius: 0px;">
+        <div class="container">
+          <button type="button" class="btn btn-navbar" data-toggle="collapse" 
data-target=".nav-collapse">
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+          </button>
+          <!-- <a class="brand" href="#">Apache Community Development 
Project</a> -->
+          <div class="nav-collapse collapse">
+            <ul class="nav">
+              <li><a href="/">Home</a></li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" 
data-toggle="dropdown">General<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/general/downloads.html">Downloads</a>
+                  <li><a href="/general/who-we-are.html">Who we are</a>
+                  <li><a 
href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a>
+                  <li><a href="/general/release-notes.html">Release Notes</a> 
+                  <li><a href="/general/books-tutorials-and-talks.html">Books, 
Tutorials, Talks</a></li>
+                  <li><a href="/general/powered-by-mahout.html">Powered By 
Mahout</a>
+                  <li><a 
href="/general/professional-support.html">Professional Support</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Resources</li>
+                  <li><a href="/general/reference-reading.html">Reference 
Reading</a>
+                  <li><a href="/general/faq.html">FAQ</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Legal</li>
+                  <li><a 
href="http://www.apache.org/licenses/">License</a></li>
+                  <li><a 
href="http://www.apache.org/security/">Security</a></li>
+                  <li><a href="/general/privacy-policy.html">Privacy Policy</a>
+                </ul>
+              </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" 
data-toggle="dropdown">Developers<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/developers/developer-resources.html">Developer 
resources</a></li>
+                  <li><a href="/developers/version-control.html">Version 
control</a></li>
+                  <li><a href="/developers/buildingmahout.html">Build from 
source</a></li>
+                  <li><a href="/developers/issue-tracker.html">Issue 
tracker</a></li>
+                  <li><a href="https://builds.apache.org/job/Mahout-Quality/" 
target="_blank">Code quality reports</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Contributions</li>
+                  <li><a href="/developers/how-to-contribute.html">How to 
contribute</a></li>
+                  <li><a href="/developers/how-to-become-a-committer.html">How 
to become a committer</a></li>
+                  <li><a href="/developers/gsoc.html">GSoC</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">For committers</li>
+                  <li><a href="/developers/how-to-update-the-website.html">How 
to update the website</a></li>
+                  <li><a href="/developers/patch-check-list.html">Patch check 
list</a></li>
+                  <li><a href="/developers/github.html">Handling Github 
PRs</a></li>
+                  <li><a href="/developers/how-to-release.html">How to 
release</a></li>
+                  <li><a href="/developers/thirdparty-dependencies.html">Third 
party dependencies</a></li>
+                </ul>
+               </li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" 
data-toggle="dropdown">Basics<b class="caret"></b></a>
+                 <ul class="dropdown-menu">
+                  <li><a href="/users/basics/algorithms.html">List of 
algorithms</a>
+                  <li><a href="/users/basics/quickstart.html">Quickstart</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Working with text</li>
+                  <li><a 
href="/users/basics/creating-vectors-from-text.html">Creating vectors from 
text</a>
+                  <li><a 
href="/users/basics/collocations.html">Collocations</a>
+                  <li class="divider"></li>
+                  <li class="nav-header">Dimensionality reduction</li>
+                  <li><a 
href="/users/dim-reduction/dimensional-reduction.html">Singular Value 
Decomposition</a></li>
+                  <li><a href="/users/dim-reduction/ssvd.html">Stochastic 
SVD</a></li>
+                  <li class="divider"></li>
+                  <li class="nav-header">Topic Models</li>      
+                  <li><a 
href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet 
Allocation</a></li>
+                </ul>
+                 </li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" 
data-toggle="dropdown">Spark<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a href="/users/sparkbindings/home.html">Scala &amp; 
Spark Bindings Overview</a></li>
+                  <li><a 
href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark 
Shell</a></li>
+                             <li class="divider"></li>
+                  <li><a href="/users/sparkbindings/faq.html">FAQ</a></li>
+                </ul>
+               </li>
+              <li class="dropdown"> <a href="#" class="dropdown-toggle" 
data-toggle="dropdown">Classification<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                  <li><a 
href="/users/mapreduce/classification/bayesian.html">Naive Bayes</a></li>
+                  <li><a 
href="/users/mapreduce/classification/hidden-markov-models.html">Hidden Markov 
Models</a></li>
+                  <li><a 
href="/users/mapreduce/classification/logistic-regression.html">Logistic 
Regression</a></li>
+                  <li><a 
href="/users/mapreduce/classification/partial-implementation.html">Random 
Forest</a></li>
+
+                  <li class="divider"></li>
+                  <li class="nav-header">Examples</li>
+                  <li><a 
href="/users/mapreduce/classification/breiman-example.html">Breiman 
example</a></li>
+                  <li><a 
href="/users/mapreduce/classification/twenty-newsgroups.html">20 newsgroups 
example</a></li>
+                </ul></li>
+               <li class="dropdown"> <a href="#" class="dropdown-toggle" 
data-toggle="dropdown">Clustering<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a 
href="/users/mapreduce/clustering/k-means-clustering.html">k-Means</a></li>
+                <li><a 
href="/users/mapreduce/clustering/canopy-clustering.html">Canopy</a></li>
+                <li><a 
href="/users/mapreduce/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li>
+                <li><a 
href="/users/mapreduce/clustering/streaming-k-means.html">Streaming 
KMeans</a></li>
+                <li><a 
href="/users/mapreduce/clustering/spectral-clustering.html">Spectral 
Clustering</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Commandline usage</li>
+                <li><a 
href="/users/mapreduce/clustering/k-means-commandline.html">Options for 
k-Means</a></li>
+                <li><a 
href="/users/mapreduce/clustering/canopy-commandline.html">Options for 
Canopy</a></li>
+                <li><a 
href="/users/mapreduce/clustering/fuzzy-k-means-commandline.html">Options for 
Fuzzy k-Means</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Examples</li>
+                <li><a 
href="/users/mapreduce/clustering/clustering-of-synthetic-control-data.html">Synthetic
 data</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Post processing</li>
+                <li><a 
href="/users/mapreduce/clustering/cluster-dumper.html">Cluster Dumper 
tool</a></li>
+                <li><a 
href="/users/mapreduce/clustering/visualizing-sample-clusters.html">Cluster 
visualisation</a></li>
+                </ul></li>
+                <li class="dropdown"> <a href="#" class="dropdown-toggle" 
data-toggle="dropdown">Recommendations<b class="caret"></b></a>
+                <ul class="dropdown-menu">
+                <li><a 
href="/users/mapreduce/recommender/quickstart.html">Quickstart</a></li>
+                <li><a 
href="/users/mapreduce/recommender/recommender-first-timer-faq.html">First 
Timer FAQ</a></li>
+                <li><a 
href="/users/mapreduce/recommender/userbased-5-minutes.html">A user-based 
recommender <br/>in 5 minutes</a></li>
+               <li><a 
href="/users/mapreduce/recommender/matrix-factorization.html">Matrix 
factorization-based<br/> recommenders</a></li>
+                <li><a 
href="/users/mapreduce/recommender/recommender-documentation.html">Overview</a></li>
+                <li class="divider"></li>
+                <li class="nav-header">Hadoop</li>
+                <li><a 
href="/users/mapreduce/recommender/intro-itembased-hadoop.html">Intro to 
item-based recommendations<br/> with Hadoop</a></li>
+                <li><a 
href="/users/mapreduce/recommender/intro-als-hadoop.html">Intro to ALS 
recommendations<br/> with Hadoop</a></li>
+                <li class="nav-header">Spark</li>
+                <li><a 
href="/users/mapreduce/recommender/intro-cooccurrence-spark.html">Intro to 
cooccurrence-based<br/> recommendations with Spark</a></li>
+              </ul>
+            </li>
+           </ul>
+          </div><!--/.nav-collapse -->
+        </div>
+      </div>
+    </div>
+
+</div>
+
+ <div id="sidebar">
+  <div id="sidebar-wrap">
+    <h2>Twitter</h2>
+       <ul class="sidemenu">
+               <li>
+<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" 
data-widget-id="422861673444028416">Tweets by @ApacheMahout</a>
+<script>!function(d,s,id){var 
js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script>
+</li>
+       </ul>
+    <h2>Apache Software Foundation</h2>
+    <ul class="sidemenu">
+      <li><a href="http://www.apache.org/foundation/how-it-works.html">How the 
ASF works</a></li>
+      <li><a href="http://www.apache.org/foundation/getinvolved.html">Get 
Involved</a></li>
+      <li><a href="http://www.apache.org/dev/">Developer Resources</a></li>
+      <li><a 
href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
+      <li><a 
href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+    </ul>
+    <h2>Related Projects</h2>
+    <ul class="sidemenu">
+      <li><a href="http://lucene.apache.org/">Lucene</a></li>
+      <li><a href="http://hadoop.apache.org/">Hadoop</a></li>
+    </ul>
+  </div>
+</div>
+
+  <div id="content-wrap" class="clearfix">
+   <div id="main">
+    <h1 id="intro-to-cooccurrence-recommenders-with-spark">Intro to 
Cooccurrence Recommenders with Spark</h1>
+<p>Mahout provides several important building blocks for creating 
recommendations using Spark. <em>spark-itemsimilarity</em> can 
+be used to create "other people also liked these things" type recommendations 
and, when paired with a search engine, can 
+personalize recommendations for individual users. <em>spark-rowsimilarity</em> 
can provide non-personalized content-based 
+recommendations and, when paired with a search engine, can be used to 
personalize content-based recommendations.</p>
+<p><img alt="image" 
src="http://s6.postimg.org/r0m8bpjw1/recommender_architecture.png" /></p>
+<p>This is a simplified Lambda architecture with Mahout's 
<em>spark-itemsimilarity</em> playing the batch model building role and a 
search engine playing the realtime serving role.</p>
+<p>You will create two collections, one for user history and one for item 
"indicators". Indicators are user interactions that lead to the desired 
interaction. For example, if you want a user to purchase something and you 
collect all users' purchase interactions, <em>spark-itemsimilarity</em> will 
create a purchase indicator from them. You can also use other user 
interactions in a cross-cooccurrence calculation to create purchase 
indicators.</p>
+<p>User history is used as a query on the item collection with its 
cooccurrence and cross-cooccurrence indicators (there may be several 
indicators). The primary interaction or action is chosen to be the thing you 
want to recommend; other actions are believed to be correlated but may not 
indicate exactly the same user intent. For instance, in an e-commerce 
recommender a purchase is a very good primary action, but you may also know 
product detail-views or additions-to-wishlists. These can be considered 
secondary actions, which may all be used to calculate cross-cooccurrence 
indicators. The user history that forms the recommendations query will contain 
recorded primary and secondary actions, each targeted at the corresponding 
indicator field.</p>
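+<p>As a minimal sketch of the idea above, the snippet below groups one user's 
recorded actions by the indicator field each action should be matched against. 
The action names, the field names in <code>ACTION_TO_FIELD</code>, and the 
<code>build_query</code> helper are all hypothetical and chosen only for 
illustration; the real field names depend on how you index the indicators in 
your search engine.</p>

```python
from collections import defaultdict

# Hypothetical mapping from a recorded action type to the indicator field
# it is matched against in the item collection (names are assumptions).
ACTION_TO_FIELD = {
    "purchase": "purchase_indicators",      # primary action
    "detail-view": "view_indicators",       # secondary action
    "wishlist-add": "wishlist_indicators",  # secondary action
}


def build_query(history):
    """Group a user's (action, item_id) history by target indicator field."""
    query = defaultdict(list)
    for action, item_id in history:
        field = ACTION_TO_FIELD.get(action)
        if field is not None:  # skip actions that have no indicator field
            query[field].append(item_id)
    return dict(query)


history = [
    ("purchase", "iphone"),
    ("detail-view", "ipad"),
    ("detail-view", "nexus"),
    ("search", "galaxy"),  # no indicator field, so it is ignored
]
query = build_query(history)
print(query)
```

+<p>Each key of the resulting dictionary would become one clause of the search 
query, so purchases are only compared with purchase indicators and views only 
with view indicators.</p>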
+<h2 id="references">References</h2>
+<ol>
+<li>A free ebook, which talks about the general idea: <a 
href="https://www.mapr.com/practical-machine-learning">Practical Machine 
Learning</a></li>
+<li>A slide deck, which talks about mixing actions or other indicators: <a 
href="http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/">Creating
 a Unified Recommender</a></li>
+<li>Two blog posts: <a 
href="http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/">What's
 New in Recommenders: part #1</a>
+and <a 
href="http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/">What's
 New in Recommenders: part #2</a></li>
+<li>A post describing the log-likelihood ratio: <a 
href="http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html">Surprise
 and Coincidence</a>. LLR is used to reduce noise in the data while keeping the 
calculations at O(n) complexity.</li>
+</ol>
+<p>The command line jobs are described below, but the drivers and associated 
code can also be customized and accessed from the Scala APIs.</p>
+<h2 id="1-spark-itemsimilarity">1. spark-itemsimilarity</h2>
+<p><em>spark-itemsimilarity</em> is the Spark counterpart of the Mahout 
mapreduce job called <em>itemsimilarity</em>. It takes in elements of 
interactions, which have a userID, an itemID, and optionally a value. It 
produces one or more indicator matrices created by comparing every user's 
interactions with those of every other user. The indicator matrix is an item x 
item matrix where the values are log-likelihood ratio strengths. The legacy 
mapreduce version offered several possible similarity measures, but these 
are being deprecated in favor of LLR because in practice it performs the 
best.</p>
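+<p>To make the "log-likelihood ratio strength" concrete, here is a small 
sketch of the LLR score for a single pair of items, computed from a 2x2 table 
of counts using the entropy formulation. The function names and the example 
counts are assumptions made for illustration; this is not the code that 
<em>spark-itemsimilarity</em> runs, which also handles the distributed 
bookkeeping.</p>

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: users who interacted with both item A and item B
    k12: users who interacted with A but not B
    k21: users who interacted with B but not A
    k22: users who interacted with neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb tiny negative values from rounding.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


independent = llr(10, 10, 10, 10)      # no association between A and B
associated = llr(100, 5, 5, 1000)      # A and B mostly occur together
print(independent, associated)
```

+<p>A pair of items whose cooccurrence is no more frequent than chance scores 
close to zero, while a pair that cooccurs far more often than expected gets a 
large score, which is why LLR is useful for filtering noisy cooccurrences.</p>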
+<p>Mahout's mapreduce version of itemsimilarity takes a text file that is 
expected to have user and item IDs that conform to 
+Mahout's ID requirements--they are non-negative integers that can be viewed as 
row and column numbers in a matrix.</p>
+<p><em>spark-itemsimilarity</em> also extends the notion of cooccurrence to 
cross-cooccurrence. In other words, the Spark version will 
+account for multi-modal interactions and create cross-cooccurrence indicator 
matrices, allowing the use of much more data in 
+creating recommendations or similar item lists. People often try to do this by 
mixing different actions and giving them weights. 
+For instance, they might say an item-view is 0.2 of an item purchase. In 
practice this is often not helpful. <em>spark-itemsimilarity</em>'s 
+cross-cooccurrence is a more principled way to handle this case. In effect it 
scrubs secondary actions with the action you want 
+to recommend.</p>
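+<p>The difference between cooccurrence and cross-cooccurrence can be sketched 
with plain counts. Assuming binary user-by-item matrices A (primary action) 
and B (secondary action), cooccurrence corresponds to counting pairs within A 
and cross-cooccurrence to counting pairs between A and B. The data and the 
<code>cross_cooccurrence</code> helper below are hypothetical, and the sketch 
stops at raw counts; it does not apply the LLR scoring or the per-user and 
per-item limits that the real job uses.</p>

```python
from collections import defaultdict
from itertools import product


def cross_cooccurrence(primary, secondary):
    """Count users who did the primary action on item a AND the
    secondary action on item b, for every (a, b) pair.

    Both arguments map a user ID to the set of item IDs the user
    interacted with. Passing the same mapping twice yields plain
    cooccurrence counts (including the diagonal pairs (a, a)).
    """
    counts = defaultdict(int)
    for user, primary_items in primary.items():
        secondary_items = secondary.get(user, ())
        for a, b in product(primary_items, secondary_items):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical interaction data: purchases are the primary action,
# detail-views are the secondary action.
purchases = {"u1": {"iphone"}, "u2": {"iphone", "ipad"}, "u3": {"nexus"}}
views = {"u1": {"iphone", "ipad"}, "u2": {"ipad"}, "u3": {"nexus", "galaxy"}}

cooccurrence = cross_cooccurrence(purchases, purchases)
cross = cross_cooccurrence(purchases, views)
print(cooccurrence[("iphone", "ipad")], cross[("iphone", "ipad")])
```

+<p>Note that the rows of the cross-cooccurrence result are always items from 
the primary action, so a view only contributes when it is tied to a user who 
also made a purchase. No hand-tuned weight such as 0.2 is involved.</p>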
+<div class="codehilite"><pre><span class="n">spark</span><span 
class="o">-</span><span class="n">itemsimilarity</span> <span 
class="n">Mahout</span> 1<span class="p">.</span>0
+<span class="n">Usage</span><span class="p">:</span> <span 
class="n">spark</span><span class="o">-</span><span 
class="n">itemsimilarity</span> <span class="p">[</span><span 
class="n">options</span><span class="p">]</span>
+
+<span class="n">Input</span><span class="p">,</span> <span 
class="n">output</span> <span class="n">options</span>
+  <span class="o">-</span><span class="nb">i</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">input</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Input</span> <span class="n">path</span><span 
class="p">,</span> <span class="n">may</span> <span class="n">be</span> <span 
class="n">a</span> <span class="n">filename</span><span class="p">,</span> 
<span class="n">directory</span> <span class="n">name</span><span 
class="p">,</span> <span class="n">or</span> <span class="n">comma</span> <span 
class="n">delimited</span> <span class="n">list</span> <span 
class="n">of</span> <span class="n">HDFS</span> <span 
class="n">supported</span> <span class="n">URIs</span> <span 
class="p">(</span><span class="n">required</span><span class="p">)</span>
+  <span class="o">-</span><span class="n">i2</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">input2</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Secondary</span> <span class="n">input</span> <span 
class="n">path</span> <span class="k">for</span> <span 
class="nb">cross</span><span class="o">-</span><span 
class="n">similarity</span> <span class="n">calculation</span><span 
class="p">,</span> <span class="n">same</span> <span 
class="n">restrictions</span> <span class="n">as</span> &quot;<span 
class="o">--</span><span class="n">input</span>&quot; <span 
class="p">(</span><span class="n">optional</span><span class="p">).</span> 
<span class="n">Default</span><span class="p">:</span> <span 
class="n">empty</span><span class="p">.</span>
+  <span class="o">-</span><span class="n">o</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">output</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Path</span> <span class="k">for</span> <span 
class="n">output</span><span class="p">,</span> <span class="n">any</span> 
<span class="n">local</span> <span class="n">or</span> <span 
class="n">HDFS</span> <span class="n">supported</span> <span 
class="n">URI</span> <span class="p">(</span><span 
class="n">required</span><span class="p">)</span>
+
+<span class="n">Algorithm</span> <span class="n">control</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">mppu</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">maxPrefs</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Max</span> <span class="n">number</span> <span 
class="n">of</span> <span class="n">preferences</span> <span 
class="n">to</span> <span class="n">consider</span> <span class="n">per</span> 
<span class="n">user</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> 500
+  <span class="o">-</span><span class="n">m</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">maxSimilaritiesPerItem</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Limit</span> <span class="n">the</span> <span 
class="n">number</span> <span class="n">of</span> <span 
class="n">similarities</span> <span class="n">per</span> <span 
class="n">item</span> <span class="n">to</span> <span class="n">this</span> 
<span class="n">number</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> 100
+
+<span class="n">Note</span><span class="p">:</span> <span 
class="n">Only</span> <span class="n">the</span> <span class="n">Log</span> 
<span class="n">Likelihood</span> <span class="n">Ratio</span> <span 
class="p">(</span><span class="n">LLR</span><span class="p">)</span> <span 
class="n">is</span> <span class="n">supported</span> <span class="n">as</span> 
<span class="n">a</span> <span class="n">similarity</span> <span 
class="n">measure</span><span class="p">.</span>
+
+<span class="n">Input</span> <span class="n">text</span> <span 
class="n">file</span> <span class="n">schema</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">id</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">inDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Input</span> <span class="n">delimiter</span> <span 
class="n">character</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> &quot;<span 
class="p">[,</span><span class="o">\</span><span class="n">t</span><span 
class="p">]</span>&quot;
+  <span class="o">-</span><span class="n">f1</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">filter1</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">String</span> <span class="p">(</span><span 
class="n">or</span> <span class="n">regex</span><span class="p">)</span> <span 
class="n">whose</span> <span class="n">presence</span> <span 
class="n">indicates</span> <span class="n">a</span> <span 
class="n">datum</span> <span class="k">for</span> <span class="n">the</span> 
<span class="n">primary</span> <span class="n">item</span> <span 
class="n">set</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> <span class="n">no</span> 
<span class="n">filter</span><span class="p">,</span> <span 
class="n">all</span> <span class="n">data</span> <span class="n">is</span> 
<span class="n">used</span>
+  <span class="o">-</span><span class="n">f2</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">filter2</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">String</span> <span class="p">(</span><span 
class="n">or</span> <span class="n">regex</span><span class="p">)</span> <span 
class="n">whose</span> <span class="n">presence</span> <span 
class="n">indicates</span> <span class="n">a</span> <span 
class="n">datum</span> <span class="k">for</span> <span class="n">the</span> 
<span class="n">secondary</span> <span class="n">item</span> <span 
class="n">set</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span class="n">If</span> 
<span class="n">not</span> <span class="n">present</span> <span 
class="n">no</span> <span class="n">secondary</span> <span 
class="n">dataset</span> <span class="n">is</span> <span 
class="n">collected</span>
+  <span class="o">-</span><span class="n">rc</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">rowIDColumn</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Column</span> <span class="n">number</span> <span 
class="p">(</span>0 <span class="n">based</span> <span 
class="n">Int</span><span class="p">)</span> <span class="n">containing</span> 
<span class="n">the</span> <span class="n">row</span> <span class="n">ID</span> 
<span class="n">string</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> 0
+  <span class="o">-</span><span class="n">ic</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">itemIDColumn</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Column</span> <span class="n">number</span> <span 
class="p">(</span>0 <span class="n">based</span> <span 
class="n">Int</span><span class="p">)</span> <span class="n">containing</span> 
<span class="n">the</span> <span class="n">item</span> <span 
class="n">ID</span> <span class="n">string</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> 1
+  <span class="o">-</span><span class="n">fc</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">filterColumn</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Column</span> <span class="n">number</span> <span 
class="p">(</span>0 <span class="n">based</span> <span 
class="n">Int</span><span class="p">)</span> <span class="n">containing</span> 
<span class="n">the</span> <span class="n">filter</span> <span 
class="n">string</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> <span class="o">-</span>1 
<span class="k">for</span> <span class="n">no</span> <span 
class="n">filter</span>
+
+<span class="n">Using</span> <span class="n">all</span> <span 
class="n">defaults</span> <span class="n">the</span> <span 
class="n">input</span> <span class="n">is</span> <span 
class="n">expected</span> <span class="n">of</span> <span class="n">the</span> 
<span class="n">form</span><span class="p">:</span> &quot;<span 
class="n">userID</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span><span 
class="n">itemId</span>&quot; <span class="n">or</span> &quot;<span 
class="n">userID</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span><span 
class="n">itemID</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span><span class="n">any</span><span 
class="o">-</span><span class="n">text</span><span class="p">...</span>&quot; 
<span class="n">and</span> <span class="n">all</span> <span 
class="n">rows</span> <span class="n">will</span> <span class="n">be</span> 
<span class="n">used</span>
+
+<span class="n">File</span> <span class="n">discovery</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">r</span> <span class="o">|</span> 
<span class="o">--</span><span class="n">recursive</span>
+        <span class="n">Searched</span> <span class="n">the</span> <span 
class="o">-</span><span class="nb">i</span> <span class="n">path</span> <span 
class="n">recursively</span> <span class="k">for</span> <span 
class="n">files</span> <span class="n">that</span> <span class="n">match</span> 
<span class="o">--</span><span class="n">filenamePattern</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">),</span> 
<span class="n">Default</span><span class="p">:</span> <span 
class="n">false</span>
+  <span class="o">-</span><span class="n">fp</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">filenamePattern</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Regex</span> <span class="n">to</span> <span 
class="n">match</span> <span class="n">in</span> <span 
class="n">determining</span> <span class="n">input</span> <span 
class="n">files</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> <span 
class="n">filename</span> <span class="n">in</span> <span class="n">the</span> 
<span class="o">--</span><span class="n">input</span> <span 
class="n">option</span> <span class="n">or</span> &quot;^<span 
class="n">part</span><span class="o">-.*</span>&quot; <span class="k">if</span> 
<span class="o">--</span><span class="n">input</span> <span class="n">is</span> 
<span class="n">a</span> <span class="n">directory</span>
+
+<span class="n">Output</span> <span class="n">text</span> <span 
class="n">file</span> <span class="n">schema</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">rd</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">rowKeyDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Separates</span> <span class="n">the</span> <span 
class="n">rowID</span> <span class="n">key</span> <span class="n">from</span> 
<span class="n">the</span> <span class="n">vector</span> <span 
class="n">values</span> <span class="n">list</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">).</span> 
<span class="n">Default</span><span class="p">:</span> &quot;<span 
class="o">\</span><span class="n">t</span>&quot;
+  <span class="o">-</span><span class="n">cd</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">columnIdStrengthDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Separates</span> <span class="n">column</span> <span 
class="n">IDs</span> <span class="n">from</span> <span class="n">their</span> 
<span class="n">values</span> <span class="n">in</span> <span 
class="n">the</span> <span class="n">vector</span> <span 
class="n">values</span> <span class="n">list</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">).</span> 
<span class="n">Default</span><span class="p">:</span> &quot;<span 
class="p">:</span>&quot;
+  <span class="o">-</span><span class="n">td</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">elementDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Separates</span> <span class="n">vector</span> <span 
class="n">element</span> <span class="n">values</span> <span 
class="n">in</span> <span class="n">the</span> <span class="n">values</span> 
<span class="n">list</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> &quot; &quot;
+  <span class="o">-</span><span class="n">os</span> <span class="o">|</span> 
<span class="o">--</span><span class="n">omitStrength</span>
+        <span class="n">Do</span> <span class="n">not</span> <span 
class="n">write</span> <span class="n">the</span> <span 
class="n">strength</span> <span class="n">to</span> <span class="n">the</span> 
<span class="n">output</span> <span class="n">files</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">),</span> 
<span class="n">Default</span><span class="p">:</span> <span 
class="n">false</span><span class="p">.</span>
+<span class="n">This</span> <span class="n">option</span> <span 
class="n">is</span> <span class="n">used</span> <span class="n">to</span> <span 
class="n">output</span> <span class="n">indexable</span> <span 
class="n">data</span> <span class="k">for</span> <span 
class="n">creating</span> <span class="n">a</span> <span 
class="n">search</span> <span class="n">engine</span> <span 
class="n">recommender</span><span class="p">.</span>
+
+<span class="n">Default</span> <span class="n">delimiters</span> <span 
class="n">will</span> <span class="n">produce</span> <span 
class="n">output</span> <span class="n">of</span> <span class="n">the</span> 
<span class="n">form</span><span class="p">:</span> &quot;<span 
class="n">itemID1</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span><span 
class="n">itemID2</span><span class="p">:</span><span 
class="n">value2</span><span class="o">&lt;</span><span 
class="n">space</span><span class="o">&gt;</span><span 
class="n">itemID10</span><span class="p">:</span><span 
class="n">value10</span><span class="p">...</span>&quot;
+
+<span class="n">Spark</span> <span class="n">config</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">ma</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">master</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Spark</span> <span class="n">Master</span> <span 
class="n">URL</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> &quot;<span 
class="n">local</span>&quot;<span class="p">.</span> <span 
class="n">Note</span> <span class="n">that</span> <span class="n">you</span> 
<span class="n">can</span> <span class="n">specify</span> <span 
class="n">the</span> <span class="n">number</span> <span class="n">of</span> 
<span class="n">cores</span> <span class="n">to</span> <span 
class="n">get</span> <span class="n">a</span> <span 
class="n">performance</span> <span class="n">improvement</span><span 
class="p">,</span> <span class="k">for</span> <span class="n">example</span> 
&quot;<span class="n">local</span><span class="p">[</span>4<span 
class="p">]</span>&quot;
+  <span class="o">-</span><span class="n">sem</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">sparkExecutorMem</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Max</span> <span class="n">Java</span> <span 
class="n">heap</span> <span class="n">available</span> <span 
class="n">as</span> &quot;<span class="n">executor</span> <span 
class="n">memory</span>&quot; <span class="n">on</span> <span 
class="n">each</span> <span class="n">node</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> 4<span class="n">g</span>
+  <span class="o">-</span><span class="n">rs</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">randomSeed</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+
+  <span class="o">-</span><span class="n">h</span> <span class="o">|</span> 
<span class="o">--</span><span class="n">help</span>
+        <span class="n">prints</span> <span class="n">this</span> <span 
class="n">usage</span> <span class="n">text</span>
+</pre></div>
+
+
+<p>This looks daunting, but the defaults are simple and sensible: they accept 
exactly the same input as the legacy code, and the options make the job quite 
flexible. The user can point to a single text file, a directory full of files, or a tree of 
directories to be traversed recursively. The files to include can be specified 
with either a regex-style pattern or a filename. The schema for the file is 
defined by column numbers, which map to the important bits of data, including 
IDs and values. The files can even contain filters, which allow unneeded rows 
to be discarded or used for cross-cooccurrence calculations.</p>
+<p>See ItemSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. </p>
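+<p>The file discovery rules described above can be sketched roughly as follows. 
This is only an illustration of the documented behavior, not Mahout's implementation: 
the function name <code>find_input_files</code> and its parameters are hypothetical, and the 
"^part-.*" fallback is taken from the usage text for the case where the input is a directory.</p>

```python
import os
import re


def find_input_files(input_path, filename_pattern=None, recursive=False):
    """Approximate the documented --input / --recursive / --filenamePattern rules.

    Hypothetical helper for illustration only; it is not part of Mahout.
    """
    # A single file is used as-is.
    if os.path.isfile(input_path):
        return [input_path]

    # For a directory, fall back to the documented default pattern.
    pattern = re.compile(filename_pattern or r"^part-.*")
    matches = []
    for dirpath, dirnames, filenames in os.walk(input_path):
        for name in filenames:
            if pattern.search(name):
                matches.append(os.path.join(dirpath, name))
        if not recursive:
            # Clearing dirnames in place stops os.walk from descending.
            dirnames[:] = []
    return sorted(matches)
```

+<p>Calling <code>find_input_files("in-dir", recursive=True)</code> would return every 
part file in the tree, while the non-recursive call only looks at the top-level directory.</p>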
+<h3 id="defaults-in-the-spark-itemsimilarity-cli">Defaults in the 
<em><strong>spark-itemsimilarity</strong></em> CLI</h3>
+<p>If all defaults are used the input can be as simple as:</p>
+<div class="codehilite"><pre><span class="n">userID1</span><span 
class="p">,</span><span class="n">itemID1</span>
+<span class="n">userID2</span><span class="p">,</span><span 
class="n">itemID2</span>
+<span class="p">...</span>
+</pre></div>
+
+
+<p>With the command line:</p>
+<div class="codehilite"><pre><span class="n">bash</span>$ <span 
class="n">mahout</span> <span class="n">spark</span><span 
class="o">-</span><span class="n">itemsimilarity</span> <span 
class="o">--</span><span class="n">input</span> <span class="n">in</span><span 
class="o">-</span><span class="n">file</span> <span class="o">--</span><span 
class="n">output</span> <span class="n">out</span><span class="o">-</span><span 
class="n">dir</span>
+</pre></div>
+
+
+<p>This will use the "local" Spark context and will output the standard text 
version of a DRM:</p>
+<div class="codehilite"><pre><span class="n">itemID1</span><span 
class="o">&lt;</span><span class="n">tab</span><span class="o">&gt;</span><span 
class="n">itemID2</span><span class="p">:</span><span 
class="n">value2</span><span class="o">&lt;</span><span 
class="n">space</span><span class="o">&gt;</span><span 
class="n">itemID10</span><span class="p">:</span><span 
class="n">value10</span><span class="p">...</span>
+</pre></div>
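+<p>A downstream consumer can read this text format with a few lines of code. 
The sketch below assumes the default output delimiters listed in the usage text (tab, space, and ":"); 
the function name <code>parse_drm_line</code> is hypothetical and is not part of Mahout.</p>

```python
def parse_drm_line(line, row_key_delim="\t", element_delim=" ", strength_delim=":"):
    """Split one text DRM line into (row_id, [(item_id, strength), ...]).

    Hypothetical helper; the defaults mirror the documented default delimiters.
    A strength of None means the element carried no value (as with --omitStrength).
    """
    row_id, _, rest = line.rstrip("\n").partition(row_key_delim)
    similar = []
    for element in rest.split(element_delim):
        if not element:
            continue  # a row with no similar items has an empty values list
        item_id, sep, strength = element.rpartition(strength_delim)
        if sep:
            similar.append((item_id, float(strength)))
        else:
            similar.append((element, None))
    return row_id, similar
```

+<p>For example, <code>parse_drm_line("itemID1\titemID2:0.5 itemID10:0.25")</code> yields the row ID 
<code>itemID1</code> together with two (item, strength) pairs.</p>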
+
+
+<h3 id="wzxhzdk18how-to-use-multiple-user-actionswzxhzdk19"><a 
name="multiple-actions">How To Use Multiple User Actions</a></h3>
+<p>Often we record various actions the user takes for later analytics. These 
can now be used to make recommendations. 
+A recommender should recommend the action you want the user to 
take. For an e-commerce app this might be 
+a purchase. It is usually not a good idea to treat other actions 
the same as the action you want to recommend. 
+For instance, a view of an item does not indicate the same intent as a purchase, 
and if you simply mixed the two together you 
+might even make worse recommendations. It is tempting, though, since there are 
so many more views than purchases. With <em>spark-itemsimilarity</em>
+we can now use both actions. Mahout will use cross-action cooccurrence 
analysis to limit the views to ones that do predict purchases.
+We do this by treating the primary action (purchase) as data for the indicator 
matrix and using the secondary action (view) 
+to calculate the cross-cooccurrence indicator matrix.</p>
+<p><em>spark-itemsimilarity</em> can read separate actions from separate files 
or from a mixed action log by filtering certain lines. For a mixed 
+action log of the form:</p>
+<div class="codehilite"><pre><span class="n">u1</span><span 
class="p">,</span><span class="n">purchase</span><span class="p">,</span><span 
class="n">iphone</span>
+<span class="n">u1</span><span class="p">,</span><span 
class="n">purchase</span><span class="p">,</span><span class="n">ipad</span>
+<span class="n">u2</span><span class="p">,</span><span 
class="n">purchase</span><span class="p">,</span><span class="n">nexus</span>
+<span class="n">u2</span><span class="p">,</span><span 
class="n">purchase</span><span class="p">,</span><span class="n">galaxy</span>
+<span class="n">u3</span><span class="p">,</span><span 
class="n">purchase</span><span class="p">,</span><span class="n">surface</span>
+<span class="n">u4</span><span class="p">,</span><span 
class="n">purchase</span><span class="p">,</span><span class="n">iphone</span>
+<span class="n">u4</span><span class="p">,</span><span 
class="n">purchase</span><span class="p">,</span><span class="n">galaxy</span>
+<span class="n">u1</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">iphone</span>
+<span class="n">u1</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">ipad</span>
+<span class="n">u1</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">nexus</span>
+<span class="n">u1</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">galaxy</span>
+<span class="n">u2</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">iphone</span>
+<span class="n">u2</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">ipad</span>
+<span class="n">u2</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">nexus</span>
+<span class="n">u2</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">galaxy</span>
+<span class="n">u3</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">surface</span>
+<span class="n">u3</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">nexus</span>
+<span class="n">u4</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">iphone</span>
+<span class="n">u4</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">ipad</span>
+<span class="n">u4</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">galaxy</span>
+</pre></div>
+
+
+<h3 id="command-line">Command Line</h3>
+<p>Use the following options:</p>
+<div class="codehilite"><pre><span class="n">bash</span>$ <span 
class="n">mahout</span> <span class="n">spark</span><span 
class="o">-</span><span class="n">itemsimilarity</span> <span class="o">\</span>
+    <span class="o">--</span><span class="n">input</span> <span 
class="n">in</span><span class="o">-</span><span class="n">file</span> <span 
class="o">\</span>     # <span class="n">where</span> <span class="n">to</span> 
<span class="n">look</span> <span class="k">for</span> <span 
class="n">data</span>
+    <span class="o">--</span><span class="n">output</span> <span 
class="n">out</span><span class="o">-</span><span class="n">path</span> <span 
class="o">\</span>   # <span class="n">root</span> <span class="n">dir</span> 
<span class="k">for</span> <span class="n">output</span>
+    <span class="o">--</span><span class="n">master</span> <span 
class="n">masterUrl</span> <span class="o">\</span>  # <span 
class="n">URL</span> <span class="n">of</span> <span class="n">the</span> <span 
class="n">Spark</span> <span class="n">master</span> <span 
class="n">server</span>
+    <span class="o">--</span><span class="n">filter1</span> <span 
class="n">purchase</span> <span class="o">\</span>  # <span 
class="n">word</span> <span class="n">that</span> <span class="n">flags</span> 
<span class="n">input</span> <span class="k">for</span> <span 
class="n">the</span> <span class="n">primary</span> <span 
class="n">action</span>
+    <span class="o">--</span><span class="n">filter2</span> <span 
class="n">view</span> <span class="o">\</span>      # <span 
class="n">word</span> <span class="n">that</span> <span class="n">flags</span> 
<span class="n">input</span> <span class="k">for</span> <span 
class="n">the</span> <span class="n">secondary</span> <span 
class="n">action</span>
+    <span class="o">--</span><span class="n">itemIDColumn</span> 2 <span 
class="o">\</span>  # <span class="n">column</span> <span class="n">that</span> 
<span class="n">has</span> <span class="n">the</span> <span 
class="n">item</span> <span class="n">ID</span>
+    <span class="o">--</span><span class="n">rowIDColumn</span> 0 <span 
class="o">\</span>   # <span class="n">column</span> <span 
class="n">that</span> <span class="n">has</span> <span class="n">the</span> 
<span class="n">user</span> <span class="n">ID</span>
+    <span class="o">--</span><span class="n">filterColumn</span> 1    # 
<span class="n">column</span> <span class="n">that</span> <span 
class="n">has</span> <span class="n">the</span> <span class="n">filter</span> 
<span class="n">word</span>
+</pre></div>
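+<p>To make the role of the filter words and column numbers concrete, the following sketch splits a 
mixed action log into primary and secondary (user, item) pairs. It is an illustration only and not 
the driver's code: the function name <code>split_actions</code> is hypothetical, and both the comma 
delimiter and the exact-match comparison on the filter column are assumptions made for this example.</p>

```python
def split_actions(lines, filter1, filter2,
                  row_col=0, item_col=2, filter_col=1, delim=","):
    """Split a mixed action log into primary and secondary (user, item) pairs.

    Hypothetical helper. Assumptions: fields are separated by `delim`, and a
    line belongs to an action when its filter column equals the filter word.
    """
    primary, secondary = [], []
    needed = max(row_col, item_col, filter_col)
    for line in lines:
        fields = line.strip().split(delim)
        if len(fields) <= needed:
            continue  # skip blank or malformed lines
        pair = (fields[row_col], fields[item_col])
        if fields[filter_col] == filter1:
            primary.append(pair)
        elif fields[filter_col] == filter2:
            secondary.append(pair)
        # lines matching neither filter word are discarded
    return primary, secondary


log = ["u1,purchase,iphone", "u1,view,ipad", "u2,view,nexus"]
purchases, views = split_actions(log, "purchase", "view")
```

+<p>Here <code>purchases</code> holds the data for the primary indicator matrix and <code>views</code> 
holds the data for the cross-cooccurrence indicator matrix.</p>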
+
+
+<h3 id="output">Output</h3>
+<p>The output of the job will be the standard text version of two Mahout DRMs. 
Because we are calculating 
+cross-cooccurrence, both a primary indicator matrix and a cross-cooccurrence 
indicator matrix will be created:</p>
+<div class="codehilite"><pre><span class="n">out</span><span 
class="o">-</span><span class="n">path</span>
+  <span class="o">|--</span> <span class="n">similarity</span><span 
class="o">-</span><span class="n">matrix</span> <span class="o">-</span> <span 
class="n">TDF</span> <span class="n">part</span> <span class="n">files</span>
+  <span class="o">\--</span> <span class="nb">cross</span><span 
class="o">-</span><span class="n">similarity</span><span 
class="o">-</span><span class="n">matrix</span> <span class="o">-</span> <span 
class="n">TDF</span> <span class="n">part</span> <span 
class="n">files</span>
+</pre></div>
+
+
+<p>The similarity-matrix will contain the lines:</p>
+<div class="codehilite"><pre><span class="n">galaxy</span><span 
class="o">\</span><span class="n">tnexus</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847
+<span class="n">ipad</span><span class="o">\</span><span 
class="n">tiphone</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847
+<span class="n">nexus</span><span class="o">\</span><span 
class="n">tgalaxy</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847
+<span class="n">iphone</span><span class="o">\</span><span 
class="n">tipad</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847
+<span class="n">surface</span>
+</pre></div>
+
+
+<p>The cross-similarity-matrix will contain:</p>
+<div class="codehilite"><pre><span class="n">iphone</span><span 
class="o">\</span><span class="n">tnexus</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847 <span class="n">iphone</span><span 
class="p">:</span>1<span class="p">.</span>7260924347106847 <span 
class="n">ipad</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847 <span class="n">galaxy</span><span 
class="p">:</span>1<span class="p">.</span>7260924347106847
+<span class="n">ipad</span><span class="o">\</span><span 
class="n">tnexus</span><span class="p">:</span>0<span 
class="p">.</span>6795961471815897 <span class="n">iphone</span><span 
class="p">:</span>0<span class="p">.</span>6795961471815897 <span 
class="n">ipad</span><span class="p">:</span>0<span 
class="p">.</span>6795961471815897 <span class="n">galaxy</span><span 
class="p">:</span>0<span class="p">.</span>6795961471815897
+<span class="n">nexus</span><span class="o">\</span><span 
class="n">tnexus</span><span class="p">:</span>0<span 
class="p">.</span>6795961471815897 <span class="n">iphone</span><span 
class="p">:</span>0<span class="p">.</span>6795961471815897 <span 
class="n">ipad</span><span class="p">:</span>0<span 
class="p">.</span>6795961471815897 <span class="n">galaxy</span><span 
class="p">:</span>0<span class="p">.</span>6795961471815897
+<span class="n">galaxy</span><span class="o">\</span><span 
class="n">tnexus</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847 <span class="n">iphone</span><span 
class="p">:</span>1<span class="p">.</span>7260924347106847 <span 
class="n">ipad</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847 <span class="n">galaxy</span><span 
class="p">:</span>1<span class="p">.</span>7260924347106847
+<span class="n">surface</span><span class="o">\</span><span 
class="n">tsurface</span><span class="p">:</span>4<span 
class="p">.</span>498681156950466 <span class="n">nexus</span><span 
class="p">:</span>0<span class="p">.</span>6795961471815897
+</pre></div>
+
+
+<p><strong>Note:</strong> You can run this multiple times to use more than two 
actions or you can use the underlying 
+SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any 
number of cross-cooccurrence indicators.</p>
+<h3 id="log-file-input">Log File Input</h3>
+<p>A common method of storing data is in log files. If they are written using 
some delimiter, they can be consumed directly by spark-itemsimilarity. For 
instance, input of the form:</p>
+<div class="codehilite"><pre>2014<span class="o">-</span>06<span 
class="o">-</span>23 14<span class="p">:</span>46<span 
class="p">:</span>53<span class="p">.</span>115<span class="o">\</span><span 
class="n">tu1</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tiphone</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu1</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tipad</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu2</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tnexus</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu2</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tgalaxy</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu3</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tsurface</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu4</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tiphone</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu4</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tgalaxy</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu1</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tiphone</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu1</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span class="n">tipad</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu1</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span class="n">tnexus</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu1</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tgalaxy</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu2</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tiphone</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu2</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span class="n">tipad</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu2</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span class="n">tnexus</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu2</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tgalaxy</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu3</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tsurface</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu3</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span class="n">tnexus</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu4</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tiphone</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu4</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span class="n">tipad</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu4</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tgalaxy</span>
+</pre></div>
+
+
+<p>This input can be parsed with the following command line and run on the cluster, producing the 
same output as the example above.</p>
+<div class="codehilite"><pre><span class="n">bash</span>$ <span 
class="n">mahout</span> <span class="n">spark</span><span 
class="o">-</span><span class="n">itemsimilarity</span> <span class="o">\</span>
+    <span class="o">--</span><span class="n">input</span> <span 
class="n">in</span><span class="o">-</span><span class="n">file</span> <span 
class="o">\</span>
+    <span class="o">--</span><span class="n">output</span> <span 
class="n">out</span><span class="o">-</span><span class="n">path</span> <span 
class="o">\</span>
+    <span class="o">--</span><span class="n">master</span> <span 
class="n">spark</span><span class="p">:</span><span class="o">//</span><span 
class="n">sparkmaster</span><span class="p">:</span>4044 <span 
class="o">\</span>
+    <span class="o">--</span><span class="n">filter1</span> <span 
class="n">purchase</span> <span class="o">\</span>
+    <span class="o">--</span><span class="n">filter2</span> <span 
class="n">view</span> <span class="o">\</span>
+    <span class="o">--</span><span class="n">inDelim</span> &quot;<span 
class="o">\</span><span class="n">t</span>&quot; <span class="o">\</span>
+    <span class="o">--</span><span class="n">itemIDColumn</span> 4 <span 
class="o">\</span>
+    <span class="o">--</span><span class="n">rowIDColumn</span> 1 <span 
class="o">\</span>
+    <span class="o">--</span><span class="n">filterColumn</span> 2
+</pre></div>
+
+
+<h2 id="2-spark-rowsimilarity">2. spark-rowsimilarity</h2>
+<p><em>spark-rowsimilarity</em> is the companion to <em>spark-itemsimilarity</em>. The primary difference is that it takes a text file version of
+a matrix of sparse vectors with optional application-specific IDs and finds similar rows rather than similar items (columns). Its use is
+not limited to collaborative filtering. The input is delimited text that uses three delimiters. By
+default it reads (rowID&lt;tab&gt;columnID1:strength1&lt;space&gt;columnID2:strength2...). Since this job only supports LLR similarity,
+which does not use the input strengths, they may be omitted from the input. It writes
+(rowID&lt;tab&gt;rowID1:strength1&lt;space&gt;rowID2:strength2...),
+sorted by strength descending. Each output line can be read as a row ID from the primary input followed
+by a list of the most similar rows.</p>
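+<p>As a rough illustration of the default text format, the following Python sketch parses one line into a row ID and a map of column IDs to strengths. The function <em>parse_row</em> and its parameter names are hypothetical (loosely named after the delimiter options), and defaulting a missing strength to 1.0 is an assumption made only for this example.</p>

```python
def parse_row(line, row_key_delim="\t", column_id_strength_delim=":",
              element_delim=" "):
    """Return (rowID, {columnID: strength}) for one delimited text line."""
    row_id, _, rest = line.rstrip("\n").partition(row_key_delim)
    elements = {}
    for element in rest.split(element_delim):
        if not element:
            continue  # tolerate repeated or trailing delimiters
        col_id, sep, strength = element.partition(column_id_strength_delim)
        # Assumption: when the strength is omitted, treat it as 1.0.
        elements[col_id] = float(strength) if sep else 1.0
    return row_id, elements


print(parse_row("row1\tcolA:0.5 colB:2"))
print(parse_row("row1\tcolA colB"))
```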
+<p>The command line interface is:</p>
+<div class="codehilite"><pre><span class="n">spark</span><span 
class="o">-</span><span class="n">rowsimilarity</span> <span 
class="n">Mahout</span> 1<span class="p">.</span>0
+<span class="n">Usage</span><span class="p">:</span> <span 
class="n">spark</span><span class="o">-</span><span 
class="n">rowsimilarity</span> <span class="p">[</span><span 
class="n">options</span><span class="p">]</span>
+
+<span class="n">Input</span><span class="p">,</span> <span 
class="n">output</span> <span class="n">options</span>
+  <span class="o">-</span><span class="nb">i</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">input</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Input</span> <span class="n">path</span><span 
class="p">,</span> <span class="n">may</span> <span class="n">be</span> <span 
class="n">a</span> <span class="n">filename</span><span class="p">,</span> 
<span class="n">directory</span> <span class="n">name</span><span 
class="p">,</span> <span class="n">or</span> <span class="n">comma</span> <span 
class="n">delimited</span> <span class="n">list</span> <span 
class="n">of</span> <span class="n">HDFS</span> <span 
class="n">supported</span> <span class="n">URIs</span> <span 
class="p">(</span><span class="n">required</span><span class="p">)</span>
+  <span class="o">-</span><span class="n">o</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">output</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Path</span> <span class="k">for</span> <span 
class="n">output</span><span class="p">,</span> <span class="n">any</span> 
<span class="n">local</span> <span class="n">or</span> <span 
class="n">HDFS</span> <span class="n">supported</span> <span 
class="n">URI</span> <span class="p">(</span><span 
class="n">required</span><span class="p">)</span>
+
+<span class="n">Algorithm</span> <span class="n">control</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">mo</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">maxObservations</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Max</span> <span class="n">number</span> <span 
class="n">of</span> <span class="n">observations</span> <span 
class="n">to</span> <span class="n">consider</span> <span class="n">per</span> 
<span class="n">row</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> 500
+  <span class="o">-</span><span class="n">m</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">maxSimilaritiesPerRow</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Limit</span> <span class="n">the</span> <span 
class="n">number</span> <span class="n">of</span> <span 
class="n">similarities</span> <span class="n">per</span> <span 
class="n">item</span> <span class="n">to</span> <span class="n">this</span> 
<span class="n">number</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> 100
+
+<span class="n">Note</span><span class="p">:</span> <span 
class="n">Only</span> <span class="n">the</span> <span class="n">Log</span> 
<span class="n">Likelihood</span> <span class="n">Ratio</span> <span 
class="p">(</span><span class="n">LLR</span><span class="p">)</span> <span 
class="n">is</span> <span class="n">supported</span> <span class="n">as</span> 
<span class="n">a</span> <span class="n">similarity</span> <span 
class="n">measure</span><span class="p">.</span>
+
+<span class="n">Output</span> <span class="n">text</span> <span 
class="n">file</span> <span class="n">schema</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">rd</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">rowKeyDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Separates</span> <span class="n">the</span> <span 
class="n">rowID</span> <span class="n">key</span> <span class="n">from</span> 
<span class="n">the</span> <span class="n">vector</span> <span 
class="n">values</span> <span class="n">list</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">).</span> 
<span class="n">Default</span><span class="p">:</span> &quot;<span 
class="o">\</span><span class="n">t</span>&quot;
+  <span class="o">-</span><span class="n">cd</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">columnIdStrengthDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Separates</span> <span class="n">column</span> <span 
class="n">IDs</span> <span class="n">from</span> <span class="n">their</span> 
<span class="n">values</span> <span class="n">in</span> <span 
class="n">the</span> <span class="n">vector</span> <span 
class="n">values</span> <span class="n">list</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">).</span> 
<span class="n">Default</span><span class="p">:</span> &quot;<span 
class="p">:</span>&quot;
+  <span class="o">-</span><span class="n">td</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">elementDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Separates</span> <span class="n">vector</span> <span 
class="n">element</span> <span class="n">values</span> <span 
class="n">in</span> <span class="n">the</span> <span class="n">values</span> 
<span class="n">list</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> &quot; &quot;
+  <span class="o">-</span><span class="n">os</span> <span class="o">|</span> 
<span class="o">--</span><span class="n">omitStrength</span>
+        <span class="n">Do</span> <span class="n">not</span> <span 
class="n">write</span> <span class="n">the</span> <span 
class="n">strength</span> <span class="n">to</span> <span class="n">the</span> 
<span class="n">output</span> <span class="n">files</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">),</span> 
<span class="n">Default</span><span class="p">:</span> <span 
class="n">false</span><span class="p">.</span>
+<span class="n">This</span> <span class="n">option</span> <span 
class="n">is</span> <span class="n">used</span> <span class="n">to</span> <span 
class="n">output</span> <span class="n">indexable</span> <span 
class="n">data</span> <span class="k">for</span> <span 
class="n">creating</span> <span class="n">a</span> <span 
class="n">search</span> <span class="n">engine</span> <span 
class="n">recommender</span><span class="p">.</span>
+
+<span class="n">Default</span> <span class="n">delimiters</span> <span 
class="n">will</span> <span class="n">produce</span> <span 
class="n">output</span> <span class="n">of</span> <span class="n">the</span> 
<span class="n">form</span><span class="p">:</span> &quot;<span 
class="n">itemID1</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span><span 
class="n">itemID2</span><span class="p">:</span><span 
class="n">value2</span><span class="o">&lt;</span><span 
class="n">space</span><span class="o">&gt;</span><span 
class="n">itemID10</span><span class="p">:</span><span 
class="n">value10</span><span class="p">...</span>&quot;
+
+<span class="n">File</span> <span class="n">discovery</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">r</span> <span class="o">|</span> 
<span class="o">--</span><span class="n">recursive</span>
+        <span class="n">Searched</span> <span class="n">the</span> <span 
class="o">-</span><span class="nb">i</span> <span class="n">path</span> <span 
class="n">recursively</span> <span class="k">for</span> <span 
class="n">files</span> <span class="n">that</span> <span class="n">match</span> 
<span class="o">--</span><span class="n">filenamePattern</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">),</span> 
<span class="n">Default</span><span class="p">:</span> <span 
class="n">false</span>
+  <span class="o">-</span><span class="n">fp</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">filenamePattern</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Regex</span> <span class="n">to</span> <span 
class="n">match</span> <span class="n">in</span> <span 
class="n">determining</span> <span class="n">input</span> <span 
class="n">files</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> <span 
class="n">filename</span> <span class="n">in</span> <span class="n">the</span> 
<span class="o">--</span><span class="n">input</span> <span 
class="n">option</span> <span class="n">or</span> &quot;^<span 
class="n">part</span><span class="o">-.*</span>&quot; <span class="k">if</span> 
<span class="o">--</span><span class="n">input</span> <span class="n">is</span> 
<span class="n">a</span> <span class="n">directory</span>
+
+<span class="n">Spark</span> <span class="n">config</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">ma</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">master</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Spark</span> <span class="n">Master</span> <span 
class="n">URL</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> &quot;<span 
class="n">local</span>&quot;<span class="p">.</span> <span 
class="n">Note</span> <span class="n">that</span> <span class="n">you</span> 
<span class="n">can</span> <span class="n">specify</span> <span 
class="n">the</span> <span class="n">number</span> <span class="n">of</span> 
<span class="n">cores</span> <span class="n">to</span> <span 
class="n">get</span> <span class="n">a</span> <span 
class="n">performance</span> <span class="n">improvement</span><span 
class="p">,</span> <span class="k">for</span> <span class="n">example</span> 
&quot;<span class="n">local</span><span class="p">[</span>4<span 
class="p">]</span>&quot;
+  <span class="o">-</span><span class="n">sem</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">sparkExecutorMem</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Max</span> <span class="n">Java</span> <span 
class="n">heap</span> <span class="n">available</span> <span 
class="n">as</span> &quot;<span class="n">executor</span> <span 
class="n">memory</span>&quot; <span class="n">on</span> <span 
class="n">each</span> <span class="n">node</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> 4<span class="n">g</span>
+  <span class="o">-</span><span class="n">rs</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">randomSeed</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+
+  <span class="o">-</span><span class="n">h</span> <span class="o">|</span> 
<span class="o">--</span><span class="n">help</span>
+        <span class="n">prints</span> <span class="n">this</span> <span 
class="n">usage</span> <span class="n">text</span>
+</pre></div>
+
+
+<p>See RowSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. </p>
+<h1 id="3-using-spark-rowsimilarity-with-text-data">3. Using 
<em>spark-rowsimilarity</em> with Text Data</h1>
+<p>Another use case for <em>spark-rowsimilarity</em> is finding similar textual content. For instance, given the tags associated with
+a blog post, it can find which other posts have similar tags. In this case the columns are tags and the rows are posts. Since LLR is
+the only similarity method supported, this is not the optimal way to determine general "bag-of-words" document similarity;
+LLR is used more as a quality filter than as a similarity measure. However, <em>spark-rowsimilarity</em> will produce
+lists of similar docs for every doc if the input is docs with lists of terms. The Apache <a href="http://lucene.apache.org">Lucene</a> project provides several
+methods of <a href="http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description">analyzing and tokenizing</a> documents.</p>
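+<p>A minimal sketch of preparing such input in Python is shown below. It uses a simple regular-expression tokenizer as a stand-in for a real analyzer such as Lucene's, and the function name <em>to_row_line</em> and the sample posts are made up for illustration. Each document becomes one line of the form docID&lt;tab&gt;term1&lt;space&gt;term2..., with strengths omitted.</p>

```python
import re


def to_row_line(doc_id, text):
    """Format one document as 'docID<tab>term1 term2 ...' with unique lowercase terms."""
    terms = re.findall(r"[a-z0-9]+", text.lower())  # naive tokenizer, not Lucene
    unique_terms = list(dict.fromkeys(terms))  # de-duplicate, keep first-seen order
    return doc_id + "\t" + " ".join(unique_terms)


posts = {
    "post1": "Spark and Mahout",
    "post2": "Mahout recommenders, with Spark!",
}
for doc_id, text in posts.items():
    print(to_row_line(doc_id, text))
```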
+<h1 id="wzxhzdk244-creating-a-multimodal-recommenderwzxhzdk25"><a 
name="unified-recommender">4. Creating a Multimodal Recommender</a></h1>
+<p>Using the output of <em>spark-itemsimilarity</em> and <em>spark-rowsimilarity</em> you can build a multimodal cooccurrence and content based
+ recommender that can be used in either mode or both, depending on the indicators and the user history available at
+runtime. Some slides describing this method can be found <a href="http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/">here</a>.</p>
+<h2 id="requirements">Requirements</h2>
+<ol>
+<li>Mahout SNAPSHOT-1.0 or later</li>
+<li>Hadoop</li>
+<li>Spark, the correct version for your version of Mahout and Hadoop</li>
+<li>A search engine like Solr or Elasticsearch</li>
+</ol>
+<h2 id="indicators">Indicators</h2>
+<p>Indicators come in 3 types:</p>
+<ol>
+<li><strong>Cooccurrence</strong>: calculated with 
<em>spark-itemsimilarity</em> from user actions</li>
+<li><strong>Content</strong>: calculated from item metadata or content using 
<em>spark-rowsimilarity</em></li>
+<li><strong>Intrinsic</strong>: assigned to items as metadata. Can be anything 
that describes the item.</li>
+</ol>
+<p>The query for recommendations will be a mix of values, each meant to match one of your indicators. The query can be constructed
+from user history and values derived from context (category being viewed for instance) or special precalculated data
+(popularity rank for instance). This blending of indicators allows for creating many flavors of recommendations to fit
+a very wide variety of circumstances.</p>
+<p>With the right mix of indicators, developers can construct a single query that works for completely new items and new users
+while working well for items with lots of interactions and users with many recorded actions. In other words, by adding content and intrinsic
+indicators, developers can create a solution for the "cold-start" problem that gracefully improves with more user history
+and as items have more interactions. It is also possible to create a completely content-based recommender that personalizes
+recommendations.</p>
+<h2 id="example-with-3-indicators">Example with 3 Indicators</h2>
+<p>You will need to decide how to store user action data so it can be processed by the item and row similarity jobs;
+this is most easily done by using text files as described above. The data processed by these jobs is considered the
+training data. You will also need some amount of user history in your recs query. It is typical to use the most recent user history,
+which need not be exactly what is in the training set; the training set may include a greater volume of historical data. Keeping the user
+history for query purposes could be done with a database by storing it in a users table. In the example above the two
+collaborative filtering actions are "purchase" and "view", but let's also add tags (taken from catalog categories or other
+descriptive metadata). </p>
+<p>We will need to create 1 cooccurrence indicator from the primary action (purchase), 1 cross-action cooccurrence indicator
+from the secondary action (view),
+and 1 content indicator (tags). We'll have to run <em>spark-itemsimilarity</em> once and <em>spark-rowsimilarity</em> once.</p>
+<p>We have described how to create the collaborative filtering indicators for purchase and view (the <a href="#multiple-actions">How to use Multiple User
+Actions</a> section) but tags will be a slightly different process. We want to use the fact that
+certain items have tags similar to the ones associated with a user's purchases. This is not a collaborative filtering indicator
+but rather a "content" or "metadata" type indicator since you are not using other users' history, only that of the
+individual you are making recs for. This means that this method will make recommendations for items that have
+no collaborative filtering data, as happens with new items in a catalog. New items may have tags assigned but no one
+has purchased or viewed them yet. In the final query we will mix all 3 indicators.</p>
+<h2 id="content-indicator">Content Indicator</h2>
+<p>To create a content indicator we'll make use of the fact that the user has purchased items with certain tags. We want to find
+items with the most similar tags. Notice that other users' behavior is not considered, only other items' tags. This defines a
+content or metadata indicator. Such indicators are used when you want to find items that are similar to other items by using their
+content or metadata, not by which users interacted with them.</p>
+<p><strong>Note</strong>: It may be advisable to treat tags as 
cross-cooccurrence indicators but for the sake of an example they are treated 
here as content only.</p>
+<p>For this we need input of the form:</p>
+<div class="codehilite"><pre><span class="n">itemID</span><span 
class="o">&lt;</span><span class="n">tab</span><span class="o">&gt;</span><span 
class="n">list</span><span class="o">-</span><span class="n">of</span><span 
class="o">-</span><span class="n">tags</span>
+<span class="p">...</span>
+</pre></div>
+
+
+<p>The full collection will look like the tags column from a catalog DB. For 
our ecom example it might be:</p>
+<div class="codehilite"><pre>3459860<span class="n">b</span><span 
class="o">&lt;</span><span class="n">tab</span><span class="o">&gt;</span><span 
class="n">men</span> <span class="n">long</span><span class="o">-</span><span 
class="n">sleeve</span> <span class="n">chambray</span> <span 
class="n">clothing</span> <span class="n">casual</span>
+9446577<span class="n">d</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span><span class="n">women</span> 
<span class="n">tops</span> <span class="n">chambray</span> <span 
class="n">clothing</span> <span class="n">casual</span>
+<span class="p">...</span>
+</pre></div>
+
+
+<p>We'll use <em>spark-rowsimilarity</em> because we are looking for similar rows, which encode items in this case. As with the
+collaborative filtering indicators we use the --omitStrength option. The strengths created are
+probabilistic log-likelihood ratios and so are used to filter unimportant similarities. Once the filtering or downsampling
+is finished we no longer need the strengths. We will get an indicator matrix of the form:</p>
+<div class="codehilite"><pre><span class="n">itemID</span><span 
class="o">&lt;</span><span class="n">tab</span><span class="o">&gt;</span><span 
class="n">list</span><span class="o">-</span><span class="n">of</span><span 
class="o">-</span><span class="n">item</span> <span class="n">IDs</span>
+<span class="p">...</span>
+</pre></div>
+
+
+<p>This is a content indicator since it has found other items with similar 
content or metadata.</p>
+<div class="codehilite"><pre>3459860<span class="n">b</span><span 
class="o">&lt;</span><span class="n">tab</span><span 
class="o">&gt;</span>3459860<span class="n">b</span> 3459860<span 
class="n">b</span> 6749860<span class="n">c</span> 5959860<span 
class="n">a</span> 3434860<span class="n">a</span> 3477860<span 
class="n">a</span>
+9446577<span class="n">d</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span>9446577<span class="n">d</span> 
9496577<span class="n">d</span> 0943577<span class="n">d</span> 8346577<span 
class="n">d</span> 9442277<span class="n">d</span> 9446577<span 
class="n">e</span>
+<span class="p">...</span>
+</pre></div>
+
+
+<p>We now have three indicators, two collaborative filtering type and one 
content type.</p>
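+<p>Before indexing, the indicator files can be merged into one document per item, with one field per indicator. The Python sketch below shows one possible way to do that for output written with --omitStrength. The helper names <em>read_indicator</em> and <em>build_documents</em>, the field names, and the sample IDs are all illustrative assumptions; how the documents are then sent to Solr or Elasticsearch is left out.</p>

```python
def read_indicator(lines):
    """Parse 'itemID<tab>id1 id2 ...' lines into {itemID: [similar IDs]}."""
    indicator = {}
    for line in lines:
        item_id, _, rest = line.rstrip("\n").partition("\t")
        if item_id:
            indicator[item_id] = rest.split()
    return indicator


def build_documents(indicators):
    """Merge {field: {itemID: [IDs]}} into one indexable document per item."""
    docs = {}
    for field, indicator in indicators.items():
        for item_id, similar in indicator.items():
            doc = docs.setdefault(item_id, {"id": item_id})
            doc[field] = " ".join(similar)  # space-separated so it tokenizes on whitespace
    return docs


# Made-up sample data standing in for the three indicator outputs.
indicators = {
    "purchase": read_indicator(["item1\titem2 item3"]),
    "view": read_indicator(["item1\titem4", "item2\titem1"]),
    "tags": read_indicator(["item3\titem1 item2"]),
}
for doc in build_documents(indicators).values():
    print(doc)
```

+<p>Items that appear in only some indicators simply get fewer fields, which is what lets new items with tags but no purchase or view data still be matched.</p>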
+<h2 id="multimodal-recommender-query">Multimodal Recommender Query</h2>
+<p>The actual form of the query for recommendations will vary depending on 
your search engine but the intent is the same. For a given user, map their 
history of an action or content to the correct indicator field and perform an 
OR'd query. </p>
+<p>We have 3 indicators, which are indexed by the search engine into 3 fields that we'll call "purchase", "view", and "tags".
+We take the user's history that corresponds to each indicator and create a query of the form:</p>
+<div class="codehilite"><pre><span class="n">Query</span><span 
class="o">:</span>
+  <span class="n">field</span><span class="o">:</span> <span 
class="n">purchase</span><span class="o">;</span> <span class="n">q</span><span 
class="o">:</span><span class="n">user</span><span 
class="s1">&#39;s-purchase-history</span>
+<span class="s1">  field: view; q:user&#39;</span><span class="n">s</span> 
<span class="n">view</span><span class="o">-</span><span 
class="n">history</span>
+  <span class="n">field</span><span class="o">:</span> <span 
class="n">tags</span><span class="o">;</span> <span class="n">q</span><span 
class="o">:</span><span class="n">user</span><span 
class="err">&#39;</span><span class="n">s</span><span class="o">-</span><span 
class="n">tags</span><span class="o">-</span><span 
class="n">associated</span><span class="o">-</span><span 
class="k">with</span><span class="o">-</span><span class="n">purchases</span>
+</pre></div>
+
+
+<p>The query will result in an ordered list of items recommended for purchase 
but skewed towards items with similar tags to 
+the ones the user has already purchased. </p>
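+<p>As one possible concrete form, the Python sketch below builds such an OR'd query as an Elasticsearch-style JSON body, using a "bool" query with one "match" clause per indicator field. It assumes the indicator fields are indexed as whitespace-separated text; the function name <em>build_query</em> and the sample history values are hypothetical, and other search engines will need a different syntax.</p>

```python
import json


def build_query(history, size=10):
    """Build an OR'd query: one 'match' clause per indicator field with history."""
    should = [
        {"match": {field: " ".join(values)}}
        for field, values in history.items()
        if values  # skip indicators the user has no history for
    ]
    return {"size": size, "query": {"bool": {"should": should}}}


# Made-up user history, keyed by indicator field.
history = {
    "purchase": ["item1", "item7"],
    "view": ["item3"],
    "tags": ["chambray", "casual"],
}
print(json.dumps(build_query(history), indent=2))
```

+<p>Because empty histories are skipped, the same function degrades to a tags-only or view-only query for users with little recorded behavior.</p>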
+<p>This is only an example and not necessarily the optimal way to create recs. 
It illustrates how business decisions can be 
+translated into recommendations. This technique can be used to skew 
recommendations towards intrinsic indicators also. 
+For instance you may want to put personalized popular item recs in a special 
place in the UI. Create a popularity indicator 
+by tagging items with some category of popularity (hot, warm, cold for 
instance) then
+index that as a new indicator field and include the corresponding value in a 
query 
+on the popularity field. If we use the ecom example but use the query to get 
"hot" recommendations it might look like this:</p>
+<div class="codehilite"><pre><span class="n">Query</span><span 
class="o">:</span>
+  <span class="n">field</span><span class="o">:</span> <span 
class="n">purchase</span><span class="o">;</span> <span class="n">q</span><span 
class="o">:</span><span class="n">user</span><span 
class="s1">&#39;s-purchase-history</span>
+<span class="s1">  field: view; q:user&#39;</span><span class="n">s</span> 
<span class="n">view</span><span class="o">-</span><span 
class="n">history</span>
+  <span class="n">field</span><span class="o">:</span> <span 
class="n">popularity</span><span class="o">;</span> <span 
class="n">q</span><span class="o">:</span><span 
class="s2">&quot;hot&quot;</span>
+</pre></div>
+
+
+<p>This will return recommendations favoring ones that have the intrinsic 
indicator "hot".</p>
+<h2 id="notes">Notes</h2>
+<ol>
+<li>Use as much user action history as you can gather. Choose a primary action 
that is closest to what you want to recommend and the others will be used to 
create cross-cooccurrence indicators. Using more data in this fashion will 
almost always produce better recommendations.</li>
+<li>Content can be used where there is no recorded user behavior or when items change too quickly to get much interaction history. Content indicators can be used alone or mixed with other indicators.</li>
+<li>Most search engines support "boost" factors so you can favor one or more 
indicators. In the example query, if you want tags to only have a small effect 
you could boost the CF indicators.</li>

[... 33 lines stripped ...]
