Author: buildbot
Date: Fri Aug 29 17:44:17 2014
New Revision: 920731

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Fri Aug 29 17:44:17 2014
@@ -1 +1 @@
-1621350
+1621351

Modified: 
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
 Fri Aug 29 17:44:17 2014
@@ -252,242 +252,298 @@
 <p><em>spark-itemsimilarity</em> is the Spark counterpart of the Mahout 
mapreduce job called <em>itemsimilarity</em>. It takes in elements of 
interactions, which have userID, itemID, and optionally a value. It will 
produce one or more indicator matrices created by comparing every user's 
interactions with every other user. The indicator matrix is an item x item 
matrix where the values are log-likelihood ratio strengths. For the legacy 
mapreduce version, there were several possible similarity measures but these 
are being deprecated in favor of LLR because in practice it performs the 
best.</p>
 <p>Mahout's mapreduce version of itemsimilarity takes a text file that is 
expected to have user and item IDs that conform to Mahout's ID 
requirements--they are non-negative integers that can be viewed as row and 
column numbers in a matrix.</p>
 <p><em>spark-itemsimilarity</em> also extends the notion of cooccurrence to 
cross-cooccurrence, in other words the Spark version will account for 
multi-modal interactions and create cross-indicator matrices allowing users to 
make use of much more data in creating recommendations or similar item 
lists.</p>
-<p>```
-spark-itemsimilarity Mahout 1.0-SNAPSHOT
-Usage: spark-itemsimilarity [options]</p>
-<p>Input, output options
-  -i <value> | --input <value>
-        Input path, may be a filename, directory name, or comma delimited list 
of 
-        HDFS supported URIs (required)
-  -i2 <value> | --input2 <value>
-        Secondary input path for cross-similarity calculation, same 
restrictions 
-        as "--input" (optional). Default: empty.
-  -o <value> | --output <value>
-        Path for output, any local or HDFS supported URI (required)</p>
-<p>Algorithm control options:
-  -mppu <value> | --maxPrefs <value>
-        Max number of preferences to consider per user (optional). Default: 500
-  -m <value> | --maxSimilaritiesPerItem <value>
-        Limit the number of similarities per item to this number (optional). 
-        Default: 100</p>
-<p>Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity 
measure.</p>
-<p>Input text file schema options:
-  -id <value> | --inDelim <value>
-        Input delimiter character (optional). Default: "[,\t]"
-  -f1 <value> | --filter1 <value>
-        String (or regex) whose presence indicates a datum for the primary 
item 
-        set (optional). Default: no filter, all data is used
-  -f2 <value> | --filter2 <value>
-        String (or regex) whose presence indicates a datum for the secondary 
item 
-        set (optional). If not present no secondary dataset is collected
-  -rc <value> | --rowIDPosition <value>
-        Column number (0 based Int) containing the row ID string (optional). 
-        Default: 0
-  -ic <value> | --itemIDPosition <value>
-        Column number (0 based Int) containing the item ID string (optional). 
-        Default: 1
-  -fc <value> | --filterPosition <value>
-        Column number (0 based Int) containing the filter string (optional). 
-        Default: -1 for no filter</p>
-<p>Using all defaults the input is expected of the form: "userID<tab>itemId" 
or "userID<tab>itemID<tab>any-text..." and all rows will be used</p>
-<p>File discovery options:
-  -r | --recursive
-        Searched the -i path recursively for files that match 
--filenamePattern 
-        (optional), default: false
-  -fp <value> | --filenamePattern <value>
-        Regex to match in determining input files (optional). Default: 
filename 
-        in the --input option or "^part-.*" if --input is a directory</p>
-<p>Output text file schema options:
-  -rd <value> | --rowKeyDelim <value>
-        Separates the rowID key from the vector values list (optional). 
Default: 
-\t"
-  -cd <value> | --columnIdStrengthDelim <value>
-        Separates column IDs from their values in the vector values list 
(optional). 
-        Default: ":"
-  -td <value> | --elementDelim <value>
-        Separates vector element values in the values list (optional). 
Default: " "
-  -os | --omitStrength
-        Do not write the strength to the output files (optional), Default: 
false.
-        This option is used to output indexable data for creating a search 
engine 
-        recommender.</p>
-<p>Default delimiters will produce output of the form: 
"itemID1<tab>itemID2:value2<space>itemID10:value10..."</p>
-<p>Spark config options:
-  -ma <value> | --master <value>
-        Spark Master URL (optional). Default: "local". Note that you can 
specify 
-        the number of cores to get a performance improvement, for example 
"local[4]"
-  -sem <value> | --sparkExecutorMem <value>
-        Max Java heap available as "executor memory" on each node (optional). 
-        Default: 4g</p>
-<p>General config options:
-  -rs <value> | --randomSeed <value></p>
-<p>-h | --help
-        prints this usage text
-```</p>
+<div class="codehilite"><pre><span class="n">spark</span><span 
class="o">-</span><span class="n">itemsimilarity</span> <span 
class="n">Mahout</span> 1<span class="p">.</span>0<span class="o">-</span><span 
class="n">SNAPSHOT</span>
+<span class="n">Usage</span><span class="p">:</span> <span 
class="n">spark</span><span class="o">-</span><span 
class="n">itemsimilarity</span> <span class="p">[</span><span 
class="n">options</span><span class="p">]</span>
+
+<span class="n">Input</span><span class="p">,</span> <span 
class="n">output</span> <span class="n">options</span>
+  <span class="o">-</span><span class="nb">i</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">input</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Input</span> <span class="n">path</span><span 
class="p">,</span> <span class="n">may</span> <span class="n">be</span> <span 
class="n">a</span> <span class="n">filename</span><span class="p">,</span> 
<span class="n">directory</span> <span class="n">name</span><span 
class="p">,</span> <span class="n">or</span> <span class="n">comma</span> <span 
class="n">delimited</span> <span class="n">list</span> <span 
class="n">of</span> 
+        <span class="n">HDFS</span> <span class="n">supported</span> <span 
class="n">URIs</span> <span class="p">(</span><span 
class="n">required</span><span class="p">)</span>
+  <span class="o">-</span><span class="n">i2</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">input2</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Secondary</span> <span class="n">input</span> <span 
class="n">path</span> <span class="k">for</span> <span 
class="nb">cross</span><span class="o">-</span><span 
class="n">similarity</span> <span class="n">calculation</span><span 
class="p">,</span> <span class="n">same</span> <span 
class="n">restrictions</span> 
+        <span class="n">as</span> &quot;<span class="o">--</span><span 
class="n">input</span>&quot; <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> <span 
class="n">empty</span><span class="p">.</span>
+  <span class="o">-</span><span class="n">o</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">output</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Path</span> <span class="k">for</span> <span 
class="n">output</span><span class="p">,</span> <span class="n">any</span> 
<span class="n">local</span> <span class="n">or</span> <span 
class="n">HDFS</span> <span class="n">supported</span> <span 
class="n">URI</span> <span class="p">(</span><span 
class="n">required</span><span class="p">)</span>
+
+<span class="n">Algorithm</span> <span class="n">control</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">mppu</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">maxPrefs</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Max</span> <span class="n">number</span> <span 
class="n">of</span> <span class="n">preferences</span> <span 
class="n">to</span> <span class="n">consider</span> <span class="n">per</span> 
<span class="n">user</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> 500
+  <span class="o">-</span><span class="n">m</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">maxSimilaritiesPerItem</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Limit</span> <span class="n">the</span> <span 
class="n">number</span> <span class="n">of</span> <span 
class="n">similarities</span> <span class="n">per</span> <span 
class="n">item</span> <span class="n">to</span> <span class="n">this</span> 
<span class="n">number</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> 
+        <span class="n">Default</span><span class="p">:</span> 100
+
+<span class="n">Note</span><span class="p">:</span> <span 
class="n">Only</span> <span class="n">the</span> <span class="n">Log</span> 
<span class="n">Likelihood</span> <span class="n">Ratio</span> <span 
class="p">(</span><span class="n">LLR</span><span class="p">)</span> <span 
class="n">is</span> <span class="n">supported</span> <span class="n">as</span> 
<span class="n">a</span> <span class="n">similarity</span> <span 
class="n">measure</span><span class="p">.</span>
+
+<span class="n">Input</span> <span class="n">text</span> <span 
class="n">file</span> <span class="n">schema</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">id</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">inDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Input</span> <span class="n">delimiter</span> <span 
class="n">character</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> &quot;<span 
class="p">[,</span><span class="o">\</span><span class="n">t</span><span 
class="p">]</span>&quot;
+  <span class="o">-</span><span class="n">f1</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">filter1</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">String</span> <span class="p">(</span><span 
class="n">or</span> <span class="n">regex</span><span class="p">)</span> <span 
class="n">whose</span> <span class="n">presence</span> <span 
class="n">indicates</span> <span class="n">a</span> <span 
class="n">datum</span> <span class="k">for</span> <span class="n">the</span> 
<span class="n">primary</span> <span class="n">item</span> 
+        <span class="n">set</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> <span class="n">no</span> 
<span class="n">filter</span><span class="p">,</span> <span 
class="n">all</span> <span class="n">data</span> <span class="n">is</span> 
<span class="n">used</span>
+  <span class="o">-</span><span class="n">f2</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">filter2</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">String</span> <span class="p">(</span><span 
class="n">or</span> <span class="n">regex</span><span class="p">)</span> <span 
class="n">whose</span> <span class="n">presence</span> <span 
class="n">indicates</span> <span class="n">a</span> <span 
class="n">datum</span> <span class="k">for</span> <span class="n">the</span> 
<span class="n">secondary</span> <span class="n">item</span> 
+        <span class="n">set</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span class="n">If</span> 
<span class="n">not</span> <span class="n">present</span> <span 
class="n">no</span> <span class="n">secondary</span> <span 
class="n">dataset</span> <span class="n">is</span> <span 
class="n">collected</span>
+  <span class="o">-</span><span class="n">rc</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">rowIDPosition</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Column</span> <span class="n">number</span> <span 
class="p">(</span>0 <span class="n">based</span> <span 
class="n">Int</span><span class="p">)</span> <span class="n">containing</span> 
<span class="n">the</span> <span class="n">row</span> <span class="n">ID</span> 
<span class="n">string</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> 
+        <span class="n">Default</span><span class="p">:</span> 0
+  <span class="o">-</span><span class="n">ic</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">itemIDPosition</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Column</span> <span class="n">number</span> <span 
class="p">(</span>0 <span class="n">based</span> <span 
class="n">Int</span><span class="p">)</span> <span class="n">containing</span> 
<span class="n">the</span> <span class="n">item</span> <span 
class="n">ID</span> <span class="n">string</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> 
+        <span class="n">Default</span><span class="p">:</span> 1
+  <span class="o">-</span><span class="n">fc</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">filterPosition</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Column</span> <span class="n">number</span> <span 
class="p">(</span>0 <span class="n">based</span> <span 
class="n">Int</span><span class="p">)</span> <span class="n">containing</span> 
<span class="n">the</span> <span class="n">filter</span> <span 
class="n">string</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> 
+        <span class="n">Default</span><span class="p">:</span> <span 
class="o">-</span>1 <span class="k">for</span> <span class="n">no</span> <span 
class="n">filter</span>
+
+<span class="n">Using</span> <span class="n">all</span> <span 
class="n">defaults</span> <span class="n">the</span> <span 
class="n">input</span> <span class="n">is</span> <span 
class="n">expected</span> <span class="n">of</span> <span class="n">the</span> 
<span class="n">form</span><span class="p">:</span> &quot;<span 
class="n">userID</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span><span 
class="n">itemId</span>&quot; <span class="n">or</span> &quot;<span 
class="n">userID</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span><span 
class="n">itemID</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span><span class="n">any</span><span 
class="o">-</span><span class="n">text</span><span class="p">...</span>&quot; 
<span class="n">and</span> <span class="n">all</span> <span 
class="n">rows</span> <span class="n">will</span> <span class="n">be</span> 
<span class="n">used</span>
+
+<span class="n">File</span> <span class="n">discovery</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">r</span> <span class="o">|</span> 
<span class="o">--</span><span class="n">recursive</span>
+        <span class="n">Searched</span> <span class="n">the</span> <span 
class="o">-</span><span class="nb">i</span> <span class="n">path</span> <span 
class="n">recursively</span> <span class="k">for</span> <span 
class="n">files</span> <span class="n">that</span> <span class="n">match</span> 
<span class="o">--</span><span class="n">filenamePattern</span> 
+        <span class="p">(</span><span class="n">optional</span><span 
class="p">),</span> <span class="n">default</span><span class="p">:</span> 
<span class="n">false</span>
+  <span class="o">-</span><span class="n">fp</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">filenamePattern</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Regex</span> <span class="n">to</span> <span 
class="n">match</span> <span class="n">in</span> <span 
class="n">determining</span> <span class="n">input</span> <span 
class="n">files</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> <span 
class="n">filename</span> 
+        <span class="n">in</span> <span class="n">the</span> <span 
class="o">--</span><span class="n">input</span> <span class="n">option</span> 
<span class="n">or</span> &quot;^<span class="n">part</span><span 
class="o">-.*</span>&quot; <span class="k">if</span> <span 
class="o">--</span><span class="n">input</span> <span class="n">is</span> <span 
class="n">a</span> <span class="n">directory</span>
+
+<span class="n">Output</span> <span class="n">text</span> <span 
class="n">file</span> <span class="n">schema</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">rd</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">rowKeyDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Separates</span> <span class="n">the</span> <span 
class="n">rowID</span> <span class="n">key</span> <span class="n">from</span> 
<span class="n">the</span> <span class="n">vector</span> <span 
class="n">values</span> <span class="n">list</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">).</span> 
<span class="n">Default</span><span class="p">:</span> 
+<span class="o">\</span><span class="n">t</span>&quot;
+  <span class="o">-</span><span class="n">cd</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">columnIdStrengthDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Separates</span> <span class="n">column</span> <span 
class="n">IDs</span> <span class="n">from</span> <span class="n">their</span> 
<span class="n">values</span> <span class="n">in</span> <span 
class="n">the</span> <span class="n">vector</span> <span 
class="n">values</span> <span class="n">list</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">).</span> 
+        <span class="n">Default</span><span class="p">:</span> &quot;<span 
class="p">:</span>&quot;
+  <span class="o">-</span><span class="n">td</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">elementDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Separates</span> <span class="n">vector</span> <span 
class="n">element</span> <span class="n">values</span> <span 
class="n">in</span> <span class="n">the</span> <span class="n">values</span> 
<span class="n">list</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> &quot; &quot;
+  <span class="o">-</span><span class="n">os</span> <span class="o">|</span> 
<span class="o">--</span><span class="n">omitStrength</span>
+        <span class="n">Do</span> <span class="n">not</span> <span 
class="n">write</span> <span class="n">the</span> <span 
class="n">strength</span> <span class="n">to</span> <span class="n">the</span> 
<span class="n">output</span> <span class="n">files</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">),</span> 
<span class="n">Default</span><span class="p">:</span> <span 
class="n">false</span><span class="p">.</span>
+        <span class="n">This</span> <span class="n">option</span> <span 
class="n">is</span> <span class="n">used</span> <span class="n">to</span> <span 
class="n">output</span> <span class="n">indexable</span> <span 
class="n">data</span> <span class="k">for</span> <span 
class="n">creating</span> <span class="n">a</span> <span 
class="n">search</span> <span class="n">engine</span> 
+        <span class="n">recommender</span><span class="p">.</span>
+
+<span class="n">Default</span> <span class="n">delimiters</span> <span 
class="n">will</span> <span class="n">produce</span> <span 
class="n">output</span> <span class="n">of</span> <span class="n">the</span> 
<span class="n">form</span><span class="p">:</span> &quot;<span 
class="n">itemID1</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span><span 
class="n">itemID2</span><span class="p">:</span><span 
class="n">value2</span><span class="o">&lt;</span><span 
class="n">space</span><span class="o">&gt;</span><span 
class="n">itemID10</span><span class="p">:</span><span 
class="n">value10</span><span class="p">...</span>&quot;
+
+<span class="n">Spark</span> <span class="n">config</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">ma</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">master</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Spark</span> <span class="n">Master</span> <span 
class="n">URL</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> &quot;<span 
class="n">local</span>&quot;<span class="p">.</span> <span 
class="n">Note</span> <span class="n">that</span> <span class="n">you</span> 
<span class="n">can</span> <span class="n">specify</span> 
+        <span class="n">the</span> <span class="n">number</span> <span 
class="n">of</span> <span class="n">cores</span> <span class="n">to</span> 
<span class="n">get</span> <span class="n">a</span> <span 
class="n">performance</span> <span class="n">improvement</span><span 
class="p">,</span> <span class="k">for</span> <span class="n">example</span> 
&quot;<span class="n">local</span><span class="p">[</span>4<span 
class="p">]</span>&quot;
+  <span class="o">-</span><span class="n">sem</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">sparkExecutorMem</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Max</span> <span class="n">Java</span> <span 
class="n">heap</span> <span class="n">available</span> <span 
class="n">as</span> &quot;<span class="n">executor</span> <span 
class="n">memory</span>&quot; <span class="n">on</span> <span 
class="n">each</span> <span class="n">node</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> 
+        <span class="n">Default</span><span class="p">:</span> 4<span 
class="n">g</span>
+
+<span class="n">General</span> <span class="n">config</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">rs</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">randomSeed</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+
+  <span class="o">-</span><span class="n">h</span> <span class="o">|</span> 
<span class="o">--</span><span class="n">help</span>
+        <span class="n">prints</span> <span class="n">this</span> <span 
class="n">usage</span> <span class="n">text</span>
+</pre></div>
+
+
 <p>This looks daunting, but the defaults are simple, fairly sane values that 
accept exactly the same input as the legacy code, and the options are quite flexible. It allows the 
user to point to a single text file, a directory full of files, or a tree of 
directories to be traversed recursively. The files included can be specified 
with either a regex-style pattern or filename. The schema for the file is 
defined by column numbers, which map to the important bits of data including 
IDs and values. The files can even contain filters, which allow unneeded rows 
to be discarded or used for cross-cooccurrence calculations.</p>
 <p>See ItemSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. </p>
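<p>The file discovery rules in the usage text (<code>--recursive</code> and <code>--filenamePattern</code>) can be approximated outside Mahout as a rough sketch. The function name <code>discover_input_files</code> is hypothetical, it handles a single local path only (not comma-delimited lists or HDFS URIs), and it is not taken from ItemSimilarityDriver.scala:</p>

```python
import os
import re


def discover_input_files(input_path, filename_pattern=None, recursive=False):
    """Hypothetical helper that mimics the documented file discovery options.

    Assumption: a plain file is used as-is, and a directory is searched for
    names matching filename_pattern, which falls back to "^part-.*" as the
    usage text describes for directory input.
    """
    if os.path.isfile(input_path):
        return [input_path]

    pattern = re.compile(filename_pattern or r"^part-.*")
    found = []
    for dirpath, _dirnames, filenames in os.walk(input_path):
        for name in filenames:
            if pattern.search(name):
                found.append(os.path.join(dirpath, name))
        if not recursive:
            # Without --recursive only the top-level directory is examined.
            break
    return sorted(found)
```

<p>Calling <code>discover_input_files("in-dir", recursive=True)</code> would return every <code>part-*</code> file below the hypothetical directory <code>in-dir</code>.</p>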
 <h3 id="defaults-in-the-spark-itemsimilarity-cli">Defaults in the 
<em>spark-itemsimilarity</em> CLI</h3>
 <p>If all defaults are used the input can be as simple as:</p>
-<p><code>userID1,itemID1
-userID2,itemID2
-...</code></p>
+<div class="codehilite"><pre><span class="n">userID1</span><span 
class="p">,</span><span class="n">itemID1</span>
+<span class="n">userID2</span><span class="p">,</span><span 
class="n">itemID2</span>
+<span class="p">...</span>
+</pre></div>
+
+
 <p>With the command line:</p>
-<p><code>bash$ mahout spark-itemsimilarity --input in-file --output 
out-dir</code></p>
+<div class="codehilite"><pre><span class="n">bash</span>$ <span 
class="n">mahout</span> <span class="n">spark</span><span 
class="o">-</span><span class="n">itemsimilarity</span> <span 
class="o">--</span><span class="n">input</span> <span class="n">in</span><span 
class="o">-</span><span class="n">file</span> <span class="o">--</span><span 
class="n">output</span> <span class="n">out</span><span class="o">-</span><span 
class="n">dir</span>
+</pre></div>
+
+
 <p>This will use the "local" Spark context and will output the standard text 
version of a DRM:</p>
-<p><code>itemID1&lt;tab&gt;itemID2:value2&lt;space&gt;itemID10:value10...</code></p>
+<div class="codehilite"><pre><span class="n">itemID1</span><span 
class="o">&lt;</span><span class="n">tab</span><span class="o">&gt;</span><span 
class="n">itemID2</span><span class="p">:</span><span 
class="n">value2</span><span class="o">&lt;</span><span 
class="n">space</span><span class="o">&gt;</span><span 
class="n">itemID10</span><span class="p">:</span><span 
class="n">value10</span><span class="p">...</span>
+</pre></div>
+
+
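<p>A downstream consumer of this output might read it back with something like the following minimal sketch. It assumes the default delimiters listed in the usage text (tab, ":" and space); the function name <code>parse_indicator_line</code> is hypothetical and not part of Mahout:</p>

```python
def parse_indicator_line(line, row_key_delim="\t",
                         column_id_strength_delim=":", element_delim=" "):
    """Parse one line of the text DRM into (item_id, [(similar_id, strength)]).

    Assumption: delimiters match the documented defaults. When the output was
    written with --omitStrength, elements carry no strength and None is used.
    """
    row_id, _, rest = line.rstrip("\n").partition(row_key_delim)
    similar = []
    for element in rest.split(element_delim):
        if not element:
            continue  # skip empty pieces caused by repeated delimiters
        item_id, sep, strength = element.rpartition(column_id_strength_delim)
        if sep:
            similar.append((item_id, float(strength)))
        else:
            # No strength delimiter found: the whole element is the item ID.
            similar.append((element, None))
    return row_id, similar
```

<p>The parser splits on the last ":" in each element so that item IDs containing the delimiter are still handled, which is a design choice of this sketch rather than documented behavior.</p>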
 <h3 id="more-complex-input">More Complex Input</h3>
 <p>For input of the form:</p>
-<p><code>u1,purchase,iphone
-u1,purchase,ipad
-u2,purchase,nexus
-u2,purchase,galaxy
-u3,purchase,surface
-u4,purchase,iphone
-u4,purchase,galaxy
-u1,view,iphone
-u1,view,ipad
-u1,view,nexus
-u1,view,galaxy
-u2,view,iphone
-u2,view,ipad
-u2,view,nexus
-u2,view,galaxy
-u3,view,surface
-u3,view,nexus
-u4,view,iphone
-u4,view,ipad
-u4,view,galaxy</code></p>
+<div class="codehilite"><pre><span class="n">u1</span><span 
class="p">,</span><span class="n">purchase</span><span class="p">,</span><span 
class="n">iphone</span>
+<span class="n">u1</span><span class="p">,</span><span 
class="n">purchase</span><span class="p">,</span><span class="n">ipad</span>
+<span class="n">u2</span><span class="p">,</span><span 
class="n">purchase</span><span class="p">,</span><span class="n">nexus</span>
+<span class="n">u2</span><span class="p">,</span><span 
class="n">purchase</span><span class="p">,</span><span class="n">galaxy</span>
+<span class="n">u3</span><span class="p">,</span><span 
class="n">purchase</span><span class="p">,</span><span class="n">surface</span>
+<span class="n">u4</span><span class="p">,</span><span 
class="n">purchase</span><span class="p">,</span><span class="n">iphone</span>
+<span class="n">u4</span><span class="p">,</span><span 
class="n">purchase</span><span class="p">,</span><span class="n">galaxy</span>
+<span class="n">u1</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">iphone</span>
+<span class="n">u1</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">ipad</span>
+<span class="n">u1</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">nexus</span>
+<span class="n">u1</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">galaxy</span>
+<span class="n">u2</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">iphone</span>
+<span class="n">u2</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">ipad</span>
+<span class="n">u2</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">nexus</span>
+<span class="n">u2</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">galaxy</span>
+<span class="n">u3</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">surface</span>
+<span class="n">u3</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">nexus</span>
+<span class="n">u4</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">iphone</span>
+<span class="n">u4</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">ipad</span>
+<span class="n">u4</span><span class="p">,</span><span 
class="n">view</span><span class="p">,</span><span class="n">galaxy</span>
+</pre></div>
+
+
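<p>To make the role of the filter column concrete, the following sketch splits input like the lines above into a primary and a secondary set of (user, item) pairs. It only illustrates the semantics described in the usage text for <code>--filter1</code>, <code>--filter2</code> and the column position options; the function name <code>split_by_filter</code> is hypothetical and this is not Mahout's implementation:</p>

```python
import re


def split_by_filter(lines, filter1=None, filter2=None,
                    row_pos=0, item_pos=1, filter_pos=-1, in_delim=r"[,\t]"):
    """Split delimited interaction lines into primary and secondary pairs.

    Assumption: defaults mirror the documented CLI defaults, and a negative
    filter_pos (or no filter1) means every row goes to the primary set.
    """
    primary, secondary = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        cols = re.split(in_delim, line)
        pair = (cols[row_pos], cols[item_pos])
        if filter_pos < 0 or filter1 is None:
            primary.append(pair)
        elif re.search(filter1, cols[filter_pos]):
            primary.append(pair)
        elif filter2 is not None and re.search(filter2, cols[filter_pos]):
            secondary.append(pair)
        # Rows matching neither filter are discarded.
    return primary, secondary
```

<p>For the sample data, <code>split_by_filter(lines, "purchase", "view", row_pos=0, item_pos=2, filter_pos=1)</code> would put the purchase rows in the primary set and the view rows in the secondary set.</p>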
 <h3 id="command-line">Command Line</h3>
 <p>The following options can be used:</p>
-<p><code>bash$ mahout spark-itemsimilarity \
-    --input in-file \     # where to look for data
-    --output out-path \   # root dir for output
-    --master masterUrl \  # URL of the Spark master server
-    --filter1 purchase \  # word that flags input for the primary action
-    --filter2 view \      # word that flags input for the secondary action
-    --itemIDPosition 2 \  # column that has the item ID
-    --rowIDPosition 0 \   # column that has the user ID
-    --filterPosition 1    # column that has the filter word</code></p>
+<div class="codehilite"><pre><span class="n">bash</span>$ <span 
class="n">mahout</span> <span class="n">spark</span><span 
class="o">-</span><span class="n">itemsimilarity</span> <span class="o">\</span>
+    <span class="o">--</span><span class="n">input</span> <span 
class="n">in</span><span class="o">-</span><span class="n">file</span> <span 
class="o">\</span>     # <span class="n">where</span> <span class="n">to</span> 
<span class="n">look</span> <span class="k">for</span> <span 
class="n">data</span>
+    <span class="o">--</span><span class="n">output</span> <span 
class="n">out</span><span class="o">-</span><span class="n">path</span> <span 
class="o">\</span>   # <span class="n">root</span> <span class="n">dir</span> 
<span class="k">for</span> <span class="n">output</span>
+    <span class="o">--</span><span class="n">master</span> <span 
class="n">masterUrl</span> <span class="o">\</span>  # <span 
class="n">URL</span> <span class="n">of</span> <span class="n">the</span> <span 
class="n">Spark</span> <span class="n">master</span> <span 
class="n">server</span>
+    <span class="o">--</span><span class="n">filter1</span> <span 
class="n">purchase</span> <span class="o">\</span>  # <span 
class="n">word</span> <span class="n">that</span> <span class="n">flags</span> 
<span class="n">input</span> <span class="k">for</span> <span 
class="n">the</span> <span class="n">primary</span> <span 
class="n">action</span>
+    <span class="o">--</span><span class="n">filter2</span> <span 
class="n">view</span> <span class="o">\</span>      # <span 
class="n">word</span> <span class="n">that</span> <span class="n">flags</span> 
<span class="n">input</span> <span class="k">for</span> <span 
class="n">the</span> <span class="n">secondary</span> <span 
class="n">action</span>
+    <span class="o">--</span><span class="n">itemIDPosition</span> 2 <span 
class="o">\</span>  # <span class="n">column</span> <span class="n">that</span> 
<span class="n">has</span> <span class="n">the</span> <span 
class="n">item</span> <span class="n">ID</span>
+    <span class="o">--</span><span class="n">rowIDPosition</span> 0 <span 
class="o">\</span>   # <span class="n">column</span> <span 
class="n">that</span> <span class="n">has</span> <span class="n">the</span> 
<span class="n">user</span> <span class="n">ID</span>
+    <span class="o">--</span><span class="n">filterPosition</span> 1    # 
<span class="n">column</span> <span class="n">that</span> <span 
class="n">has</span> <span class="n">the</span> <span class="n">filter</span> 
<span class="n">word</span>
+</pre></div>
+
+
 <h3 id="output">Output</h3>
 <p>The output of the job will be the standard text version of two Mahout DRMs. 
Because this example calculates cross-cooccurrence, both a primary 
indicator matrix and a cross-indicator matrix will be created:</p>
-<p>```
-out-path
-  |-- indicator-matrix - TDF part files
-  -- cross-indicator-matrix - TDF part-files</p>
-<p>```
-The indicator matrix will contain the lines:</p>
-<p><code>galaxy\tnexus:1.7260924347106847
-ipad\tiphone:1.7260924347106847
-nexus\tgalaxy:1.7260924347106847
-iphone\tipad:1.7260924347106847
-surface</code></p>
+<div class="codehilite"><pre><span class="n">out</span><span 
class="o">-</span><span class="n">path</span>
+  <span class="o">|--</span> <span class="n">indicator</span><span 
class="o">-</span><span class="n">matrix</span> <span class="o">-</span> <span 
class="n">TDF</span> <span class="n">part</span> <span class="n">files</span>
+  <span class="o">\--</span> <span class="nb">cross</span><span 
class="o">-</span><span class="n">indicator</span><span class="o">-</span><span 
class="n">matrix</span> <span class="o">-</span> <span class="n">TDF</span> 
<span class="n">part</span><span class="o">-</span><span class="n">files</span>
+</pre></div>
+
+
+<p>The indicator matrix will contain the lines:</p>
+<div class="codehilite"><pre><span class="n">galaxy</span><span 
class="o">\</span><span class="n">tnexus</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847
+<span class="n">ipad</span><span class="o">\</span><span 
class="n">tiphone</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847
+<span class="n">nexus</span><span class="o">\</span><span 
class="n">tgalaxy</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847
+<span class="n">iphone</span><span class="o">\</span><span 
class="n">tipad</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847
+<span class="n">surface</span>
+</pre></div>
+
+
 <p>The cross-indicator matrix will contain:</p>
-<p><code>iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 
ipad:1.7260924347106847 galaxy:1.7260924347106847
-ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 
ipad:0.6795961471815897 galaxy:0.6795961471815897
-nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 
ipad:0.6795961471815897 galaxy:0.6795961471815897
-galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 
ipad:1.7260924347106847 galaxy:1.7260924347106847
-surface\tsurface:4.498681156950466 nexus:0.6795961471815897</code></p>
+<div class="codehilite"><pre><span class="n">iphone</span><span 
class="o">\</span><span class="n">tnexus</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847 <span class="n">iphone</span><span 
class="p">:</span>1<span class="p">.</span>7260924347106847 <span 
class="n">ipad</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847 <span class="n">galaxy</span><span 
class="p">:</span>1<span class="p">.</span>7260924347106847
+<span class="n">ipad</span><span class="o">\</span><span 
class="n">tnexus</span><span class="p">:</span>0<span 
class="p">.</span>6795961471815897 <span class="n">iphone</span><span 
class="p">:</span>0<span class="p">.</span>6795961471815897 <span 
class="n">ipad</span><span class="p">:</span>0<span 
class="p">.</span>6795961471815897 <span class="n">galaxy</span><span 
class="p">:</span>0<span class="p">.</span>6795961471815897
+<span class="n">nexus</span><span class="o">\</span><span 
class="n">tnexus</span><span class="p">:</span>0<span 
class="p">.</span>6795961471815897 <span class="n">iphone</span><span 
class="p">:</span>0<span class="p">.</span>6795961471815897 <span 
class="n">ipad</span><span class="p">:</span>0<span 
class="p">.</span>6795961471815897 <span class="n">galaxy</span><span 
class="p">:</span>0<span class="p">.</span>6795961471815897
+<span class="n">galaxy</span><span class="o">\</span><span 
class="n">tnexus</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847 <span class="n">iphone</span><span 
class="p">:</span>1<span class="p">.</span>7260924347106847 <span 
class="n">ipad</span><span class="p">:</span>1<span 
class="p">.</span>7260924347106847 <span class="n">galaxy</span><span 
class="p">:</span>1<span class="p">.</span>7260924347106847
+<span class="n">surface</span><span class="o">\</span><span 
class="n">tsurface</span><span class="p">:</span>4<span 
class="p">.</span>498681156950466 <span class="n">nexus</span><span 
class="p">:</span>0<span class="p">.</span>6795961471815897
+</pre></div>
+
+
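<p>As an illustration of how these text DRM lines can be consumed downstream, 
here is a minimal Python sketch. It assumes the default delimiters (a tab 
after the row ID, a space between elements, and a colon before each strength) 
and that the <code>\t</code> shown above stands for a real tab character. The 
function name <code>parse_indicator_line</code> is hypothetical and is not part 
of Mahout.</p>

```python
def parse_indicator_line(line, row_delim="\t", element_delim=" ", strength_delim=":"):
    """Split one text-DRM line into (row_id, [(item_id, strength), ...]).

    Hypothetical helper, not a Mahout API. Delimiter defaults mirror the
    defaults described in this document.
    """
    row_id, _, rest = line.rstrip("\n").partition(row_delim)
    pairs = []
    for element in rest.split(element_delim):
        if not element:
            continue  # a row such as "surface" has no similar items
        if strength_delim in element:
            item_id, _, strength = element.rpartition(strength_delim)
            pairs.append((item_id, float(strength)))
        else:
            pairs.append((element, None))  # strength was omitted
    return row_id, pairs


if __name__ == "__main__":
    sample = "surface\tsurface:4.498681156950466 nexus:0.6795961471815897"
    print(parse_indicator_line(sample))
```

<p>A row with no similar items, such as <code>surface</code> in the indicator 
matrix above, parses to an empty list.</p>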
 <h3 id="log-file-input">Log File Input</h3>
 <p>A common method of storing data is in log files. If they are written using 
some delimiter they can be consumed directly by spark-itemsimilarity. For 
instance input of the form:</p>
-<p><code>2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone
-2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad
-2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus
-2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy
-2014-06-23 14:46:53.115\tu3\tpurchase\trandom text\tsurface
-2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tiphone
-2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tgalaxy
-2014-06-23 14:46:53.115\tu1\tview\trandom text\tiphone
-2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad
-2014-06-23 14:46:53.115\tu1\tview\trandom text\tnexus
-2014-06-23 14:46:53.115\tu1\tview\trandom text\tgalaxy
-2014-06-23 14:46:53.115\tu2\tview\trandom text\tiphone
-2014-06-23 14:46:53.115\tu2\tview\trandom text\tipad
-2014-06-23 14:46:53.115\tu2\tview\trandom text\tnexus
-2014-06-23 14:46:53.115\tu2\tview\trandom text\tgalaxy
-2014-06-23 14:46:53.115\tu3\tview\trandom text\tsurface
-2014-06-23 14:46:53.115\tu3\tview\trandom text\tnexus
-2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone
-2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad
-2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy</code></p>
+<div class="codehilite"><pre>2014<span class="o">-</span>06<span 
class="o">-</span>23 14<span class="p">:</span>46<span 
class="p">:</span>53<span class="p">.</span>115<span class="o">\</span><span 
class="n">tu1</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tiphone</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu1</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tipad</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu2</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tnexus</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu2</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tgalaxy</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu3</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tsurface</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu4</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tiphone</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu4</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tgalaxy</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu1</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tiphone</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu1</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span class="n">tipad</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu1</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span class="n">tnexus</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu1</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tgalaxy</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu2</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tiphone</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu2</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span class="n">tipad</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu2</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span class="n">tnexus</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu2</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tgalaxy</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu3</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tsurface</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu3</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span class="n">tnexus</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu4</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tiphone</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu4</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span class="n">tipad</span>
+2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu4</span><span class="o">\</span><span 
class="n">tview</span><span class="o">\</span><span class="n">trandom</span> 
<span class="n">text</span><span class="o">\</span><span 
class="n">tgalaxy</span>
+</pre></div>
+
+
 <p>This input can be parsed with the following CLI options and run on the 
cluster, producing the same output as the above example.</p>
-<p><code>bash$ mahout spark-itemsimilarity \
-    --input in-file \
-    --output out-path \
-    --master spark://sparkmaster:4044 \
-    --filter1 purchase \
-    --filter2 view \
-    --inDelim "\t" \
-    --itemIDPosition 4 \
-    --rowIDPosition 1 \
-    --filterPosition 2 \</code></p>
+<div class="codehilite"><pre><span class="n">bash</span>$ <span 
class="n">mahout</span> <span class="n">spark</span><span 
class="o">-</span><span class="n">itemsimilarity</span> <span class="o">\</span>
+    <span class="o">--</span><span class="n">input</span> <span 
class="n">in</span><span class="o">-</span><span class="n">file</span> <span 
class="o">\</span>
+    <span class="o">--</span><span class="n">output</span> <span 
class="n">out</span><span class="o">-</span><span class="n">path</span> <span 
class="o">\</span>
+    <span class="o">--</span><span class="n">master</span> <span 
class="n">spark</span><span class="p">:</span><span class="o">//</span><span 
class="n">sparkmaster</span><span class="p">:</span>4044 <span 
class="o">\</span>
+    <span class="o">--</span><span class="n">filter1</span> <span 
class="n">purchase</span> <span class="o">\</span>
+    <span class="o">--</span><span class="n">filter2</span> <span 
class="n">view</span> <span class="o">\</span>
+    <span class="o">--</span><span class="n">inDelim</span> &quot;<span 
class="o">\</span><span class="n">t</span>&quot; <span class="o">\</span>
+    <span class="o">--</span><span class="n">itemIDPosition</span> 4 <span 
class="o">\</span>
+    <span class="o">--</span><span class="n">rowIDPosition</span> 1 <span 
class="o">\</span>
+    <span class="o">--</span><span class="n">filterPosition</span> 2
+</pre></div>
+
+
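<p>To make the position options concrete, the following Python sketch shows 
how zero-based column positions select the user ID, the filter word, and the 
item ID from one tab-delimited log line. It only illustrates the meaning of 
the options and is not Mahout's implementation. The function names are 
hypothetical, and matching the filter word by exact equality is an 
assumption.</p>

```python
def extract_interaction(line, delim="\t", row_id_pos=1, filter_pos=2, item_id_pos=4):
    """Return (user_id, action, item_id) taken from zero-based column positions."""
    fields = line.rstrip("\n").split(delim)
    return fields[row_id_pos], fields[filter_pos], fields[item_id_pos]


def split_by_action(lines, primary="purchase", secondary="view"):
    """Group (user_id, item_id) pairs by action; other actions are ignored."""
    actions = {primary: [], secondary: []}
    for line in lines:
        user, action, item = extract_interaction(line)
        if action in actions:  # assumption: exact match on the filter word
            actions[action].append((user, item))
    return actions[primary], actions[secondary]


if __name__ == "__main__":
    log = [
        "2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone",
        "2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad",
    ]
    print(split_by_action(log))
```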
 <h2 id="2-spark-rowsimilarity">2. spark-rowsimilarity</h2>
 <p><em>spark-rowsimilarity</em> is the companion to 
<em>spark-itemsimilarity</em>. The primary difference is that it takes a text 
file version of a DRM with optional application-specific IDs. The input is in 
text-delimited form, using three delimiters. By default it reads 
(rowID&lt;tab&gt;columnID1:strength1&lt;space&gt;columnID2:strength2...). Since this job 
only supports LLR similarity, which does not use the input strengths, they may 
be omitted in the input. It writes 
(columnID&lt;tab&gt;columnID1:strength1&lt;space&gt;columnID2:strength2...). The output is 
sorted by strength, descending. The output can be interpreted as a column ID 
from the primary input followed by a list of the most similar columns. For a 
discussion of the output layout and formatting see 
<em>spark-itemsimilarity</em>.</p>
 <p>One significant output option is --omitStrength. This allows output of the 
form (columnID&lt;tab&gt;columnID1&lt;space&gt;columnID2&lt;space&gt;...). This is a tab-delimited 
file containing a columnID token followed by a space-delimited string of 
tokens. It can be directly indexed by search engines to create an item-based 
recommender.</p>
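<p>For readers who already have output written with strengths, the effect of 
omitting them can be approximated after the fact. The sketch below is a 
hypothetical Python helper, not part of Mahout, and assumes the default 
delimiters. It drops the strength from each element and keeps the element 
order unchanged.</p>

```python
def omit_strength(line, row_delim="\t", element_delim=" ", strength_delim=":"):
    """Rewrite 'id<TAB>a:1.0 b:0.5' as 'id<TAB>a b' (hypothetical helper)."""
    row_id, _, rest = line.rstrip("\n").partition(row_delim)
    ids = [
        # rpartition yields an empty head when no strength is present,
        # in which case the element is already a bare ID
        element.rpartition(strength_delim)[0] or element
        for element in rest.split(element_delim)
        if element
    ]
    return row_id + row_delim + element_delim.join(ids)


if __name__ == "__main__":
    print(omit_strength("surface\tsurface:4.498681156950466 nexus:0.6795961471815897"))
```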
 <p>The command line interface is:</p>
-<p>```
-spark-rowsimilarity Mahout 1.0-SNAPSHOT
-Usage: spark-rowsimilarity [options]</p>
-<p>Input, output options
-  -i <value> | --input <value>
-        Input path, may be a filename, directory name, or comma delimited list 
-        of HDFS supported URIs (required)
-  -i2 <value> | --input2 <value>
-        Secondary input path for cross-similarity calculation, same 
restrictions 
-        as "--input" (optional). Default: empty.
-  -o <value> | --output <value>
-        Path for output, any local or HDFS supported URI (required)</p>
-<p>Algorithm control options:
-  -mo <value> | --maxObservations <value>
-        Max number of observations to consider per row (optional). Default: 500
-  -m <value> | --maxSimilaritiesPerRow <value>
-        Limit the number of similarities per item to this number (optional). 
-        Default: 100</p>
-<p>Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity 
measure.</p>
-<p>Output text file schema options:
-  -rd <value> | --rowKeyDelim <value>
-        Separates the rowID key from the vector values list (optional). 
-        Default: "\t"
-  -cd <value> | --columnIdStrengthDelim <value>
-        Separates column IDs from their values in the vector values list 
-        (optional). Default: ":"
-  -td <value> | --elementDelim <value>
-        Separates vector element values in the values list (optional). 
-        Default: " "
-  -os | --omitStrength
-        Do not write the strength to the output files (optional), Default: 
-        false.
-This option is used to output indexable data for creating a search engine 
-recommender.</p>
-<p>Default delimiters will produce output of the form: 
"itemID1<tab>itemID2:value2<space>itemID10:value10..."</p>
-<p>File discovery options:
-  -r | --recursive
-        Searched the -i path recursively for files that match 
-        --filenamePattern (optional), Default: false
-  -fp <value> | --filenamePattern <value>
-        Regex to match in determining input files (optional). Default: 
-        filename in the --input option or "^part-.*" if --input is a 
directory</p>
-<p>Spark config options:
-  -ma <value> | --master <value>
-        Spark Master URL (optional). Default: "local". Note that you can 
-        specify the number of cores to get a performance improvement, for 
-        example "local[4]"
-  -sem <value> | --sparkExecutorMem <value>
-        Max Java heap available as "executor memory" on each node (optional). 
-        Default: 4g</p>
-<p>General config options:
-  -rs <value> | --randomSeed <value></p>
-<p>-h | --help
-        prints this usage text
-```
-See RowSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. </p>
+<div class="codehilite"><pre><span class="n">spark</span><span 
class="o">-</span><span class="n">rowsimilarity</span> <span 
class="n">Mahout</span> 1<span class="p">.</span>0<span class="o">-</span><span 
class="n">SNAPSHOT</span>
+<span class="n">Usage</span><span class="p">:</span> <span 
class="n">spark</span><span class="o">-</span><span 
class="n">rowsimilarity</span> <span class="p">[</span><span 
class="n">options</span><span class="p">]</span>
+
+<span class="n">Input</span><span class="p">,</span> <span 
class="n">output</span> <span class="n">options</span>
+  <span class="o">-</span><span class="nb">i</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">input</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Input</span> <span class="n">path</span><span 
class="p">,</span> <span class="n">may</span> <span class="n">be</span> <span 
class="n">a</span> <span class="n">filename</span><span class="p">,</span> 
<span class="n">directory</span> <span class="n">name</span><span 
class="p">,</span> <span class="n">or</span> <span class="n">comma</span> <span 
class="n">delimited</span> <span class="n">list</span> 
+        <span class="n">of</span> <span class="n">HDFS</span> <span 
class="n">supported</span> <span class="n">URIs</span> <span 
class="p">(</span><span class="n">required</span><span class="p">)</span>
+  <span class="o">-</span><span class="n">i2</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">input2</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Secondary</span> <span class="n">input</span> <span 
class="n">path</span> <span class="k">for</span> <span 
class="nb">cross</span><span class="o">-</span><span 
class="n">similarity</span> <span class="n">calculation</span><span 
class="p">,</span> <span class="n">same</span> <span 
class="n">restrictions</span> 
+        <span class="n">as</span> &quot;<span class="o">--</span><span 
class="n">input</span>&quot; <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> <span 
class="n">empty</span><span class="p">.</span>
+  <span class="o">-</span><span class="n">o</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">output</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Path</span> <span class="k">for</span> <span 
class="n">output</span><span class="p">,</span> <span class="n">any</span> 
<span class="n">local</span> <span class="n">or</span> <span 
class="n">HDFS</span> <span class="n">supported</span> <span 
class="n">URI</span> <span class="p">(</span><span 
class="n">required</span><span class="p">)</span>
+
+<span class="n">Algorithm</span> <span class="n">control</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">mo</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">maxObservations</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Max</span> <span class="n">number</span> <span 
class="n">of</span> <span class="n">observations</span> <span 
class="n">to</span> <span class="n">consider</span> <span class="n">per</span> 
<span class="n">row</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> 500
+  <span class="o">-</span><span class="n">m</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">maxSimilaritiesPerRow</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Limit</span> <span class="n">the</span> <span 
class="n">number</span> <span class="n">of</span> <span 
class="n">similarities</span> <span class="n">per</span> <span 
class="n">item</span> <span class="n">to</span> <span class="n">this</span> 
<span class="n">number</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> 
+        <span class="n">Default</span><span class="p">:</span> 100
+
+<span class="n">Note</span><span class="p">:</span> <span 
class="n">Only</span> <span class="n">the</span> <span class="n">Log</span> 
<span class="n">Likelihood</span> <span class="n">Ratio</span> <span 
class="p">(</span><span class="n">LLR</span><span class="p">)</span> <span 
class="n">is</span> <span class="n">supported</span> <span class="n">as</span> 
<span class="n">a</span> <span class="n">similarity</span> <span 
class="n">measure</span><span class="p">.</span>
+
+<span class="n">Output</span> <span class="n">text</span> <span 
class="n">file</span> <span class="n">schema</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">rd</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">rowKeyDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Separates</span> <span class="n">the</span> <span 
class="n">rowID</span> <span class="n">key</span> <span class="n">from</span> 
<span class="n">the</span> <span class="n">vector</span> <span 
class="n">values</span> <span class="n">list</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">).</span> 
+        <span class="n">Default</span><span class="p">:</span> &quot;<span 
class="o">\</span><span class="n">t</span>&quot;
+  <span class="o">-</span><span class="n">cd</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">columnIdStrengthDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Separates</span> <span class="n">column</span> <span 
class="n">IDs</span> <span class="n">from</span> <span class="n">their</span> 
<span class="n">values</span> <span class="n">in</span> <span 
class="n">the</span> <span class="n">vector</span> <span 
class="n">values</span> <span class="n">list</span> 
+        <span class="p">(</span><span class="n">optional</span><span 
class="p">).</span> <span class="n">Default</span><span class="p">:</span> 
&quot;<span class="p">:</span>&quot;
+  <span class="o">-</span><span class="n">td</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">elementDelim</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Separates</span> <span class="n">vector</span> <span 
class="n">element</span> <span class="n">values</span> <span 
class="n">in</span> <span class="n">the</span> <span class="n">values</span> 
<span class="n">list</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> 
+        <span class="n">Default</span><span class="p">:</span> &quot; &quot;
+  <span class="o">-</span><span class="n">os</span> <span class="o">|</span> 
<span class="o">--</span><span class="n">omitStrength</span>
+        <span class="n">Do</span> <span class="n">not</span> <span 
class="n">write</span> <span class="n">the</span> <span 
class="n">strength</span> <span class="n">to</span> <span class="n">the</span> 
<span class="n">output</span> <span class="n">files</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">),</span> 
<span class="n">Default</span><span class="p">:</span> 
+        <span class="n">false</span><span class="p">.</span>
+<span class="n">This</span> <span class="n">option</span> <span 
class="n">is</span> <span class="n">used</span> <span class="n">to</span> <span 
class="n">output</span> <span class="n">indexable</span> <span 
class="n">data</span> <span class="k">for</span> <span 
class="n">creating</span> <span class="n">a</span> <span 
class="n">search</span> <span class="n">engine</span> 
+<span class="n">recommender</span><span class="p">.</span>
+
+<span class="n">Default</span> <span class="n">delimiters</span> <span 
class="n">will</span> <span class="n">produce</span> <span 
class="n">output</span> <span class="n">of</span> <span class="n">the</span> 
<span class="n">form</span><span class="p">:</span> &quot;<span 
class="n">itemID1</span><span class="o">&lt;</span><span 
class="n">tab</span><span class="o">&gt;</span><span 
class="n">itemID2</span><span class="p">:</span><span 
class="n">value2</span><span class="o">&lt;</span><span 
class="n">space</span><span class="o">&gt;</span><span 
class="n">itemID10</span><span class="p">:</span><span 
class="n">value10</span><span class="p">...</span>&quot;
+
+<span class="n">File</span> <span class="n">discovery</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">r</span> <span class="o">|</span> 
<span class="o">--</span><span class="n">recursive</span>
+        <span class="n">Search</span> <span class="n">the</span> <span 
class="o">-</span><span class="nb">i</span> <span class="n">path</span> <span 
class="n">recursively</span> <span class="k">for</span> <span 
class="n">files</span> <span class="n">that</span> <span class="n">match</span> 
+        <span class="o">--</span><span class="n">filenamePattern</span> <span 
class="p">(</span><span class="n">optional</span><span class="p">),</span> 
<span class="n">Default</span><span class="p">:</span> <span 
class="n">false</span>
+  <span class="o">-</span><span class="n">fp</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">filenamePattern</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Regex</span> <span class="n">to</span> <span 
class="n">match</span> <span class="n">in</span> <span 
class="n">determining</span> <span class="n">input</span> <span 
class="n">files</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> 
+        <span class="n">filename</span> <span class="n">in</span> <span 
class="n">the</span> <span class="o">--</span><span class="n">input</span> 
<span class="n">option</span> <span class="n">or</span> &quot;^<span 
class="n">part</span><span class="o">-.*</span>&quot; <span class="k">if</span> 
<span class="o">--</span><span class="n">input</span> <span class="n">is</span> 
<span class="n">a</span> <span class="n">directory</span>
+
+<span class="n">Spark</span> <span class="n">config</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">ma</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span class="n">master</span> 
<span class="o">&lt;</span><span class="n">value</span><span 
class="o">&gt;</span>
+        <span class="n">Spark</span> <span class="n">Master</span> <span 
class="n">URL</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> <span 
class="n">Default</span><span class="p">:</span> &quot;<span 
class="n">local</span>&quot;<span class="p">.</span> <span 
class="n">Note</span> <span class="n">that</span> <span class="n">you</span> 
<span class="n">can</span> 
+        <span class="n">specify</span> <span class="n">the</span> <span 
class="n">number</span> <span class="n">of</span> <span class="n">cores</span> 
<span class="n">to</span> <span class="n">get</span> <span class="n">a</span> 
<span class="n">performance</span> <span class="n">improvement</span><span 
class="p">,</span> <span class="k">for</span> 
+        <span class="n">example</span> &quot;<span class="n">local</span><span 
class="p">[</span>4<span class="p">]</span>&quot;
+  <span class="o">-</span><span class="n">sem</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">sparkExecutorMem</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+        <span class="n">Max</span> <span class="n">Java</span> <span 
class="n">heap</span> <span class="n">available</span> <span 
class="n">as</span> &quot;<span class="n">executor</span> <span 
class="n">memory</span>&quot; <span class="n">on</span> <span 
class="n">each</span> <span class="n">node</span> <span class="p">(</span><span 
class="n">optional</span><span class="p">).</span> 
+        <span class="n">Default</span><span class="p">:</span> 4<span 
class="n">g</span>
+
+<span class="n">General</span> <span class="n">config</span> <span 
class="n">options</span><span class="p">:</span>
+  <span class="o">-</span><span class="n">rs</span> <span 
class="o">&lt;</span><span class="n">value</span><span class="o">&gt;</span> 
<span class="o">|</span> <span class="o">--</span><span 
class="n">randomSeed</span> <span class="o">&lt;</span><span 
class="n">value</span><span class="o">&gt;</span>
+
+  <span class="o">-</span><span class="n">h</span> <span class="o">|</span> 
<span class="o">--</span><span class="n">help</span>
+        <span class="n">prints</span> <span class="n">this</span> <span 
class="n">usage</span> <span class="n">text</span>
+</pre></div>
+
+
+<p>See RowSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. </p>
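+<p>As a rough illustration of the default output format described in the help text above (<code>itemID1&lt;tab&gt;itemID2:value2&lt;space&gt;itemID10:value10...</code>), the following Python sketch splits one output line into its parts. The function name <code>parse_indicator_line</code> and its parameter names are hypothetical and not part of Mahout; the delimiters are assumed to be the defaults shown above.</p>

```python
def parse_indicator_line(line, row_delim="\t", elem_delim=" ", value_delim=":"):
    """Split one output line into (item_id, [(similar_item_id, strength), ...]).

    Hypothetical helper, not part of Mahout. Assumes the default delimiters.
    """
    item_id, _, rest = line.rstrip("\n").partition(row_delim)
    pairs = []
    for token in rest.split(elem_delim):
        if not token:
            continue  # skip empty tokens produced by repeated delimiters
        # Split on the last value delimiter so the ID part stays intact.
        similar_id, sep, strength = token.rpartition(value_delim)
        if sep:
            pairs.append((similar_id, float(strength)))
        else:
            # No strength present, as in output written with --omitStrength.
            pairs.append((token, None))
    return item_id, pairs


if __name__ == "__main__":
    print(parse_indicator_line("itemID1\titemID2:2.5 itemID10:1.0"))
```

+<p>Lines written with --omitStrength carry no <code>:value</code> suffix, so the sketch returns <code>None</code> for the strength in that case.</p>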
 <h1 id="3-creating-a-recommender">3. Creating a Recommender</h1>
 <p>One significant output option for the spark-itemsimilarity job is 
--omitStrength. With it, the job writes a tab-delimited file in which each line 
contains an itemID token followed by a space-delimited string of tokens, in the form:</p>
-<p><code>itemID&lt;tab&gt;itemsIDs-from-the-indicator-matrix</code></p>
+<div class="codehilite"><pre><span class="n">itemID</span><span 
class="o">&lt;</span><span class="n">tab</span><span class="o">&gt;</span><span 
class="n">itemsIDs</span><span class="o">-</span><span 
class="n">from</span><span class="o">-</span><span class="n">the</span><span 
class="o">-</span><span class="n">indicator</span><span class="o">-</span><span 
class="n">matrix</span>
+</pre></div>
+
+
 <p>To create a cooccurrence-type collaborative filtering recommender using a 
search engine, simply index the output created with --omitStrength. Then, at 
runtime, query the indexed data with the current user's history of the primary 
action, against the index field that contains the primary indicator tokens. The 
result will be an ordered list of itemIDs as recommendations.</p>
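+<p>The idea behind the query can be sketched without a search engine. The Python below is only a stand-in: it ranks items by counting how many of their indicator tokens appear in the user's history, whereas a real search engine applies its own relevance scoring. The function name <code>recommend</code> and the in-memory dictionary are assumptions made for illustration.</p>

```python
from collections import Counter


def recommend(indicators, history, top_n=10):
    """Rank items by how many indicator tokens overlap the user's history.

    indicators: dict mapping itemID -> list of indicator itemIDs
                (one entry per line of the --omitStrength output).
    history:    list of itemIDs the user applied the primary action to.
    This simple overlap count is a stand-in for a search engine's ranking.
    """
    history_set = set(history)
    scores = Counter()
    for item_id, tokens in indicators.items():
        overlap = len(history_set.intersection(tokens))
        if overlap:
            scores[item_id] = overlap
    return [item for item, _ in scores.most_common(top_n)]


if __name__ == "__main__":
    index = {
        "ipad": ["iphone", "nexus"],
        "galaxy": ["nexus"],
        "surface": ["xbox"],
    }
    print(recommend(index, ["iphone", "nexus"]))
```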
 <p>It is possible to include the indicator strengths by attaching them to the 
tokens before indexing, but that is engine-specific and beyond the scope of this 
description. Omitting the weights generally provides good results because the 
indicators have been downsampled by strength, so the indicator matrix has some 
degree of quality guarantee. </p>
 <h2 id="multi-action-recommendations">Multi-action Recommendations</h2>
 <p>Optionally, the query can also contain the user's history of a secondary action 
(input with --input2), matched against the cross-indicator tokens as a second field.</p>
 <p>In this case the indicator-matrix and the cross-indicator-matrix should be 
combined and indexed as two fields. The data will be of the form:</p>
-<p><code>itemID, itemIDs-from-indicator-matrix, 
itemIDs-from-cross-indicator-matrix</code></p>
+<div class="codehilite"><pre><span class="n">itemID</span><span 
class="p">,</span> <span class="n">itemIDs</span><span class="o">-</span><span 
class="n">from</span><span class="o">-</span><span 
class="n">indicator</span><span class="o">-</span><span 
class="n">matrix</span><span class="p">,</span> <span 
class="n">itemIDs</span><span class="o">-</span><span 
class="n">from</span><span class="o">-</span><span class="nb">cross</span><span 
class="o">-</span><span class="n">indicator</span><span class="o">-</span><span 
class="n">matrix</span>
+</pre></div>
+
+
 <p>The query will now contain two strings, one with the user's primary action 
history and one with the user's secondary action history, each matched against 
its own field in the index.</p>
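+<p>A minimal sketch of the two-field layout follows, again in plain Python rather than a real search engine. The field names <code>indicators</code> and <code>cross_indicators</code>, the helper names, and the choice to sum the two overlap counts are all assumptions for illustration; an actual engine would combine the per-field scores in its own way.</p>

```python
def build_docs(indicators, cross_indicators):
    """Merge both matrices into one document per itemID with two fields."""
    docs = {}
    for item_id in set(indicators) | set(cross_indicators):
        docs[item_id] = {
            # Hypothetical field names; use whatever your index schema defines.
            "indicators": indicators.get(item_id, []),
            "cross_indicators": cross_indicators.get(item_id, []),
        }
    return docs


def score(doc, primary_history, secondary_history):
    """Count matches of each history against its own field and add them."""
    primary_hits = len(set(primary_history) & set(doc["indicators"]))
    secondary_hits = len(set(secondary_history) & set(doc["cross_indicators"]))
    return primary_hits + secondary_hits


if __name__ == "__main__":
    docs = build_docs(
        {"ipad": ["iphone", "nexus"]},
        {"ipad": ["galaxy"], "surface": ["xbox"]},
    )
    for item_id in sorted(docs):
        print(item_id, score(docs[item_id], ["iphone"], ["galaxy", "xbox"]))
```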
 <p>It is probably better to index the two (or more) fields as multi-valued 
fields (arrays) and query them as such, but the above works in much the same way 
if the indexed tokens are space-delimited, as is the query string. </p>
 <p><strong>Note:</strong> Using the underlying code it is possible to use as 
many actions as you have data for to create a multi-action recommender that 
makes the most of available data. The CLI only supports two actions.</p>

