[2/2] mahout git commit: WEBSITE Added LastFM tutorial and screenshots

rawkintrevo Mon, 01 May 2017 12:31:29 -0700

WEBSITE Added LastFM tutorial and screenshots


Project: http://git-wip-us.apache.org/repos/asf/mahout/repo
Commit: http://git-wip-us.apache.org/repos/asf/mahout/commit/b582dc52
Tree: http://git-wip-us.apache.org/repos/asf/mahout/tree/b582dc52
Diff: http://git-wip-us.apache.org/repos/asf/mahout/diff/b582dc52

Branch: refs/heads/website
Commit: b582dc5291512ace348540491db8671d7d41c0a2
Parents: 747d94b
Author: rawkintrevo <[email protected]>
Authored: Mon May 1 14:30:46 2017 -0500
Committer: rawkintrevo <[email protected]>
Committed: Mon May 1 14:30:46 2017 -0500

----------------------------------------------------------------------
 website/_config.yml                             | 128 ------
 website/docs/README.md                          |  78 +---
 website/docs/_config.yml                        |   2 +-
 website/docs/_includes/tutorial_navbar.html     |  20 +-
 .../reccomenders/intro-cooccurrence-spark.md    | 446 -------------------
 website/docs/native-solvers/cuda.md             |   4 +-
 website/docs/native-solvers/viennacl-omp.md     |   2 +-
 website/docs/native-solvers/viennacl.md         |   2 +-
 website/docs/screenshots/landing.png            | Bin 0 -> 205187 bytes
 website/docs/screenshots/mr-algos.png           | Bin 0 -> 316933 bytes
 website/docs/screenshots/tutorials.png          | Bin 0 -> 350203 bytes
 .../docs/tutorials/cco-lastfm/cco-lastfm.scala  |  89 ++++
 website/docs/tutorials/cco-lastfm/index.md      | 149 +++++++
 .../docs/tutorials/eigenfaces/eigenfaces.png    | Bin 0 -> 355453 bytes
 website/docs/tutorials/eigenfaces/index.md      |  17 +-
 .../tutorials/intro-cooccurrence-spark/index.md | 446 +++++++++++++++++++
 .../playing-with-samsara-flink-batch.md         |   2 +-
 website/front/README.md                         |  79 +---
 website/front/_config.yml                       |   2 +-
 website/front/screenshots/landing.png           | Bin 0 -> 338899 bytes
 20 files changed, 732 insertions(+), 734 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/_config.yml
----------------------------------------------------------------------
diff --git a/website/_config.yml b/website/_config.yml
deleted file mode 100644
index b30bef6..0000000
--- a/website/_config.yml
+++ /dev/null
@@ -1,128 +0,0 @@
-# This is the default format.
-# For more see: http://jekyllrb.com/docs/permalinks/
-permalink: /:categories/:year/:month/:day/:title
-
-exclude: [".rvmrc", ".rbenv-version", "README.md", "Rakefile", "changelog.md", "vendor", "node_modules", "scss"]
-#pygments: true
-highlighter: rouge
-markdown: kramdown
-redcarpet:
-  extensions: ["tables"]
-encoding: utf-8
-
-# Themes are encouraged to use these universal variables
-# so be sure to set them if your theme uses them.
-#
-title : Apache Mahout
-tagline: Distributed Linear Algebra
-author :
-  name : The Apache Software Foundation
-  email : [email protected]
-  github : apache
-  twitter : ASF
-  feedburner : feedname
-
-# Serving
-detach:  false
-port:    4000
-host:    127.0.0.1
-baseurl: "" # does not include hostname
-
-MAHOUT_VERSION : 0.13.1
-
-# The production_url is only used when full-domain names are needed
-# such as sitemap.txt
-# Most places will/should use BASE_PATH to make the urls
-#
-# If you have set a CNAME (pages.github.com) set your custom domain here.
-# Else if you are pushing to username.github.io, replace with your username.
-# Finally if you are pushing to a GitHub project page, include the project 
name at the end.
-#
-production_url : http://mahout.apache.org/
-# All Jekyll-Bootstrap specific configurations are namespaced into this hash
-#
-JB :
-  version : 0.3.0
-
-  # All links will be namespaced by BASE_PATH if defined.
-  # Links in your website should always be prefixed with {{BASE_PATH}}
-  # however this value will be dynamically changed depending on your 
deployment situation.
-  #
-  # CNAME (http://yourcustomdomain.com)
-  #   DO NOT SET BASE_PATH
-  #   (urls will be prefixed with "/" and work relatively)
-  #
-  # GitHub Pages (http://username.github.io)
-  #   DO NOT SET BASE_PATH
-  #   (urls will be prefixed with "/" and work relatively)
-  #
-  # GitHub Project Pages (http://username.github.io/project-name)
-  #
-  #   A GitHub Project site exists in the `gh-pages` branch of one of your 
repositories.
-  #  REQUIRED! Set BASE_PATH to: http://username.github.io/project-name
-  #
-  # CAUTION:
-  #   - When in Localhost, your site will run from root "/" regardless of 
BASE_PATH
-  #   - Only the following values are falsy: ["", null, false]
-  #   - When setting BASE_PATH it must be a valid url.
-  #     This means always setting the protocol (http|https) or prefixing with 
"/"
-  BASE_PATH : "/"
-
-  # By default, the asset_path is automatically defined relative to BASE_PATH 
plus the enabled theme.
-  # ex: [BASE_PATH]/assets/themes/[THEME-NAME]
-  #
-  # Override this by defining an absolute path to assets here.
-  # ex:
-  #   http://s3.amazonaws.com/yoursite/themes/watermelon
-  #   /assets
-  #
-  ASSET_PATH : false
-
-  # These paths are to the main pages Jekyll-Bootstrap ships with.
-  # Some JB helpers refer to these paths; change them here if needed.
-  #
-  archive_path: /archive.html
-  categories_path : /categories.html
-  tags_path : /tags.html
-  atom_path : /atom.xml
-  rss_path : /rss.xml
-
-  # Settings for comments helper
-  # Set 'provider' to the comment provider you want to use.
-  # Set 'provider' to false to turn commenting off globally.
-  #
-  comments :
-    provider : disqus
-    disqus :
-      short_name : jekyllbootstrap
-    livefyre :
-      site_id : 123
-    intensedebate :
-      account : 123abc
-    facebook :
-      appid : 123
-      num_posts: 5
-      width: 580
-      colorscheme: light
-
-  # Settings for analytics helper
-  # Set 'provider' to the analytics provider you want to use.
-  # Set 'provider' to false to turn analytics off globally.
-
-  # Settings for sharing helper.
-  # Sharing is for things like tweet, plusone, like, reddit buttons etc.
-  # Set 'provider' to the sharing provider you want to use.
-  # Set 'provider' to false to turn sharing off globally.
-  #
-  sharing :
-    provider : false
-
-  # Settings for all other include helpers can be defined by creating
-  # a hash with key named for the given helper. ex:
-  #
-  #   pages_list :
-  #     provider : "custom"
-  #
-  # Setting any helper's provider to 'custom' will bypass the helper code
-  # and include your custom code. Your custom file must be defined at:
-  #   ./_includes/custom/[HELPER]
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/README.md
----------------------------------------------------------------------
diff --git a/website/docs/README.md b/website/docs/README.md
index 62fcfca..68646fd 100755
--- a/website/docs/README.md
+++ b/website/docs/README.md
@@ -1,78 +1,16 @@
-# Jekyll-Bootstrap
 
-The quickest way to start and publish your Jekyll powered blog. 100% 
compatible with GitHub pages
+### Landing Page
 
-## Usage
+![landing](screenshots/landing.png)
 
-For all usage and documentation please see: <http://jekyllbootstrap.com>
+### Tutorials
 
-## Version
+Tutorials, Algorithms, MR-Tutorials, and MR-Algorithsm each have an accordion 
side bar for navigation
+(this is the part we need help building)
+![tutorials](screenshots/tutorials.png)
 
-0.3.0 - stable and versioned using [semantic versioning](http://semver.org/).
+### Old MR Algos
 
-**NOTE:** 0.3.0 introduces a new theme which is not backwards compatible in 
the sense it won't _look_ like the old version.
-However, the actual API has not changed at all.
-You might want to run 0.3.0 in a branch to make sure you are ok with the theme 
design changes.
 
-## Milestones
+![landing](screenshots/mr-algos.png)
 
-[0.4.0](https://github.com/plusjade/jekyll-bootstrap/milestones/v%200.4.0) - 
next release [ETA 03/29/2015]
-
-### GOALS
-
-* No open PRs against master branch.
-* Squash some bugs.
-* Add some new features (low-hanging fruit).
-* Establish social media presence.
-
-
-### Bugs
-
-|Bug |Description
-|------|---------------
-|[#86](https://github.com/plusjade/jekyll-bootstrap/issues/86)  |&#x2611; 
Facebook Comments
-|[#113](https://github.com/plusjade/jekyll-bootstrap/issues/113)|&#x2611; 
ASSET_PATH w/ page & post
-|[#144](https://github.com/plusjade/jekyll-bootstrap/issues/144)|&#x2610; 
BASE_PATH w/ FQDN
-|[#227](https://github.com/plusjade/jekyll-bootstrap/issues/227)|&#x2611; 
Redundant JB/setup
-
-### Features
-
-|Bug |Description
-|------|---------------
-|[#98](https://github.com/plusjade/jekyll-bootstrap/issues/98)  |&#x2611; GIST 
Integration
-|[#244](https://github.com/plusjade/jekyll-bootstrap/issues/244)|&#x2611; 
JB/file_exists Helper
-|[#42](https://github.com/plusjade/jekyll-bootstrap/issues/42)  |&#x2611; Sort 
collections of Pages / Posts
-|[#84](https://github.com/plusjade/jekyll-bootstrap/issues/84)  |&#x2610; 
Detecting production mode
-
-### TODOS
-
-Review existing pull requests against plusjake/jekyll-bootstrap:master. Merge 
or close each.
-
-* Create twitter account. Add link / icon on jekyllbootstrap.com.
-* Create blog posts under plusjade/gh-pages, expose on jekyllbootstrap.com, 
feed to twitter account.
-* Announce state of project, announce roadmap(s), announce new versions as they're released.
-
-## Contributing
-
-
-To contribute to the framework please make sure to checkout your branch based 
on `jb-development`!!
-This is very important as it allows me to accept your pull request without 
having to publish a public version release.
-
-Small, atomic Features, bugs, etc.
-Use the `jb-development` branch but note it will likely change fast as pull 
requests are accepted.
-Please rebase as often as possible when working.
-Work on small, atomic features/bugs to avoid upstream commits 
affecting/breaking your development work.
-
-For Big Features or major API extensions/edits:
-This is the one case where I'll accept pull-requests based off the master 
branch.
-This allows you to work in isolation but it means I'll have to manually merge 
your work into the next public release.
-Translation : it might take a bit longer so please be patient! (but sincerely 
thank you).
-
-**Jekyll-Bootstrap Documentation Website.**
-
-The documentation website at <http://jekyllbootstrap.com> is maintained at 
https://github.com/plusjade/jekyllbootstrap.com
-
-
-## License
-
-[MIT](http://opensource.org/licenses/MIT)

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/_config.yml
----------------------------------------------------------------------
diff --git a/website/docs/_config.yml b/website/docs/_config.yml
index f9ce00d..c13024e 100755
--- a/website/docs/_config.yml
+++ b/website/docs/_config.yml
@@ -2,7 +2,7 @@
 # For more see: http://jekyllrb.com/docs/permalinks/
 permalink: /:categories/:year/:month/:day/:title 
 
-exclude: [".rvmrc", ".rbenv-version", "README.md", "Rakefile", "changelog.md"]
+exclude: [".rvmrc", ".rbenv-version", "README.md", "Rakefile", "changelog.md", "screenshots"]
 highlighter: pygments
 
 # Themes are encouraged to use these universal variables 

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/_includes/tutorial_navbar.html
----------------------------------------------------------------------
diff --git a/website/docs/_includes/tutorial_navbar.html b/website/docs/_includes/tutorial_navbar.html
index 6a208b6..9ce3cf8 100644
--- a/website/docs/_includes/tutorial_navbar.html
+++ b/website/docs/_includes/tutorial_navbar.html
@@ -2,9 +2,17 @@
     <span><b>Tutorials</b></span>
     <div class="list-group panel">
         <a href="#linalg" class="list-group-item list-group-item-success" data-toggle="collapse" data-parent="#MrTutorialMenu"><b>Linear Algebra</b><i class="fa fa-caret-down"></i></a>
-        <div class="collapse" id="linalg">
-<ul class="nav sidebar-nav">
-    <li> <a href="{{ BASE_PATH }}/tutorials/eigenfaces">Eigenfaces Demo (Shell or Zeppelin)</a></li>
-
-</ul>
-            </div></div></div>
\ No newline at end of file
+            <div class="collapse" id="linalg">
+                <ul class="nav sidebar-nav">
+                    <li><a href="{{ BASE_PATH }}/tutorials/eigenfaces">Eigenfaces Demo (Shell or Zeppelin)</a></li>
+                </ul>
+            </div>
+        <a href="#reccomenders" class="list-group-item list-group-item-success" data-toggle="collapse" data-parent="#MrTutorialMenu"><b>Reccomenders</b><i class="fa fa-caret-down"></i></a>
+            <div class="collapse" id="reccomenders">
+                <ul class="nav sidebar-nav">
+                    <li><a href="{{ BASE_PATH }}/tutorials/cco-lastfm">CCO Example with Last.FM Data</a></li>
+                    <li><a href="{{ BASE_PATH }}/tutorials/intro-cooccurrence-spark">Introduction to Cooccurrence in Spark</a></li>
+                </ul>
+            </div>
+    </div>
+</div>

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/algorithms/reccomenders/intro-cooccurrence-spark.md
----------------------------------------------------------------------
diff --git a/website/docs/algorithms/reccomenders/intro-cooccurrence-spark.md b/website/docs/algorithms/reccomenders/intro-cooccurrence-spark.md
deleted file mode 100644
index d7d0185..0000000
--- a/website/docs/algorithms/reccomenders/intro-cooccurrence-spark.md
+++ /dev/null
@@ -1,446 +0,0 @@
----
-layout: algorithm
-title: Intro to Cooccurrence Recommenders with Spark
-theme:
-    name: retro-mahout
----
-
-# Intro to Cooccurrence Recommenders with Spark
-
-Mahout provides several important building blocks for creating recommendations 
using Spark. *spark-itemsimilarity* can 
-be used to create "other people also liked these things" type recommendations 
and paired with a search engine can 
-personalize recommendations for individual users. *spark-rowsimilarity* can 
provide non-personalized content based 
-recommendations and when paired with a search engine can be used to 
personalize content based recommendations.
-
-![image](http://s6.postimg.org/r0m8bpjw1/recommender_architecture.png)
-
-This is a simplified Lambda architecture with Mahout's *spark-itemsimilarity* 
playing the batch model building role and a search engine playing the realtime 
serving role.
-
-You will create two collections, one for user history and one for item 
"indicators". Indicators are user interactions that lead to the wished for 
interaction. So for example if you wish a user to purchase something and you 
collect all users purchase interactions *spark-itemsimilarity* will create a 
purchase indicator from them. But you can also use other user interactions in a 
cross-cooccurrence calculation, to create purchase indicators. 
-
-User history is used as a query on the item collection with its cooccurrence 
and cross-cooccurrence indicators (there may be several indicators). The 
primary interaction or action is picked to be the thing you want to recommend, 
other actions are believed to be corelated but may not indicate exactly the 
same user intent. For instance in an ecom recommender a purchase is a very good 
primary action, but you may also know product detail-views, or 
additions-to-wishlists. These can be considered secondary actions which may all 
be used to calculate cross-cooccurrence indicators. The user history that forms 
the recommendations query will contain recorded primary and secondary actions 
all targetted towards the correct indicator fields.
-
-## References
-
-1. A free ebook, which talks about the general idea: [Practical Machine 
Learning](https://www.mapr.com/practical-machine-learning)
-2. A slide deck, which talks about mixing actions or other indicators: 
[Creating a Unified 
Recommender](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
-3. Two blog posts: [What's New in Recommenders: part 
#1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/)
-and  [What's New in Recommenders: part 
#2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/)
-3. A post describing the loglikelihood ratio:  [Surprise and 
Coinsidense](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)
  LLR is used to reduce noise in the data while keeping the calculations O(n) 
complexity.
-
-Below are the command line jobs but the drivers and associated code can also 
be customized and accessed from the Scala APIs.
-
-## 1. spark-itemsimilarity
-*spark-itemsimilarity* is the Spark counterpart of the of the Mahout mapreduce 
job called *itemsimilarity*. It takes in elements of interactions, which have 
userID, itemID, and optionally a value. It will produce one of more indicator 
matrices created by comparing every user's interactions with every other user. 
The indicator matrix is an item x item matrix where the values are 
log-likelihood ratio strengths. For the legacy mapreduce version, there were 
several possible similarity measures but these are being deprecated in favor of 
LLR because in practice it performs the best.
-
-Mahout's mapreduce version of itemsimilarity takes a text file that is 
expected to have user and item IDs that conform to 
-Mahout's ID requirements--they are non-negative integers that can be viewed as 
row and column numbers in a matrix.
-
-*spark-itemsimilarity* also extends the notion of cooccurrence to 
cross-cooccurrence, in other words the Spark version will 
-account for multi-modal interactions and create cross-cooccurrence indicator 
matrices allowing the use of much more data in 
-creating recommendations or similar item lists. People try to do this by 
mixing different actions and giving them weights. 
-For instance they might say an item-view is 0.2 of an item purchase. In 
practice this is often not helpful. Spark-itemsimilarity's
-cross-cooccurrence is a more principled way to handle this case. In effect it 
scrubs secondary actions with the action you want
-to recommend.   
-
-
-    spark-itemsimilarity Mahout 1.0
-    Usage: spark-itemsimilarity [options]
-    
-    Disconnected from the target VM, address: '127.0.0.1:64676', transport: 
'socket'
-    Input, output options
-      -i <value> | --input <value>
-            Input path, may be a filename, directory name, or comma delimited 
list of HDFS supported URIs (required)
-      -i2 <value> | --input2 <value>
-            Secondary input path for cross-similarity calculation, same 
restrictions as "--input" (optional). Default: empty.
-      -o <value> | --output <value>
-            Path for output, any local or HDFS supported URI (required)
-    
-    Algorithm control options:
-      -mppu <value> | --maxPrefs <value>
-            Max number of preferences to consider per user (optional). 
Default: 500
-      -m <value> | --maxSimilaritiesPerItem <value>
-            Limit the number of similarities per item to this number 
(optional). Default: 100
-    
-    Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity 
measure.
-    
-    Input text file schema options:
-      -id <value> | --inDelim <value>
-            Input delimiter character (optional). Default: "[,\t]"
-      -f1 <value> | --filter1 <value>
-            String (or regex) whose presence indicates a datum for the primary 
item set (optional). Default: no filter, all data is used
-      -f2 <value> | --filter2 <value>
-            String (or regex) whose presence indicates a datum for the 
secondary item set (optional). If not present no secondary dataset is collected
-      -rc <value> | --rowIDColumn <value>
-            Column number (0 based Int) containing the row ID string 
(optional). Default: 0
-      -ic <value> | --itemIDColumn <value>
-            Column number (0 based Int) containing the item ID string 
(optional). Default: 1
-      -fc <value> | --filterColumn <value>
-            Column number (0 based Int) containing the filter string 
(optional). Default: -1 for no filter
-    
-    Using all defaults the input is expected of the form: "userID<tab>itemId" 
or "userID<tab>itemID<tab>any-text..." and all rows will be used
-    
-    File discovery options:
-      -r | --recursive
-            Searched the -i path recursively for files that match 
--filenamePattern (optional), Default: false
-      -fp <value> | --filenamePattern <value>
-            Regex to match in determining input files (optional). Default: 
filename in the --input option or "^part-.*" if --input is a directory
-    
-    Output text file schema options:
-      -rd <value> | --rowKeyDelim <value>
-            Separates the rowID key from the vector values list (optional). 
Default: "\t"
-      -cd <value> | --columnIdStrengthDelim <value>
-            Separates column IDs from their values in the vector values list 
(optional). Default: ":"
-      -td <value> | --elementDelim <value>
-            Separates vector element values in the values list (optional). 
Default: " "
-      -os | --omitStrength
-            Do not write the strength to the output files (optional), Default: 
false.
-    This option is used to output indexable data for creating a search engine 
recommender.
-    
-    Default delimiters will produce output of the form: 
"itemID1<tab>itemID2:value2<space>itemID10:value10..."
-    
-    Spark config options:
-      -ma <value> | --master <value>
-            Spark Master URL (optional). Default: "local". Note that you can 
specify the number of cores to get a performance improvement, for example 
"local[4]"
-      -sem <value> | --sparkExecutorMem <value>
-            Max Java heap available as "executor memory" on each node 
(optional). Default: 4g
-      -rs <value> | --randomSeed <value>
-            
-      -h | --help
-            prints this usage text
-
-This looks daunting but defaults to simple fairly sane values to take exactly 
the same input as legacy code and is pretty flexible. It allows the user to 
point to a single text file, a directory full of files, or a tree of 
directories to be traversed recursively. The files included can be specified 
with either a regex-style pattern or filename. The schema for the file is 
defined by column numbers, which map to the important bits of data including 
IDs and values. The files can even contain filters, which allow unneeded rows 
to be discarded or used for cross-cooccurrence calculations.
-
-See ItemSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. 
-
-### Defaults in the _**spark-itemsimilarity**_ CLI
-
-If all defaults are used the input can be as simple as:
-
-    userID1,itemID1
-    userID2,itemID2
-    ...
-
-With the command line:
-
-
-    bash$ mahout spark-itemsimilarity --input in-file --output out-dir
-
-
-This will use the "local" Spark context and will output the standard text 
version of a DRM
-
-    itemID1<tab>itemID2:value2<space>itemID10:value10...
-
-### <a name="multiple-actions">How To Use Multiple User Actions</a>
-
-Often we record various actions the user takes for later analytics. These can 
now be used to make recommendations. 
-The idea of a recommender is to recommend the action you want the user to 
make. For an ecom app this might be 
-a purchase action. It is usually not a good idea to just treat other actions 
the same as the action you want to recommend. 
-For instance a view of an item does not indicate the same intent as a purchase 
and if you just mixed the two together you 
-might even make worse recommendations. It is tempting though since there are 
so many more views than purchases. With *spark-itemsimilarity*
-we can now use both actions. Mahout will use cross-action cooccurrence 
analysis to limit the views to ones that do predict purchases.
-We do this by treating the primary action (purchase) as data for the indicator 
matrix and use the secondary action (view) 
-to calculate the cross-cooccurrence indicator matrix.  
-
-*spark-itemsimilarity* can read separate actions from separate files or from a 
mixed action log by filtering certain lines. For a mixed 
-action log of the form:
-
-    u1,purchase,iphone
-    u1,purchase,ipad
-    u2,purchase,nexus
-    u2,purchase,galaxy
-    u3,purchase,surface
-    u4,purchase,iphone
-    u4,purchase,galaxy
-    u1,view,iphone
-    u1,view,ipad
-    u1,view,nexus
-    u1,view,galaxy
-    u2,view,iphone
-    u2,view,ipad
-    u2,view,nexus
-    u2,view,galaxy
-    u3,view,surface
-    u3,view,nexus
-    u4,view,iphone
-    u4,view,ipad
-    u4,view,galaxy
-
-###Command Line
-
-
-Use the following options:
-
-    bash$ mahout spark-itemsimilarity \
-       --input in-file \     # where to look for data
-        --output out-path \   # root dir for output
-        --master masterUrl \  # URL of the Spark master server
-        --filter1 purchase \  # word that flags input for the primary action
-        --filter2 view \      # word that flags input for the secondary action
-        --itemIDPosition 2 \  # column that has the item ID
-        --rowIDPosition 0 \   # column that has the user ID
-        --filterPosition 1    # column that has the filter word
-
-
-
-### Output
-
-The output of the job will be the standard text version of two Mahout DRMs. 
This is a case where we are calculating 
-cross-cooccurrence so a primary indicator matrix and cross-cooccurrence 
indicator matrix will be created
-
-    out-path
-      |-- similarity-matrix - TDF part files
-      \-- cross-similarity-matrix - TDF part-files
-
-The similarity-matrix will contain the lines:
-
-    galaxy\tnexus:1.7260924347106847
-    ipad\tiphone:1.7260924347106847
-    nexus\tgalaxy:1.7260924347106847
-    iphone\tipad:1.7260924347106847
-    surface
-
-The cross-similarity-matrix will contain:
-
-    iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 
ipad:1.7260924347106847 galaxy:1.7260924347106847
-    ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 
ipad:0.6795961471815897 galaxy:0.6795961471815897
-    nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 
ipad:0.6795961471815897 galaxy:0.6795961471815897
-    galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 
ipad:1.7260924347106847 galaxy:1.7260924347106847
-    surface\tsurface:4.498681156950466 nexus:0.6795961471815897
-
-**Note:** You can run this multiple times to use more than two actions or you 
can use the underlying 
-SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any 
number of cross-cooccurrence indicators.
-
-### Log File Input
- 
-A common method of storing data is in log files. If they are written using 
some delimiter they can be consumed directly by spark-itemsimilarity. For 
instance input of the form:
-
-    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad
-    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus
-    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy
-    2014-06-23 14:46:53.115\tu3\tpurchase\trandom text\tsurface
-    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tgalaxy
-    2014-06-23 14:46:53.115\tu1\tview\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad
-    2014-06-23 14:46:53.115\tu1\tview\trandom text\tnexus
-    2014-06-23 14:46:53.115\tu1\tview\trandom text\tgalaxy
-    2014-06-23 14:46:53.115\tu2\tview\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu2\tview\trandom text\tipad
-    2014-06-23 14:46:53.115\tu2\tview\trandom text\tnexus
-    2014-06-23 14:46:53.115\tu2\tview\trandom text\tgalaxy
-    2014-06-23 14:46:53.115\tu3\tview\trandom text\tsurface
-    2014-06-23 14:46:53.115\tu3\tview\trandom text\tnexus
-    2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad
-    2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy    
-
-Can be parsed with the following CLI and run on the cluster producing the same 
output as the above example.
-
-    bash$ mahout spark-itemsimilarity \
-        --input in-file \
-        --output out-path \
-        --master spark://sparkmaster:4044 \
-        --filter1 purchase \
-        --filter2 view \
-        --inDelim "\t" \
-        --itemIDPosition 4 \
-        --rowIDPosition 1 \
-        --filterPosition 2
-
-## 2. spark-rowsimilarity
-
-*spark-rowsimilarity* is the companion to *spark-itemsimilarity* the primary 
difference is that it takes a text file version of 
-a matrix of sparse vectors with optional application specific IDs and it finds 
similar rows rather than items (columns). Its use is
-not limited to collaborative filtering. The input is in text-delimited form 
where there are three delimiters used. By 
-default it reads 
(rowID&lt;tab>columnID1:strength1&lt;space>columnID2:strength2...) Since this 
job only supports LLR similarity,
- which does not use the input strengths, they may be omitted in the input. It 
writes 
-(rowID&lt;tab>rowID1:strength1&lt;space>rowID2:strength2...) 
-The output is sorted by strength descending. The output can be interpreted as 
a row ID from the primary input followed 
-by a list of the most similar rows.
-
-The command line interface is:
-
-    spark-rowsimilarity Mahout 1.0
-    Usage: spark-rowsimilarity [options]
-    
-    Input, output options
-      -i <value> | --input <value>
-            Input path, may be a filename, directory name, or comma delimited 
list of HDFS supported URIs (required)
-      -o <value> | --output <value>
-            Path for output, any local or HDFS supported URI (required)
-    
-    Algorithm control options:
-      -mo <value> | --maxObservations <value>
-            Max number of observations to consider per row (optional). 
Default: 500
-      -m <value> | --maxSimilaritiesPerRow <value>
-            Limit the number of similarities per item to this number 
(optional). Default: 100
-    
-    Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity 
measure.
-    Disconnected from the target VM, address: '127.0.0.1:49162', transport: 
'socket'
-    
-    Output text file schema options:
-      -rd <value> | --rowKeyDelim <value>
-            Separates the rowID key from the vector values list (optional). 
Default: "\t"
-      -cd <value> | --columnIdStrengthDelim <value>
-            Separates column IDs from their values in the vector values list 
(optional). Default: ":"
-      -td <value> | --elementDelim <value>
-            Separates vector element values in the values list (optional). 
Default: " "
-      -os | --omitStrength
-            Do not write the strength to the output files (optional), Default: 
false.
-    This option is used to output indexable data for creating a search engine 
recommender.
-    
-    Default delimiters will produce output of the form: 
"itemID1<tab>itemID2:value2<space>itemID10:value10..."
-    
-    File discovery options:
-      -r | --recursive
-            Searched the -i path recursively for files that match 
--filenamePattern (optional), Default: false
-      -fp <value> | --filenamePattern <value>
-            Regex to match in determining input files (optional). Default: 
filename in the --input option or "^part-.*" if --input is a directory
-    
-    Spark config options:
-      -ma <value> | --master <value>
-            Spark Master URL (optional). Default: "local". Note that you can 
specify the number of cores to get a performance improvement, for example 
"local[4]"
-      -sem <value> | --sparkExecutorMem <value>
-            Max Java heap available as "executor memory" on each node 
(optional). Default: 4g
-      -rs <value> | --randomSeed <value>
-            
-      -h | --help
-            prints this usage text
-
-See RowSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. 
-
-# 3. Using *spark-rowsimilarity* with Text Data
-
-Another use case for *spark-rowsimilarity* is in finding similar textual 
content. For instance given the tags associated with 
-a blog post,
- which other posts have similar tags. In this case the columns are tags and 
the rows are posts. Since LLR is 
-the only similarity method supported this is not the optimal way to determine 
general "bag-of-words" document similarity. 
-LLR is used more as a quality filter than as a similarity measure. However 
*spark-rowsimilarity* will produce 
-lists of similar docs for every doc if input is docs with lists of terms. The 
Apache [Lucene](http://lucene.apache.org) project provides several methods of 
[analyzing and 
tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description)
 documents.
-
-# <a name="unified-recommender">4. Creating a Multimodal Recommender</a>
-
-Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can build a multimodal cooccurrence and content based
- recommender that can be used in both or either mode depending on indicators available and the history available at 
-runtime for a user. Some slides describing this method can be found [here](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
-
-## Requirements
-
-1. Mahout SNAPSHOT-1.0 or later
-2. Hadoop
-3. Spark, the correct version for your version of Mahout and Hadoop
-4. A search engine like Solr or Elasticsearch
-
-## Indicators
-
-Indicators come in 3 types
-
-1. **Cooccurrence**: calculated with *spark-itemsimilarity* from user actions
-2. **Content**: calculated from item metadata or content using 
*spark-rowsimilarity*
-3. **Intrinsic**: assigned to items as metadata. Can be anything that 
describes the item.
-
-The query for recommendations will be a mix of values meant to match one of 
your indicators. The query can be constructed 
-from user history and values derived from context (category being viewed for 
instance) or special precalculated data 
-(popularity rank for instance). This blending of indicators allows for creating many flavors of recommendations to fit 
-a very wide variety of circumstances.
-
-With the right mix of indicators developers can construct a single query that 
works for completely new items and new users 
-while working well for items with lots of interactions and users with many 
recorded actions. In other words by adding in content and intrinsic 
-indicators developers can create a solution for the "cold-start" problem that 
gracefully improves with more user history
-and as items have more interactions. It is also possible to create a 
completely content-based recommender that personalizes 
-recommendations.
-
-## Example with 3 Indicators
-
-You will need to decide how you store user action data so they can be 
processed by the item and row similarity jobs and 
-this is most easily done by using text files as described above. The data that 
is processed by these jobs is considered the 
-training data. You will need some amount of user history in your recs query. 
It is typical to use the most recent user history, 
-but it need not be exactly what is in the training set, which may include a 
greater volume of historical data. Keeping the user 
-history for query purposes could be done with a database by storing it in a 
users table. In the example above the two 
-collaborative filtering actions are "purchase" and "view", but let's also add 
tags (taken from catalog categories or other 
-descriptive metadata). 
-
-We will need to create 1 cooccurrence indicator from the primary action 
(purchase) 1 cross-action cooccurrence indicator 
-from the secondary action (view) 
-and 1 content indicator (tags). We'll have to run *spark-itemsimilarity* once 
and *spark-rowsimilarity* once.
-
-We have described how to create the collaborative filtering indicators for 
purchase and view (the [How to use Multiple User 
-Actions](#multiple-actions) section) but tags will be a slightly different 
process. We want to use the fact that 
-certain items have tags similar to the ones associated with a user's 
purchases. This is not a collaborative filtering indicator 
-but rather a "content" or "metadata" type indicator since you are not using 
other users' history, only the 
-individual that you are making recs for. This means that this method will make 
recommendations for items that have 
-no collaborative filtering data, as happens with new items in a catalog. New 
items may have tags assigned but no one
- has purchased or viewed them yet. In the final query we will mix all 3 
indicators.
-
-## Content Indicator
-
-To create a content-indicator we'll make use of the fact that the user has 
purchased items with certain tags. We want to find 
-items with the most similar tags. Notice that other users' behavior is not 
considered--only other items' tags. This defines a 
-content or metadata indicator. They are used when you want to find items that 
are similar to other items by using their 
-content or metadata, not by which users interacted with them.
-
-**Note**: It may be advisable to treat tags as cross-cooccurrence indicators 
but for the sake of an example they are treated here as content only.
-
-For this we need input of the form:
-
-    itemID<tab>list-of-tags
-    ...
-
-The full collection will look like the tags column from a catalog DB. For our 
ecom example it might be:
-
-    3459860b<tab>men long-sleeve chambray clothing casual
-    9446577d<tab>women tops chambray clothing casual
-    ...
-
-We'll use *spark-rowsimilarity* because we are looking for similar rows, which 
encode items in this case. As with the 
-collaborative filtering indicators we use the --omitStrength option. The 
strengths created are 
-probabilistic log-likelihood ratios and so are used to filter unimportant 
similarities. Once the filtering or downsampling 
-is finished we no longer need the strengths. We will get an indicator matrix 
of the form:
-
-    itemID<tab>list-of-item IDs
-    ...
-
-This is a content indicator since it has found other items with similar 
content or metadata.
-
-    3459860b<tab>3459860b 3459860b 6749860c 5959860a 3434860a 3477860a
-    9446577d<tab>9446577d 9496577d 0943577d 8346577d 9442277d 9446577e
-    ...  
-    
-We now have three indicators, two collaborative filtering type and one content 
type.
-
-##  Multimodal Recommender Query
-
-The actual form of the query for recommendations will vary depending on your 
search engine but the intent is the same. For a given user, map their history 
of an action or content to the correct indicator field and perform an OR'd 
query. 
-
-We have 3 indicators, these are indexed by the search engine into 3 fields, 
we'll call them "purchase", "view", and "tags". 
-We take the user's history that corresponds to each indicator and create a 
query of the form:
-
-    Query:
-      field: purchase; q:user's-purchase-history
-      field: view; q:user's view-history
-      field: tags; q:user's-tags-associated-with-purchases
-      
-The query will result in an ordered list of items recommended for purchase but 
skewed towards items with similar tags to 
-the ones the user has already purchased. 
-
-This is only an example and not necessarily the optimal way to create recs. It 
illustrates how business decisions can be 
-translated into recommendations. This technique can be used to skew 
recommendations towards intrinsic indicators also. 
-For instance you may want to put personalized popular item recs in a special 
place in the UI. Create a popularity indicator 
-by tagging items with some category of popularity (hot, warm, cold for 
instance) then
-index that as a new indicator field and include the corresponding value in a 
query 
-on the popularity field. If we use the ecom example but use the query to get 
"hot" recommendations it might look like this:
-
-    Query:
-      field: purchase; q:user's-purchase-history
-      field: view; q:user's view-history
-      field: popularity; q:"hot"
-
-This will return recommendations favoring ones that have the intrinsic 
indicator "hot".
-
-## Notes
-1. Use as much user action history as you can gather. Choose a primary action 
that is closest to what you want to recommend and the others will be used to 
create cross-cooccurrence indicators. Using more data in this fashion will 
almost always produce better recommendations.
-2. Content can be used where there is no recorded user behavior or when items 
change too quickly to get much interaction history. They can be used alone or 
mixed with other indicators.
-3. Most search engines support "boost" factors so you can favor one or more 
indicators. In the example query, if you want tags to only have a small effect 
you could boost the CF indicators.
-4. In the examples we have used space delimited strings for lists of IDs in 
indicators and in queries. It may be better to use arrays of strings if your 
storage system and search engine support them. For instance Solr allows 
multi-valued fields, which correspond to arrays.

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/native-solvers/cuda.md
----------------------------------------------------------------------
diff --git a/website/docs/native-solvers/cuda.md 
b/website/docs/native-solvers/cuda.md
index 8d1f25f..1ec7807 100644
--- a/website/docs/native-solvers/cuda.md
+++ b/website/docs/native-solvers/cuda.md
@@ -1,6 +1,6 @@
 ---
 layout: page
-title: Native Solvers: CUDA
+title: Native Solvers- CUDA
 theme:
     name: mahout2
----
\ No newline at end of file
+---

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/native-solvers/viennacl-omp.md
----------------------------------------------------------------------
diff --git a/website/docs/native-solvers/viennacl-omp.md 
b/website/docs/native-solvers/viennacl-omp.md
index 85a2b64..7540ad3 100644
--- a/website/docs/native-solvers/viennacl-omp.md
+++ b/website/docs/native-solvers/viennacl-omp.md
@@ -1,6 +1,6 @@
 ---
 layout: page
-title: Native Solvers: ViennaCL-OMP
+title: Native Solvers- ViennaCL-OMP
 theme:
     name: mahout2
 ---
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/native-solvers/viennacl.md
----------------------------------------------------------------------
diff --git a/website/docs/native-solvers/viennacl.md 
b/website/docs/native-solvers/viennacl.md
index 491a636..d41e0f7 100644
--- a/website/docs/native-solvers/viennacl.md
+++ b/website/docs/native-solvers/viennacl.md
@@ -1,6 +1,6 @@
 ---
 layout: page
-title: Native Solvers: ViennaCL
+title: Native Solvers- ViennaCL
 theme:
     name: mahout2
 ---
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/screenshots/landing.png
----------------------------------------------------------------------
diff --git a/website/docs/screenshots/landing.png 
b/website/docs/screenshots/landing.png
new file mode 100644
index 0000000..d879e46
Binary files /dev/null and b/website/docs/screenshots/landing.png differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/screenshots/mr-algos.png
----------------------------------------------------------------------
diff --git a/website/docs/screenshots/mr-algos.png 
b/website/docs/screenshots/mr-algos.png
new file mode 100644
index 0000000..34b4f53
Binary files /dev/null and b/website/docs/screenshots/mr-algos.png differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/screenshots/tutorials.png
----------------------------------------------------------------------
diff --git a/website/docs/screenshots/tutorials.png 
b/website/docs/screenshots/tutorials.png
new file mode 100644
index 0000000..500187a
Binary files /dev/null and b/website/docs/screenshots/tutorials.png differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/tutorials/cco-lastfm/cco-lastfm.scala
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/cco-lastfm/cco-lastfm.scala 
b/website/docs/tutorials/cco-lastfm/cco-lastfm.scala
new file mode 100644
index 0000000..dc99d57
--- /dev/null
+++ b/website/docs/tutorials/cco-lastfm/cco-lastfm.scala
@@ -0,0 +1,89 @@
+
+/**
+  * Created by rawkintrevo on 4/5/17.
+  */
+
+// Only needed so IntelliJ doesn't whine
+
+import org.apache.mahout.drivers.ItemSimilarityDriver.parser
+import org.apache.mahout.math._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+import org.apache.spark.SparkContext
+import org.apache.spark.SparkContext._
+import org.apache.spark.SparkConf
+val conf = new SparkConf().setAppName("Simple Application")
+val sc = new SparkContext(conf)
+
+implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = 
sc2sdc(sc)
+
+
+// </pandering to intellij>
+
+// http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip
+// start mahout shell like this: $MAHOUT_HOME/bin/mahout spark-shell
+
+import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
+
+// We need to turn our raw text files into RDD[(String, String)] 
+val userTagsRDD = 
sc.textFile("/home/rawkintrevo/gits/MahoutExamples/data/lastfm/user_taggedartists.dat").map(line
 => line.split("\t")).map(a => (a(0), a(2))).filter(_._1 != "userID")
+val userTagsIDS = IndexedDatasetSpark.apply(userTagsRDD)(sc)
+
+val userArtistsRDD = 
sc.textFile("/home/rawkintrevo/gits/MahoutExamples/data/lastfm/user_artists.dat").map(line
 => line.split("\t")).map(a => (a(0), a(1))).filter(_._1 != "userID")
+val userArtistsIDS = IndexedDatasetSpark.apply(userArtistsRDD)(sc)
+
+val userFriendsRDD = 
sc.textFile("/home/rawkintrevo/gits/MahoutExamples/data/lastfm/user_friends.dat").map(line
 => line.split("\t")).map(a => (a(0), a(1))).filter(_._1 != "userID")
+val userFriendsIDS = IndexedDatasetSpark.apply(userFriendsRDD)(sc)
+
+import org.apache.mahout.math.cf.SimilarityAnalysis
+
+val artistReccosLlrDrmListByArtist = 
SimilarityAnalysis.cooccurrencesIDSs(Array(userArtistsIDS, userTagsIDS, 
userFriendsIDS), maxInterestingItemsPerThing = 20, maxNumInteractions = 500, 
randomSeed = 1234)
+
+// Anonymous User
+
+val artistMap = 
sc.textFile("/home/rawkintrevo/gits/MahoutExamples/data/lastfm/artists.dat").map(line
 => line.split("\t")).map(a => (a(1), a(0))).filter(_._1 != 
"name").collect.toMap
+val tagsMap = 
sc.textFile("/home/rawkintrevo/gits/MahoutExamples/data/lastfm/tags.dat").map(line
 => line.split("\t")).map(a => (a(1), a(0))).filter(_._1 != 
"tagValue").collect.toMap
+
+// Watch your skin- you're not wearing armour. (This will fail on misspelled artists.)
+// This is necessary because the ids are integer-strings already, and for this demo I didn't want to change them to Integer types (because more often you'll have strings).
+val kilroyUserArtists = svec( 
(userArtistsIDS.columnIDs.get(artistMap("Beck")).get, 1) ::
+  (userArtistsIDS.columnIDs.get(artistMap("David Bowie")).get, 1) ::
+  (userArtistsIDS.columnIDs.get(artistMap("Gary Numan")).get, 1) ::
+  (userArtistsIDS.columnIDs.get(artistMap("Less Than Jake")).get, 1) ::
+  (userArtistsIDS.columnIDs.get(artistMap("Lou Reed")).get, 1) ::
+  (userArtistsIDS.columnIDs.get(artistMap("Parliament")).get, 1) ::
+  (userArtistsIDS.columnIDs.get(artistMap("Radiohead")).get, 1) ::
+  (userArtistsIDS.columnIDs.get(artistMap("Seu Jorge")).get, 1) ::
+  (userArtistsIDS.columnIDs.get(artistMap("The Skatalites")).get, 1) ::
+  (userArtistsIDS.columnIDs.get(artistMap("Reverend Horton Heat")).get, 1) ::
+  (userArtistsIDS.columnIDs.get(artistMap("Talking Heads")).get, 1) ::
+  (userArtistsIDS.columnIDs.get(artistMap("Tom Waits")).get, 1) ::
+  (userArtistsIDS.columnIDs.get(artistMap("Waylon Jennings")).get, 1) ::
+  (userArtistsIDS.columnIDs.get(artistMap("Wu-Tang Clan")).get, 1) :: Nil, 
cardinality = userArtistsIDS.columnIDs.size
+)
+
+val kilroyUserTags = svec(
+  (userTagsIDS.columnIDs.get(tagsMap("classical")).get, 1) ::
+  (userTagsIDS.columnIDs.get(tagsMap("skacore")).get, 1) ::
+  (userTagsIDS.columnIDs.get(tagsMap("why on earth is this just a bonus track")).get, 1) ::
+  (userTagsIDS.columnIDs.get(tagsMap("punk rock")).get, 1) :: Nil, cardinality 
= userTagsIDS.columnIDs.size)
+
+val kilroysRecs = (artistReccosLlrDrmListByArtist(0).matrix %*% 
kilroyUserArtists + artistReccosLlrDrmListByArtist(1).matrix %*% 
kilroyUserTags).collect
+
+
+import org.apache.mahout.math.scalabindings.MahoutCollections._
+import collection._
+import JavaConversions._
+
+// Which Users I should Be Friends with.
+println(kilroysRecs(::, 0).toMap.toList.sortWith(_._2 > _._2).take(5))
+
+/**
+  * So there you have it- the basis for a new dating/friend finding app based 
on musical preferences which
+  * is actually a pretty dope idea.
+  *
+  * Solving for which bands a user might like is left as an exercise to the 
reader.
+  */
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/tutorials/cco-lastfm/index.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/cco-lastfm/index.md 
b/website/docs/tutorials/cco-lastfm/index.md
new file mode 100644
index 0000000..0e90805
--- /dev/null
+++ b/website/docs/tutorials/cco-lastfm/index.md
@@ -0,0 +1,149 @@
+---
+layout: tutorial
+title: CCOs with Last.fm
+theme:
+    name: mahout2
+---
+
+Most recommender examples utilize the MovieLens dataset, but that relies only on ratings (which makes the recommender being demonstrated look less trivial).  Right next to the MovieLens dataset is the LastFM dataset.  The LastFM dataset has ratings by user, friends of the user, bands listened to by user, and tags by user.  This is the kind of exciting dataset we'd like to work with!
+
+Start by downloading the LastFM dataset from 
+http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip
+
+I'm going to assume you've unzipped them to /path/to/lastfm/*
+
+We're going to use a new trick for creating our IndexedDataSets, the `apply` function.  `apply` takes an `RDD[(String, String)]`, that is, an RDD of tuples where both elements are strings. We load RDDs, and use Spark to manipulate the RDDs into this form.  The files from LastFM are tab separated, but note that this could just as easily be done from log files; it would just take a touch more Spark-Fu.  
+
+The second important thing to note is that the first element in each tuple is 
going to be the rows in the resulting matrix, the second element will be the 
column, and at that position there will be a one.  The BiDictionary will 
automatically be created from the strings. 
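+
+To make that concrete, here is a minimal plain-Scala sketch (no Spark or Mahout needed) of the idea. The names `pairs`, `index`, `rowIDs`, `columnIDs` and `ones` are made up for illustration, and the index order Mahout actually assigns may differ; this only shows how string pairs turn into matrix positions.
+
+```scala
+// Illustrative stand-in for building a matrix from (rowKey, columnKey) pairs.
+val pairs = Seq(("u1", "rock"), ("u1", "jazz"), ("u2", "rock"))
+
+// Give each distinct string an integer index (here: order of first appearance).
+def index(keys: Seq[String]): Map[String, Int] = keys.distinct.zipWithIndex.toMap
+
+val rowIDs = index(pairs.map(_._1))    // first tuple element -> row index
+val columnIDs = index(pairs.map(_._2)) // second tuple element -> column index
+
+// Each pair marks a 1 at (row index, column index); every other cell is 0.
+val ones: Set[(Int, Int)] = pairs.map { case (r, c) => (rowIDs(r), columnIDs(c)) }.toSet
+```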
+
+```
+import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
+
+val userTagsRDD = sc.textFile("/path/to/lastfm/user_taggedartists.dat")
+.map(line => line.split("\t"))
+.map(a => (a(0), a(2)))
+.filter(_._1 != "userID")
+val userTagsIDS = IndexedDatasetSpark.apply(userTagsRDD)(sc)
+
+val userArtistsRDD = sc.textFile("/path/to/lastfm/user_artists.dat")
+.map(line => line.split("\t"))
+.map(a => (a(0), a(1)))
+.filter(_._1 != "userID")
+val userArtistsIDS = IndexedDatasetSpark.apply(userArtistsRDD)(sc)
+
+val userFriendsRDD = sc.textFile("/path/to/lastfm/user_friends.dat")
+.map(line => line.split("\t"))
+.map(a => (a(0), a(1)))
+.filter(_._1 != "userID")
+val userFriendsIDS = IndexedDatasetSpark.apply(userFriendsRDD)(sc)
+```
+
+How much easier was that?! In each RDD creation we:
+
+Load our data using sc.textFile
+    
+    sc.textFile("/path/to/lastfm/user_taggedartists.dat")
+
+Split the data into an array based on tabs (\t)
+
+    .map(line => line.split("\t"))
+
+Pull the userID column into the first position of the tuple, and the other attribute we want into the second position (that is `a(1)` for the artists and friends files, and `a(2)` for the tags file).
+
+    .map(a => (a(0), a(1)))
+
+Remove the header (the only line that will have "userID" in that position)
+
+    .filter(_._1 != "userID")
+
+Then we easily create an IndexedDataSet using the `apply` method. 
+
+    val userTagsIDS = IndexedDatasetSpark.apply(userTagsRDD)(sc)
+
+Note the `(sc)` at the end. You may or may not need that.  `sc` is the SparkContext and should be passed as an implicit parameter, however the REPL environment (e.g. Mahout Shell or notebooks) has a hard time with the implicits, so I had to pass it explicitly.  
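+
+If implicit parameters are new to you, this small sketch shows the mechanics in plain Scala. `Ctx` and `makeDataset` are hypothetical stand-ins (for the SparkContext and for a method that takes an implicit context); they are not Mahout or Spark APIs.
+
+```scala
+// Hypothetical stand-ins, for illustration only.
+case class Ctx(name: String)
+
+def makeDataset(data: Seq[(String, String)])(implicit ctx: Ctx): String =
+  s"${data.size} pairs built with ${ctx.name}"
+
+val data = Seq(("u1", "a1"), ("u2", "a2"))
+val ctx = Ctx("shell-context")
+
+// Supplying the second argument list by hand always works.
+val viaExplicit = makeDataset(data)(ctx)
+
+// With an implicit value in scope, the compiler fills it in for us.
+val viaImplicit = {
+  implicit val implicitCtx: Ctx = ctx
+  makeDataset(data)
+}
+```
+
+Both calls give the same result; passing the context explicitly is just the fallback when the implicit cannot be resolved.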
+
+Now we compute our co-occurrence matrices:
+```scala
+import org.apache.mahout.math.cf.SimilarityAnalysis
+
+val artistReccosLlrDrmListByArtist = SimilarityAnalysis.cooccurrencesIDSs(
+Array(userArtistsIDS, userTagsIDS, userFriendsIDS), 
+maxInterestingItemsPerThing = 20,
+maxNumInteractions = 500, 
+randomSeed = 1234)
+```
+
+
+Let's see an example of how this would work-
+
+First we have a small problem. If you look at our original input files, the userIDs, artistIDs, and tags were all integers. We loaded them as strings, and if you look at the BiDictionaries associated with each IDS, you'll see they map the original integers (as strings) to the integer indices of our matrix. Not super helpful.  There are other files which contain mappings from LastFM ID to human readable band and tag names.  I could have sorted this out in the beginning, but I chose to do it on the backside because it is a bit of clever Spark/Scala only needed to work around a quirk in this particular dataset.  We have to reverse map a few things if we want to input "human readable" attributes, which I did.  If this doesn't make sense, please don't be discouraged- the important part was above, this is just some magic for working with this dataset in a pretty way. 
+
+First I load the mapping files and create in-core maps from them:
+
+```scala
+val artistMap = sc.textFile("/path/to/lastfm/artists.dat")
+  .map(line => line.split("\t"))
+  .map(a => (a(1), a(0)))
+  .filter(_._1 != "name")
+  .collect
+  .toMap
+
+val tagsMap = sc.textFile("/path/to/lastfm/tags.dat")
+  .map(line => line.split("\t"))
+  .map(a => (a(1), a(0)))
+  .filter(_._1 != "tagValue")
+  .collect
+  .toMap
+
+```
+
+This will create some `Map`s that I can use to type readable names for the artist and tags to create my "history".
+
+```scala
+val kilroyUserArtists = svec( 
(userArtistsIDS.columnIDs.get(artistMap("Beck")).get, 1) ::
+ (userArtistsIDS.columnIDs.get(artistMap("David Bowie")).get, 1) ::
+ (userArtistsIDS.columnIDs.get(artistMap("Gary Numan")).get, 1) ::
+ (userArtistsIDS.columnIDs.get(artistMap("Less Than Jake")).get, 1) ::
+ (userArtistsIDS.columnIDs.get(artistMap("Lou Reed")).get, 1) ::
+ (userArtistsIDS.columnIDs.get(artistMap("Parliament")).get, 1) ::
+ (userArtistsIDS.columnIDs.get(artistMap("Radiohead")).get, 1) ::
+ (userArtistsIDS.columnIDs.get(artistMap("Seu Jorge")).get, 1) ::
+ (userArtistsIDS.columnIDs.get(artistMap("The Skatalites")).get, 1) ::
+ (userArtistsIDS.columnIDs.get(artistMap("Reverend Horton Heat")).get, 1) ::
+ (userArtistsIDS.columnIDs.get(artistMap("Talking Heads")).get, 1) ::
+ (userArtistsIDS.columnIDs.get(artistMap("Tom Waits")).get, 1) ::
+ (userArtistsIDS.columnIDs.get(artistMap("Waylon Jennings")).get, 1) ::
+ (userArtistsIDS.columnIDs.get(artistMap("Wu-Tang Clan")).get, 1) :: Nil, 
+ cardinality = userArtistsIDS.columnIDs.size
+)
+
+
+
+val kilroyUserTags = svec(
+ (userTagsIDS.columnIDs.get(tagsMap("classical")).get, 1) ::
+ (userTagsIDS.columnIDs.get(tagsMap("skacore")).get, 1) ::
+ (userTagsIDS.columnIDs.get(tagsMap("why on earth is this just a bonus track")).get, 1) ::
+ (userTagsIDS.columnIDs.get(tagsMap("punk rock")).get, 1) :: Nil,
+ cardinality = userTagsIDS.columnIDs.size)
+```
+
+So what we have then is me typing a name into `artistMap`, where the keys are human readable names of my favorite bands. That returns the LastFM ID, which in turn is the key in the BiDictionary map, which returns the matrix position.  I'm making a sparse vector where I want the index I just fetched (which in a roundabout way refers to the artist I specified) to have the value 1.  
+
+Same idea for the tags. 
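+
+The lookups above will blow up on a misspelled name. As a hedged sketch of one way around that, here is the same two-step chain in plain Scala using `Option`. The maps and IDs below (`nameToLastFmId`, `lastFmIdToIndex`, `lfm-1`, `lfm-2`) are made-up stand-ins for `artistMap` and the column dictionary, not real LastFM data.
+
+```scala
+// Hypothetical stand-ins: name -> LastFM ID, and LastFM ID -> matrix index.
+val nameToLastFmId = Map("Beck" -> "lfm-1", "Radiohead" -> "lfm-2")
+val lastFmIdToIndex = Map("lfm-1" -> 0, "lfm-2" -> 1)
+
+val wanted = Seq("Beck", "Radiohead", "Radiohed") // the last one is misspelled
+
+// Chain the two lookups; a missing key at either step yields None.
+def indexOf(name: String): Option[Int] =
+  nameToLastFmId.get(name).flatMap(id => lastFmIdToIndex.get(id))
+
+// Split the names into those we can resolve and those we cannot.
+val (found, missing) = wanted.partition(n => indexOf(n).isDefined)
+
+// (index, 1) pairs, the same shape as the entries passed to svec above.
+val history: List[(Int, Int)] = found.flatMap(n => indexOf(n).toList).map(i => (i, 1)).toList
+```
+
+Here `missing` collects the names that could not be resolved, so you can report them instead of crashing.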
+
+I now have two history vectors.  I didn't make one for the users table, because I don't have any friends on LastFM yet. That's about to change though, because I'm about to have some friends recommended to me. 
+
+    val kilroysRecs = (artistReccosLlrDrmListByArtist(0).matrix %*% kilroyUserArtists + artistReccosLlrDrmListByArtist(1).matrix %*% kilroyUserTags).collect
+
+Finally let's sort that vector out and get some user ids and strengths. 
+```scala
+import org.apache.mahout.math.scalabindings.MahoutCollections._
+import collection._
+import JavaConversions._
+
+// Which Users I should Be Friends with.
+println(kilroysRecs(::, 0).toMap.toList.sortWith(_._2 > _._2).take(5))
+
+```
+
+`kilroysRecs` is actually a one column matrix, so we take that column and then convert it into something we can sort. We then take the top 5 suggestions.  Keep in mind, this will return the Mahout user ID, which you would also have to reverse map back to the lastFM userID.  The lastFM userID is just another Integer, and not particularly exciting so I left that out. 
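+
+For completeness, here is a rough plain-Scala sketch of that reverse mapping. `userIdToIndex` and `scores` are made-up stand-ins for the row dictionary and the sorted results; they are not the Mahout BiDictionary API.
+
+```scala
+// Hypothetical stand-ins: lastFM userID -> matrix index, and (index, score) results.
+val userIdToIndex = Map("2" -> 0, "4" -> 1, "7" -> 2)
+val scores = List((0, 1.5), (1, 4.0), (2, 2.5))
+
+// Invert the dictionary so a matrix index leads back to the original ID.
+val indexToUserId: Map[Int, String] = userIdToIndex.map(_.swap)
+
+// Highest score first, keep the top two, then translate index -> userID.
+val topUsers = scores.sortWith(_._2 > _._2).take(2).map {
+  case (idx, score) => (indexToUserId(idx), score)
+}
+```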
+
+If you wanted to recommend artists like a normal recommendation engine, you would change the first position in all of the input matrices to be "artistID". This is left as an exercise to the user. 
+
+[Full Scala Worksheet](cco-lastfm.scala)

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/tutorials/eigenfaces/eigenfaces.png
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/eigenfaces/eigenfaces.png 
b/website/docs/tutorials/eigenfaces/eigenfaces.png
new file mode 100644
index 0000000..b388575
Binary files /dev/null and b/website/docs/tutorials/eigenfaces/eigenfaces.png 
differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/tutorials/eigenfaces/index.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/eigenfaces/index.md 
b/website/docs/tutorials/eigenfaces/index.md
index 0db0d14..08f3bb6 100644
--- a/website/docs/tutorials/eigenfaces/index.md
+++ b/website/docs/tutorials/eigenfaces/index.md
@@ -110,4 +110,19 @@ for (i <- 0 until 20){
     val image = Image(w, h, output)
     image.output(new File(s"/tmp/eigenfaces/${i}.png"))
 }
-```
\ No newline at end of file
+```
+
+### View the Eigenfaces
+
+If using Zeppelin, the following can be used to generate a fun table of the 
Eigenfaces:
+
+```python
+%python
+ 
+r = 4
+c = 5
+print '%html\n<table style="width:100%">' + "".join(["<tr>" + "".join([ '<td><img src="/tmp/eigenfaces/%i.png"></td>' % (i + j) for j in range(0, c) ]) + "</tr>" for i in range(0, r * c, c) ]) + '</table>'
+
+```
+
+![Eigenfaces](eigenfaces.png)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/tutorials/intro-cooccurrence-spark/index.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/intro-cooccurrence-spark/index.md 
b/website/docs/tutorials/intro-cooccurrence-spark/index.md
new file mode 100644
index 0000000..d7d0185
--- /dev/null
+++ b/website/docs/tutorials/intro-cooccurrence-spark/index.md
@@ -0,0 +1,446 @@
+---
+layout: algorithm
+title: Intro to Cooccurrence Recommenders with Spark
+theme:
+    name: retro-mahout
+---
+
+# Intro to Cooccurrence Recommenders with Spark
+
+Mahout provides several important building blocks for creating recommendations 
using Spark. *spark-itemsimilarity* can 
+be used to create "other people also liked these things" type recommendations 
and paired with a search engine can 
+personalize recommendations for individual users. *spark-rowsimilarity* can 
provide non-personalized content based 
+recommendations and when paired with a search engine can be used to 
personalize content based recommendations.
+
+![image](http://s6.postimg.org/r0m8bpjw1/recommender_architecture.png)
+
+This is a simplified Lambda architecture with Mahout's *spark-itemsimilarity* 
playing the batch model building role and a search engine playing the realtime 
serving role.
+
+You will create two collections, one for user history and one for item "indicators". Indicators are user interactions that lead to the wished-for interaction. So for example if you wish a user to purchase something and you collect all users' purchase interactions, *spark-itemsimilarity* will create a purchase indicator from them. But you can also use other user interactions in a cross-cooccurrence calculation, to create purchase indicators. 
+
+User history is used as a query on the item collection with its cooccurrence and cross-cooccurrence indicators (there may be several indicators). The primary interaction or action is picked to be the thing you want to recommend; other actions are believed to be correlated but may not indicate exactly the same user intent. For instance in an ecom recommender a purchase is a very good primary action, but you may also know product detail-views, or additions-to-wishlists. These can be considered secondary actions which may all be used to calculate cross-cooccurrence indicators. The user history that forms the recommendations query will contain recorded primary and secondary actions all targeted towards the correct indicator fields.
+
+## References
+
+1. A free ebook, which talks about the general idea: [Practical Machine 
Learning](https://www.mapr.com/practical-machine-learning)
+2. A slide deck, which talks about mixing actions or other indicators: 
[Creating a Unified 
Recommender](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
+3. Two blog posts: [What's New in Recommenders: part 
#1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/)
+and  [What's New in Recommenders: part 
#2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/)
+4. A post describing the log-likelihood ratio:  [Surprise and Coincidence](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)  LLR is used to reduce noise in the data while keeping the calculations O(n) complexity.
+
+Below are the command line jobs but the drivers and associated code can also 
be customized and accessed from the Scala APIs.
+
+## 1. spark-itemsimilarity
+*spark-itemsimilarity* is the Spark counterpart of the Mahout mapreduce job called *itemsimilarity*. It takes in elements of interactions, which have userID, itemID, and optionally a value. It will produce one or more indicator matrices created by comparing every user's interactions with every other user. The indicator matrix is an item x item matrix where the values are log-likelihood ratio strengths. For the legacy mapreduce version, there were several possible similarity measures but these are being deprecated in favor of LLR because in practice it performs the best.
+
+Mahout's mapreduce version of itemsimilarity takes a text file that is 
expected to have user and item IDs that conform to 
+Mahout's ID requirements--they are non-negative integers that can be viewed as 
row and column numbers in a matrix.
+
+*spark-itemsimilarity* also extends the notion of cooccurrence to 
cross-cooccurrence. In other words, the Spark version will 
+account for multi-modal interactions and create cross-cooccurrence indicator 
matrices, allowing the use of much more data in 
+creating recommendations or similar item lists. People often try to do this by 
mixing different actions and giving them weights. 
+For instance they might say an item-view is 0.2 of an item purchase. In 
practice this is often not helpful. Spark-itemsimilarity's
+cross-cooccurrence is a more principled way to handle this case. In effect it 
filters the secondary actions, keeping those that correlate with the action
+you want to recommend.
+
+
+    spark-itemsimilarity Mahout 1.0
+    Usage: spark-itemsimilarity [options]
+    
+    Input, output options
+      -i <value> | --input <value>
+            Input path, may be a filename, directory name, or comma delimited 
list of HDFS supported URIs (required)
+      -i2 <value> | --input2 <value>
+            Secondary input path for cross-similarity calculation, same 
restrictions as "--input" (optional). Default: empty.
+      -o <value> | --output <value>
+            Path for output, any local or HDFS supported URI (required)
+    
+    Algorithm control options:
+      -mppu <value> | --maxPrefs <value>
+            Max number of preferences to consider per user (optional). 
Default: 500
+      -m <value> | --maxSimilaritiesPerItem <value>
+            Limit the number of similarities per item to this number 
(optional). Default: 100
+    
+    Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity 
measure.
+    
+    Input text file schema options:
+      -id <value> | --inDelim <value>
+            Input delimiter character (optional). Default: "[,\t]"
+      -f1 <value> | --filter1 <value>
+            String (or regex) whose presence indicates a datum for the primary 
item set (optional). Default: no filter, all data is used
+      -f2 <value> | --filter2 <value>
+            String (or regex) whose presence indicates a datum for the 
secondary item set (optional). If not present no secondary dataset is collected
+      -rc <value> | --rowIDColumn <value>
+            Column number (0 based Int) containing the row ID string 
(optional). Default: 0
+      -ic <value> | --itemIDColumn <value>
+            Column number (0 based Int) containing the item ID string 
(optional). Default: 1
+      -fc <value> | --filterColumn <value>
+            Column number (0 based Int) containing the filter string 
(optional). Default: -1 for no filter
+    
+    Using all defaults the input is expected of the form: "userID<tab>itemId" 
or "userID<tab>itemID<tab>any-text..." and all rows will be used
+    
+    File discovery options:
+      -r | --recursive
+            Searched the -i path recursively for files that match 
--filenamePattern (optional), Default: false
+      -fp <value> | --filenamePattern <value>
+            Regex to match in determining input files (optional). Default: 
filename in the --input option or "^part-.*" if --input is a directory
+    
+    Output text file schema options:
+      -rd <value> | --rowKeyDelim <value>
+            Separates the rowID key from the vector values list (optional). 
Default: "\t"
+      -cd <value> | --columnIdStrengthDelim <value>
+            Separates column IDs from their values in the vector values list 
(optional). Default: ":"
+      -td <value> | --elementDelim <value>
+            Separates vector element values in the values list (optional). 
Default: " "
+      -os | --omitStrength
+            Do not write the strength to the output files (optional), Default: 
false.
+    This option is used to output indexable data for creating a search engine 
recommender.
+    
+    Default delimiters will produce output of the form: 
"itemID1<tab>itemID2:value2<space>itemID10:value10..."
+    
+    Spark config options:
+      -ma <value> | --master <value>
+            Spark Master URL (optional). Default: "local". Note that you can 
specify the number of cores to get a performance improvement, for example 
"local[4]"
+      -sem <value> | --sparkExecutorMem <value>
+            Max Java heap available as "executor memory" on each node 
(optional). Default: 4g
+      -rs <value> | --randomSeed <value>
+            
+      -h | --help
+            prints this usage text
+
+This looks daunting, but the defaults are simple and fairly sane: they accept 
exactly the same input as the legacy code, and the options make the job quite 
flexible. It allows the user to point to a single text file, a directory full 
of files, or a tree of directories to be traversed recursively. The files 
included can be specified with either a regex-style pattern or a filename. The 
schema for the file is defined by column numbers, which map to the important 
bits of data including IDs and values. The files can even contain filters, 
which allow unneeded rows to be discarded or used for cross-cooccurrence 
calculations.
+
+See ItemSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. 
+
+### Defaults in the _**spark-itemsimilarity**_ CLI
+
+If all defaults are used the input can be as simple as:
+
+    userID1,itemID1
+    userID2,itemID2
+    ...
+
+With the command line:
+
+
+    bash$ mahout spark-itemsimilarity --input in-file --output out-dir
+
+
+This will use the "local" Spark context and will output the standard text 
version of a DRM:
+
+    itemID1<tab>itemID2:value2<space>itemID10:value10...
+
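+Each line of this output can be read back with a few lines of code. The 
following is a minimal Python sketch, not part of Mahout: the function name 
`parse_drm_line` is hypothetical, and the delimiters are assumed to be the 
defaults listed in the usage text above.
+
```python
def parse_drm_line(line, row_delim="\t", element_delim=" ", strength_delim=":"):
    """Split one text-DRM line into (row_id, [(column_id, strength), ...]).

    Illustrative helper only; the delimiter defaults mirror the documented
    defaults of --rowKeyDelim, --elementDelim and --columnIdStrengthDelim.
    """
    row_id, _, rest = line.rstrip("\n").partition(row_delim)
    pairs = []
    for element in rest.split(element_delim):
        if not element:
            continue  # nothing after the row ID, or a repeated delimiter
        column_id, sep, strength = element.rpartition(strength_delim)
        if sep:
            pairs.append((column_id, float(strength)))
        else:
            # no strength delimiter, as when --omitStrength was used
            pairs.append((element, None))
    return row_id, pairs


row_id, similar = parse_drm_line("itemID1\titemID2:0.5 itemID10:0.25")
```
+
+A line that holds only a row ID yields an empty list, and IDs written 
without strengths come back paired with `None`.
+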
+### <a name="multiple-actions">How To Use Multiple User Actions</a>
+
+Often we record various actions the user takes for later analytics. These can 
now be used to make recommendations. 
+The idea of a recommender is to recommend the action you want the user to 
make. For an ecom app this might be 
+a purchase action. It is usually not a good idea to just treat other actions 
the same as the action you want to recommend. 
+For instance a view of an item does not indicate the same intent as a purchase 
and if you just mixed the two together you 
+might even make worse recommendations. It is tempting though since there are 
so many more views than purchases. With *spark-itemsimilarity*
+we can now use both actions. Mahout will use cross-action cooccurrence 
analysis to limit the views to ones that do predict purchases.
+We do this by treating the primary action (purchase) as data for the indicator 
matrix and use the secondary action (view) 
+to calculate the cross-cooccurrence indicator matrix.  
+
+*spark-itemsimilarity* can read separate actions from separate files or from a 
mixed action log by filtering certain lines. For a mixed 
+action log of the form:
+
+    u1,purchase,iphone
+    u1,purchase,ipad
+    u2,purchase,nexus
+    u2,purchase,galaxy
+    u3,purchase,surface
+    u4,purchase,iphone
+    u4,purchase,galaxy
+    u1,view,iphone
+    u1,view,ipad
+    u1,view,nexus
+    u1,view,galaxy
+    u2,view,iphone
+    u2,view,ipad
+    u2,view,nexus
+    u2,view,galaxy
+    u3,view,surface
+    u3,view,nexus
+    u4,view,iphone
+    u4,view,ipad
+    u4,view,galaxy
+
+### Command Line
+
+
+Use the following options. `--input` is where to look for data, `--output` is 
the root directory for output, and `--master` is the URL of the Spark master 
server. `--filter1` and `--filter2` are the words that flag input for the 
primary and secondary actions. `--itemIDColumn`, `--rowIDColumn`, and 
`--filterColumn` (named as in the usage text above) give the columns that hold 
the item ID, the user ID, and the filter word:
+
+    bash$ mahout spark-itemsimilarity \
+        --input in-file \
+        --output out-path \
+        --master masterUrl \
+        --filter1 purchase \
+        --filter2 view \
+        --itemIDColumn 2 \
+        --rowIDColumn 0 \
+        --filterColumn 1
+
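+The filter and column options amount to a simple row-selection rule. The 
following Python sketch illustrates that rule as described in the usage text; 
it is an approximation for explanation, not Mahout's implementation, and 
`split_actions` is a hypothetical name.
+
```python
import re


def split_actions(lines, filter1, filter2, row_col, item_col, filter_col,
                  delim=r"[,\t]"):
    """Return (primary, secondary) lists of (user_id, item_id) pairs.

    A row goes to the primary set when filter1 is found in the filter
    column, to the secondary set when filter2 is found there, and is
    dropped otherwise.
    """
    primary, secondary = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = re.split(delim, line)
        pair = (fields[row_col], fields[item_col])
        marker = fields[filter_col]
        if re.search(filter1, marker):
            primary.append(pair)
        elif re.search(filter2, marker):
            secondary.append(pair)
    return primary, secondary


log = [
    "u1,purchase,iphone",
    "u1,view,nexus",
    "u3,view,surface",
    "u4,purchase,galaxy",
]
primary, secondary = split_actions(
    log, "purchase", "view", row_col=0, item_col=2, filter_col=1)
```
+
+Here `primary` holds the purchase pairs and `secondary` the view pairs, which 
is the split the job needs for the indicator and cross-cooccurrence matrices.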
+
+
+### Output
+
+The output of the job will be the standard text version of two Mahout DRMs. 
This is a case where we are calculating 
+cross-cooccurrence, so a primary indicator matrix and a cross-cooccurrence 
indicator matrix will be created:
+
+    out-path
+      |-- similarity-matrix - TDF part files
+      \-- cross-similarity-matrix - TDF part-files
+
+The similarity-matrix will contain the lines:
+
+    galaxy\tnexus:1.7260924347106847
+    ipad\tiphone:1.7260924347106847
+    nexus\tgalaxy:1.7260924347106847
+    iphone\tipad:1.7260924347106847
+    surface
+
+The cross-similarity-matrix will contain:
+
+    iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 
ipad:1.7260924347106847 galaxy:1.7260924347106847
+    ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 
ipad:0.6795961471815897 galaxy:0.6795961471815897
+    nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 
ipad:0.6795961471815897 galaxy:0.6795961471815897
+    galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 
ipad:1.7260924347106847 galaxy:1.7260924347106847
+    surface\tsurface:4.498681156950466 nexus:0.6795961471815897
+
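+The strengths in these matrices can be checked by hand. The sketch below is a 
plain Python version of the log-likelihood ratio for a 2x2 contingency table, 
in the entropy form described in the Surprise and Coincidence post. It is 
written for illustration and is assumed, not confirmed, to match what the job 
computes internally. The counts come from the four users in the example log.
+
```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) is 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of user counts.

    k11: users with both A and B     k12: users with A but not B
    k21: users with B but not A      k22: users with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


# galaxy vs. nexus purchases: u2 bought both, u4 bought only galaxy,
# nobody bought only nexus, u1 and u3 bought neither.
galaxy_nexus = llr(1, 1, 0, 2)

# surface purchase vs. surface view: only u3 did either, and did both.
surface_surface = llr(1, 0, 0, 3)

# ipad purchase vs. nexus view: u1 did both, u2 and u3 only viewed nexus.
ipad_nexus = llr(1, 0, 2, 1)
```
+
+Under these assumptions the three values come out at about 1.726, 4.499 and 
0.680, in line with the corresponding entries listed above.
+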
+**Note:** You can run this multiple times to use more than two actions or you 
can use the underlying 
+SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any 
number of cross-cooccurrence indicators.
+
+### Log File Input
+ 
+A common method of storing data is in log files. If they are written using 
some delimiter they can be consumed directly by spark-itemsimilarity. For 
instance input of the form:
+
+    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad
+    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu3\tpurchase\trandom text\tsurface
+    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tipad
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu3\tview\trandom text\tsurface
+    2014-06-23 14:46:53.115\tu3\tview\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad
+    2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy    
+
+Can be parsed with the following CLI and run on the cluster producing the same 
output as the above example.
+
+    bash$ mahout spark-itemsimilarity \
+        --input in-file \
+        --output out-path \
+        --master spark://sparkmaster:4044 \
+        --filter1 purchase \
+        --filter2 view \
+        --inDelim "\t" \
+        --itemIDColumn 4 \
+        --rowIDColumn 1 \
+        --filterColumn 2
+
+## 2. spark-rowsimilarity
+
+*spark-rowsimilarity* is the companion to *spark-itemsimilarity*. The primary 
difference is that it takes a text file version of 
+a matrix of sparse vectors with optional application-specific IDs, and it 
finds similar rows rather than items (columns). Its use is
+not limited to collaborative filtering. The input is in text-delimited form, 
using three delimiters. By 
+default it reads 
(rowID&lt;tab>columnID1:strength1&lt;space>columnID2:strength2...). Since this 
job only supports LLR similarity,
+ which does not use the input strengths, they may be omitted in the input. It 
writes 
+(rowID&lt;tab>rowID1:strength1&lt;space>rowID2:strength2...). 
+The output is sorted by strength descending. The output can be interpreted as 
a row ID from the primary input followed 
+by a list of the most similar rows.
+
+The command line interface is:
+
+    spark-rowsimilarity Mahout 1.0
+    Usage: spark-rowsimilarity [options]
+    
+    Input, output options
+      -i <value> | --input <value>
+            Input path, may be a filename, directory name, or comma delimited 
list of HDFS supported URIs (required)
+      -o <value> | --output <value>
+            Path for output, any local or HDFS supported URI (required)
+    
+    Algorithm control options:
+      -mo <value> | --maxObservations <value>
+            Max number of observations to consider per row (optional). 
Default: 500
+      -m <value> | --maxSimilaritiesPerRow <value>
+            Limit the number of similarities per item to this number 
(optional). Default: 100
+    
+    Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity 
measure.
+    
+    Output text file schema options:
+      -rd <value> | --rowKeyDelim <value>
+            Separates the rowID key from the vector values list (optional). 
Default: "\t"
+      -cd <value> | --columnIdStrengthDelim <value>
+            Separates column IDs from their values in the vector values list 
(optional). Default: ":"
+      -td <value> | --elementDelim <value>
+            Separates vector element values in the values list (optional). 
Default: " "
+      -os | --omitStrength
+            Do not write the strength to the output files (optional), Default: 
false.
+    This option is used to output indexable data for creating a search engine 
recommender.
+    
+    Default delimiters will produce output of the form: 
"itemID1<tab>itemID2:value2<space>itemID10:value10..."
+    
+    File discovery options:
+      -r | --recursive
+            Searched the -i path recursively for files that match 
--filenamePattern (optional), Default: false
+      -fp <value> | --filenamePattern <value>
+            Regex to match in determining input files (optional). Default: 
filename in the --input option or "^part-.*" if --input is a directory
+    
+    Spark config options:
+      -ma <value> | --master <value>
+            Spark Master URL (optional). Default: "local". Note that you can 
specify the number of cores to get a performance improvement, for example 
"local[4]"
+      -sem <value> | --sparkExecutorMem <value>
+            Max Java heap available as "executor memory" on each node 
(optional). Default: 4g
+      -rs <value> | --randomSeed <value>
+            
+      -h | --help
+            prints this usage text
+
+See RowSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. 
+
+# 3. Using *spark-rowsimilarity* with Text Data
+
+Another use case for *spark-rowsimilarity* is in finding similar textual 
content, for instance finding which other blog posts have tags similar to 
+those of a given post.
+ In this case the columns are tags and 
the rows are posts. Since LLR is 
+the only similarity method supported, this is not the optimal way to determine 
general "bag-of-words" document similarity. 
+LLR is used more as a quality filter than as a similarity measure. However, 
*spark-rowsimilarity* will produce 
+lists of similar docs for every doc if the input is docs with lists of terms. 
The Apache [Lucene](http://lucene.apache.org) project provides several methods 
of [analyzing and 
tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description)
 documents.
+
+# <a name="unified-recommender">4. Creating a Multimodal Recommender</a>
+
+Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can 
build a multimodal cooccurrence and content based
+ recommender that can be used in both or either mode depending on indicators 
available and the history available at 
+runtime for a user. Some slides describing this method can be found 
[here](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/).
+
+## Requirements
+
+1. Mahout SNAPSHOT-1.0 or later
+2. Hadoop
+3. Spark, the correct version for your version of Mahout and Hadoop
+4. A search engine like Solr or Elasticsearch
+
+## Indicators
+
+Indicators come in 3 types:
+
+1. **Cooccurrence**: calculated with *spark-itemsimilarity* from user actions
+2. **Content**: calculated from item metadata or content using 
*spark-rowsimilarity*
+3. **Intrinsic**: assigned to items as metadata. Can be anything that 
describes the item.
+
+The query for recommendations will be a mix of values meant to match one of 
your indicators. The query can be constructed 
+from user history and values derived from context (category being viewed for 
instance) or special precalculated data 
+(popularity rank for instance). This blending of indicators allows for 
creating many flavors of recommendations to fit 
+a very wide variety of circumstances.
+
+With the right mix of indicators developers can construct a single query that 
works for completely new items and new users 
+while working well for items with lots of interactions and users with many 
recorded actions. In other words by adding in content and intrinsic 
+indicators developers can create a solution for the "cold-start" problem that 
gracefully improves with more user history
+and as items have more interactions. It is also possible to create a 
completely content-based recommender that personalizes 
+recommendations.
+
+## Example with 3 Indicators
+
+You will need to decide how you store user action data so they can be 
processed by the item and row similarity jobs and 
+this is most easily done by using text files as described above. The data that 
is processed by these jobs is considered the 
+training data. You will need some amount of user history in your recs query. 
It is typical to use the most recent user history, 
+but it need not be exactly what is in the training set, which may include a 
greater volume of historical data. Keeping the user 
+history for query purposes could be done with a database by storing it in a 
users table. In the example above the two 
+collaborative filtering actions are "purchase" and "view", but let's also add 
tags (taken from catalog categories or other 
+descriptive metadata). 
+
+We will need to create 1 cooccurrence indicator from the primary action 
(purchase), 1 cross-action cooccurrence indicator 
+from the secondary action (view), 
+and 1 content indicator (tags). We'll have to run *spark-itemsimilarity* once 
and *spark-rowsimilarity* once.
+
+We have described how to create the collaborative filtering indicators for 
purchase and view (the [How to use Multiple User 
+Actions](#multiple-actions) section) but tags will be a slightly different 
process. We want to use the fact that 
+certain items have tags similar to the ones associated with a user's 
purchases. This is not a collaborative filtering indicator 
+but rather a "content" or "metadata" type indicator since you are not using 
other users' history, only the 
+individual that you are making recs for. This means that this method will make 
recommendations for items that have 
+no collaborative filtering data, as happens with new items in a catalog. New 
items may have tags assigned but no one
+ has purchased or viewed them yet. In the final query we will mix all 3 
indicators.
+
+## Content Indicator
+
+To create a content-indicator we'll make use of the fact that the user has 
purchased items with certain tags. We want to find 
+items with the most similar tags. Notice that other users' behavior is not 
considered--only other items' tags. This defines a 
+content or metadata indicator. Such indicators are used when you want to find 
items that are similar to other items by using their 
+content or metadata, not by which users interacted with them.
+
+**Note**: It may be advisable to treat tags as cross-cooccurrence indicators 
but for the sake of an example they are treated here as content only.
+
+For this we need input of the form:
+
+    itemID<tab>list-of-tags
+    ...
+
+The full collection will look like the tags column from a catalog DB. For our 
ecom example it might be:
+
+    3459860b<tab>men long-sleeve chambray clothing casual
+    9446577d<tab>women tops chambray clothing casual
+    ...
+
+We'll use *spark-rowsimilarity* because we are looking for similar rows, which 
encode items in this case. As with the 
+collaborative filtering indicators we use the --omitStrength option. The 
strengths created are 
+probabilistic log-likelihood ratios and so are used to filter unimportant 
similarities. Once the filtering or downsampling 
+is finished we no longer need the strengths. We will get an indicator matrix 
of the form:
+
+    itemID<tab>list-of-item IDs
+    ...
+
+This is a content indicator since it has found other items with similar 
content or metadata.
+
+    3459860b<tab>3459860b 3459860b 6749860c 5959860a 3434860a 3477860a
+    9446577d<tab>9446577d 9496577d 0943577d 8346577d 9442277d 9446577e
+    ...  
+    
+We now have three indicators, two collaborative filtering type and one content 
type.
+
+## Multimodal Recommender Query
+
+The actual form of the query for recommendations will vary depending on your 
search engine but the intent is the same. For a given user, map their history 
of an action or content to the correct indicator field and perform an OR'd 
query. 
+
+We have 3 indicators, which are indexed by the search engine into 3 fields; 
we'll call them "purchase", "view", and "tags". 
+We take the user's history that corresponds to each indicator and create a 
query of the form:
+
+    Query:
+      field: purchase; q:user's-purchase-history
+      field: view; q:user's-view-history
+      field: tags; q:user's-tags-associated-with-purchases
+      
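+How this looks in practice depends on the search engine. As one possible 
shape, the Python sketch below builds an Elasticsearch-style `bool` query with 
one `should` clause per indicator field. The field names follow the text; the 
function name `build_query`, the item IDs, and the boost value are 
hypothetical, and the index is assumed to hold each indicator as a text field.
+
```python
def build_query(history_by_field, boosts=None):
    """Build an OR'd query body from per-indicator user history.

    history_by_field maps an indicator field name to the list of IDs or
    terms from the user's history; fields with no history are skipped.
    boosts optionally maps a field name to a boost factor (default 1.0).
    """
    boosts = boosts or {}
    should = []
    for field, terms in history_by_field.items():
        if not terms:
            continue
        should.append({
            "match": {
                field: {
                    "query": " ".join(terms),
                    "boost": boosts.get(field, 1.0),
                }
            }
        })
    return {"query": {"bool": {"should": should}}}


# Hypothetical history for one user; a new user might have only tags.
query = build_query(
    {
        "purchase": ["iphone", "ipad"],
        "view": ["nexus", "galaxy"],
        "tags": ["casual", "chambray"],
    },
    boosts={"purchase": 2.0},
)
```
+
+Because empty histories are skipped, the same function still returns a usable 
query for a user with no purchases or views yet.
+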
+The query will result in an ordered list of items recommended for purchase but 
skewed towards items with similar tags to 
+the ones the user has already purchased. 
+
+This is only an example and not necessarily the optimal way to create recs. It 
illustrates how business decisions can be 
+translated into recommendations. This technique can be used to skew 
recommendations towards intrinsic indicators also. 
+For instance you may want to put personalized popular item recs in a special 
place in the UI. Create a popularity indicator 
+by tagging items with some category of popularity (hot, warm, cold for 
instance) then
+index that as a new indicator field and include the corresponding value in a 
query 
+on the popularity field. If we use the ecom example but use the query to get 
"hot" recommendations it might look like this:
+
+    Query:
+      field: purchase; q:user's-purchase-history
+      field: view; q:user's-view-history
+      field: popularity; q:"hot"
+
+This will return recommendations favoring ones that have the intrinsic 
indicator "hot".
+
+## Notes
+1. Use as much user action history as you can gather. Choose a primary action 
that is closest to what you want to recommend and the others will be used to 
create cross-cooccurrence indicators. Using more data in this fashion will 
almost always produce better recommendations.
+2. Content can be used where there is no recorded user behavior or when items 
change too quickly to get much interaction history. Content indicators can be 
used alone or mixed with other indicators.
+3. Most search engines support "boost" factors so you can favor one or more 
indicators. In the example query, if you want tags to only have a small effect 
you could boost the CF indicators.
+4. In the examples we have used space delimited strings for lists of IDs in 
indicators and in queries. It may be better to use arrays of strings if your 
storage system and search engine support them. For instance Solr allows 
multi-valued fields, which correspond to arrays.

http://git-wip-us.apache.org/repos/asf/mahout/blob/b582dc52/website/docs/tutorials/playing-with-samsara-flink-batch.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/playing-with-samsara-flink-batch.md 
b/website/docs/tutorials/playing-with-samsara-flink-batch.md
index 4bbcd33..752f01c 100644
--- a/website/docs/tutorials/playing-with-samsara-flink-batch.md
+++ b/website/docs/tutorials/playing-with-samsara-flink-batch.md
@@ -1,5 +1,5 @@
 ---
-layout: default
+layout: tutorial
 title: 
 theme:
    name: retro-mahout
