Modified: websites/staging/mahout/trunk/content/general/reference-reading.html
==============================================================================
--- websites/staging/mahout/trunk/content/general/reference-reading.html 
(original)
+++ websites/staging/mahout/trunk/content/general/reference-reading.html Fri 
Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,10 +264,21 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="reference-reading">Reference Reading</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="reference-reading">Reference Reading<a class="headerlink" 
href="#reference-reading" title="Permanent link">&para;</a></h1>
 <p>Here we provide references to books and courses about data analysis in 
general, which might also be helpful in the context of Mahout.</p>
 <p><a name="ReferenceReading-GeneralBackgroundMaterials"></a></p>
-<h2 id="general-background-materials">General Background Materials</h2>
+<h2 id="general-background-materials">General Background Materials<a 
class="headerlink" href="#general-background-materials" title="Permanent 
link">&para;</a></h2>
 <p>Don't be overwhelmed by all the maths; you can do a lot in Mahout with some
 basic knowledge. The books will help you understand your
 data better, and ask better questions both of Mahout's APIs, and also of
@@ -296,20 +308,20 @@ Carroll.</li>
 <li><a 
href="http://www.amazon.com/Understanding-Computational-Bayesian-Statistics-Wiley/dp/0470046090">Understanding
 Computational Bayesian Statistics</a>, Bolstad</li>
 <li><a href="http://www.stat.columbia.edu/~gelman/book/">Bayesian Data 
Analysis, Gelman et al.</a></li>
 </ul>
-<h2 
id="for-statistics-related-to-machine-learning-these-are-particularly-helpful">For
 statistics related to machine learning, these are particularly helpful:</h2>
+<h2 
id="for-statistics-related-to-machine-learning-these-are-particularly-helpful">For
 statistics related to machine learning, these are particularly helpful:<a 
class="headerlink" 
href="#for-statistics-related-to-machine-learning-these-are-particularly-helpful"
 title="Permanent link">&para;</a></h2>
 <ul>
 <li><a 
href="http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm">Pattern
 Recognition and Machine Learning by Chris Bishop</a></li>
 <li><a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/">Elements of 
Statistical Learning</a> by Trevor Hastie, Robert Tibshirani, Jerome Friedman 
</li>
 <li><a 
href="http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm">http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm</a></li>
 </ul>
-<h2 id="for-matrix-computationsdecompositionfactorization-etc">For matrix 
computations/decomposition/factorization etc.:</h2>
+<h2 id="for-matrix-computationsdecompositionfactorization-etc">For matrix 
computations/decomposition/factorization etc.:<a class="headerlink" 
href="#for-matrix-computationsdecompositionfactorization-etc" title="Permanent 
link">&para;</a></h2>
 <ul>
 <li>Peter V. O'Neil <a 
href="http://www.amazon.com/Introduction-Linear-Algebra-Theory-Applications/dp/053400606X">Introduction
 to Linear Algebra</a>, a great book for beginners (with some knowledge of 
calculus). It is not comprehensive, but it is a good place to start; the 
author begins by explaining the concepts in terms of vector spaces, 
which I found to be a more natural way of explaining them.</li>
 <li>David S. Watkins <a 
href="http://www.amazon.com/Fundamentals-Matrix-Computations-Applied-Mathematics/dp/0470528338/">Fundamentals
 of Matrix Computations</a></li>
 <li><a 
href="http://www.amazon.com/Computations-Hopkins-Studies-Mathematical-Sciences/dp/0801854148/ref=sr_1_2?s=books&amp;ie=UTF8&amp;qid=1394307676&amp;sr=1-2&amp;keywords=golub+van+loan">Matrix
 Computations</a> is the classic text for numerical linear algebra. Can't go 
wrong with it - great for researchers.  </li>
 <li>Nick Trefethen's <a 
href="http://people.maths.ox.ac.uk/trefethen/books.html">Numerical Linear 
Algebra</a>.  It's a bit more approachable for practitioners. It has many 
chapters on SVD, and even chapters on Lanczos.</li>
 </ul>
-<h2 id="books-specifically-on-r">Books specifically on R:</h2>
+<h2 id="books-specifically-on-r">Books specifically on R:<a class="headerlink" 
href="#books-specifically-on-r" title="Permanent link">&para;</a></h2>
 <ul>
 <li>Learning about R is a difficult thing. The best introduction is in MASS <a 
href="http://www.stats.ox.ac.uk/pub/MASS4/">http://www.stats.ox.ac.uk/pub/MASS4/</a></li>
 <li><a href="http://www.r-tutor.com/r-introduction">R Tutor</a></li>

Modified: websites/staging/mahout/trunk/content/general/release-notes.html
==============================================================================
--- websites/staging/mahout/trunk/content/general/release-notes.html (original)
+++ websites/staging/mahout/trunk/content/general/release-notes.html Fri Apr  8 
18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>

Modified: websites/staging/mahout/trunk/content/general/who-we-are.html
==============================================================================
--- websites/staging/mahout/trunk/content/general/who-we-are.html (original)
+++ websites/staging/mahout/trunk/content/general/who-we-are.html Fri Apr  8 
18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>

Added: websites/staging/mahout/trunk/content/images/flink_squirrel_100_color.png
==============================================================================
Binary file - no diff available.

Propchange: 
websites/staging/mahout/trunk/content/images/flink_squirrel_100_color.png
------------------------------------------------------------------------------
    svn:mime-type = image/png

Modified: websites/staging/mahout/trunk/content/index.html
==============================================================================
--- websites/staging/mahout/trunk/content/index.html (original)
+++ websites/staging/mahout/trunk/content/index.html Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>

Modified: websites/staging/mahout/trunk/content/overview.html
==============================================================================
--- websites/staging/mahout/trunk/content/overview.html (original)
+++ websites/staging/mahout/trunk/content/overview.html Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,8 +264,19 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <p><a name="Overview-OverviewofMahout"></a></p>
-<h1 id="overview-of-mahout">Overview of Mahout</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="Overview-OverviewofMahout"></a></p>
+<h1 id="overview-of-mahout">Overview of Mahout<a class="headerlink" 
href="#overview-of-mahout" title="Permanent link">&para;</a></h1>
 <p>Mahout's goal is to build scalable machine learning libraries. By
 scalable we mean: 
 <em> Scalable to reasonably large data sets. Our core algorithms for

Modified: websites/staging/mahout/trunk/content/users/algorithms/d-qr.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/algorithms/d-qr.html (original)
+++ websites/staging/mahout/trunk/content/users/algorithms/d-qr.html Fri Apr  8 
18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,12 +264,23 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="distributed-cholesky-qr">Distributed Cholesky QR</h1>
-<h2 id="intro">Intro</h2>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="distributed-cholesky-qr">Distributed Cholesky QR<a class="headerlink" 
href="#distributed-cholesky-qr" title="Permanent link">&para;</a></h1>
+<h2 id="intro">Intro<a class="headerlink" href="#intro" title="Permanent 
link">&para;</a></h2>
 <p>Mahout has a distributed implementation of QR decomposition for tall thin 
matrices [1].</p>
-<h2 id="algorithm">Algorithm</h2>
+<h2 id="algorithm">Algorithm<a class="headerlink" href="#algorithm" 
title="Permanent link">&para;</a></h2>
 <p>For the classic QR decomposition of the form 
<code>\(\mathbf{A}=\mathbf{QR},\mathbf{A}\in\mathbb{R}^{m\times n}\)</code> a 
distributed version is fairly easily achieved if <code>\(\mathbf{A}\)</code> is 
tall and thin such that <code>\(\mathbf{A}^{\top}\mathbf{A}\)</code> fits in 
memory, i.e. <em>m</em> is large but <em>n</em> &lt; ~5000. Under such 
circumstances, only <code>\(\mathbf{A}\)</code> and <code>\(\mathbf{Q}\)</code> 
are distributed matrices and <code>\(\mathbf{A^{\top}A}\)</code> and 
<code>\(\mathbf{R}\)</code> are in-core products. We just compute the in-core 
version of the Cholesky decomposition in the form of 
<code>\(\mathbf{LL}^{\top}= \mathbf{A}^{\top}\mathbf{A}\)</code>.  After that 
we take <code>\(\mathbf{R}= \mathbf{L}^{\top}\)</code> and 
<code>\(\mathbf{Q}=\mathbf{A}\left(\mathbf{L}^{\top}\right)^{-1}\)</code>.  The 
latter is easily achieved by multiplying each vertical block of 
<code>\(\mathbf{A}\)</code> by 
<code>\(\left(\mathbf{L}^{\top}\right)^{-1}\)</code>.  
(There is no actual matrix inversion happening.) </p>
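 <p>The steps above can be sketched in-core as follows. This is a minimal 
illustration in plain Python, not Mahout's Scala implementation; the function 
names (<code>gram</code>, <code>cholesky</code>, <code>solve_row</code>, 
<code>cholesky_qr</code>) are hypothetical. In a distributed setting only the 
per-row work in <code>gram</code> and <code>solve_row</code> would be spread 
across blocks of <code>\(\mathbf{A}\)</code>.</p>

```python
import math


def gram(a):
    """Return A^T A (n x n) for a row-major m x n matrix given as lists."""
    n = len(a[0])
    g = [[0.0] * n for _ in range(n)]
    for row in a:  # each row contributes an outer product row^T row
        for i in range(n):
            for j in range(n):
                g[i][j] += row[i] * row[j]
    return g


def cholesky(g):
    """Return lower-triangular L with L L^T = G.

    Raises ValueError if G is not positive definite (A is rank deficient).
    """
    n = len(g)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = g[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is rank deficient")
                l[i][j] = math.sqrt(s)
            else:
                l[i][j] = s / l[j][j]
    return l


def solve_row(l, a_row):
    """Solve q L^T = a_row for q by forward substitution (no explicit inverse)."""
    n = len(l)
    q = [0.0] * n
    for i in range(n):
        q[i] = (a_row[i] - sum(l[i][k] * q[k] for k in range(i))) / l[i][i]
    return q


def cholesky_qr(a):
    """Return (Q, R) with R = L^T and Q = A (L^T)^{-1}, where L L^T = A^T A."""
    l = cholesky(gram(a))
    n = len(l)
    r = [[l[j][i] for j in range(n)] for i in range(n)]  # transpose of L
    q = [solve_row(l, row) for row in a]  # independent per row of A
    return q, r
```

 <p>Because each row of <code>\(\mathbf{Q}\)</code> depends only on the 
matching row of <code>\(\mathbf{A}\)</code> and the small in-core factor, the 
last step needs no communication between row blocks.</p>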
-<h2 id="implementation">Implementation</h2>
+<h2 id="implementation">Implementation<a class="headerlink" 
href="#implementation" title="Permanent link">&para;</a></h2>
 <p>Mahout <code>dqrThin(...)</code> is implemented in the mahout 
<code>math-scala</code> algebraic optimizer which translates Mahout's R-like 
linear algebra operators into a physical plan for both Spark and H2O 
distributed engines.</p>
 <div class="codehilite"><pre><span class="n">def</span> <span 
class="n">dqrThin</span><span class="p">[</span><span class="n">K</span><span 
class="p">:</span> <span class="n">ClassTag</span><span 
class="p">](</span><span class="n">drmA</span><span class="p">:</span> <span 
class="n">DrmLike</span><span class="p">[</span><span class="n">K</span><span 
class="p">],</span> <span class="n">checkRankDeficiency</span><span 
class="p">:</span> <span class="n">Boolean</span> <span class="p">=</span> 
<span class="n">true</span><span class="p">):</span> <span 
class="p">(</span><span class="n">DrmLike</span><span class="p">[</span><span 
class="n">K</span><span class="p">],</span> <span class="n">Matrix</span><span 
class="p">)</span> <span class="p">=</span> <span class="p">{</span>        
     <span class="k">if</span> <span class="p">(</span><span 
class="n">drmA</span><span class="p">.</span><span class="n">ncol</span> <span 
class="o">&gt;</span> 5000<span class="p">)</span>
@@ -289,7 +301,7 @@
 </pre></div>
 
 
-<h2 id="usage">Usage</h2>
+<h2 id="usage">Usage<a class="headerlink" href="#usage" title="Permanent 
link">&para;</a></h2>
 <p>The Scala <code>dqrThin(...)</code> method can easily be called in any 
Spark or H2O application built with the <code>math-scala</code> library and the 
corresponding <code>Spark</code> or <code>H2O</code> engine module as 
follows:</p>
 <div class="codehilite"><pre><span class="n">import</span> <span 
class="n">org</span><span class="p">.</span><span class="n">apache</span><span 
class="p">.</span><span class="n">mahout</span><span class="p">.</span><span 
class="n">math</span><span class="p">.</span><span class="n">_</span>
 <span class="n">import</span> <span class="n">decompositions</span><span 
class="p">.</span><span class="n">_</span>
@@ -299,7 +311,7 @@
 </pre></div>
 
 
-<h2 id="references">References</h2>
+<h2 id="references">References<a class="headerlink" href="#references" 
title="Permanent link">&para;</a></h2>
 <p>[1]: <a 
href="http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf">Mahout
 Scala and Mahout Spark Bindings for Linear Algebra Subroutines</a></p>
 <p>[2]: <a 
href="http://mahout.apache.org/users/sparkbindings/home.html">Mahout Spark and 
Scala Bindings</a></p>
    </div>

Modified: websites/staging/mahout/trunk/content/users/algorithms/d-ssvd.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/algorithms/d-ssvd.html 
(original)
+++ websites/staging/mahout/trunk/content/users/algorithms/d-ssvd.html Fri Apr  
8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,10 +264,21 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="distributed-stochastic-singular-value-decomposition">Distributed 
Stochastic Singular Value Decomposition</h1>
-<h2 id="intro">Intro</h2>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="distributed-stochastic-singular-value-decomposition">Distributed 
Stochastic Singular Value Decomposition<a class="headerlink" 
href="#distributed-stochastic-singular-value-decomposition" title="Permanent 
link">&para;</a></h1>
+<h2 id="intro">Intro<a class="headerlink" href="#intro" title="Permanent 
link">&para;</a></h2>
 <p>Mahout has a distributed implementation of Stochastic Singular Value 
Decomposition [1] using the parallelization strategy comprehensively defined in 
Nathan Halko's dissertation <a 
href="http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf">"Randomized
 methods for computing low-rank approximations of matrices"</a> [2].</p>
-<h2 id="modified-ssvd-algorithm">Modified SSVD Algorithm</h2>
+<h2 id="modified-ssvd-algorithm">Modified SSVD Algorithm<a class="headerlink" 
href="#modified-ssvd-algorithm" title="Permanent link">&para;</a></h2>
 <p>Given an <code>\(m\times n\)</code>
 matrix <code>\(\mathbf{A}\)</code>, a target rank 
<code>\(k\in\mathbb{N}_{1}\)</code>
 , an oversampling parameter <code>\(p\in\mathbb{N}_{1}\)</code>, 
@@ -312,7 +324,7 @@ SVD <code>\(\mathbf{A\approx U}\boldsymb
 Another way is 
<code>\(\mathbf{V}=\mathbf{A}^{\top}\mathbf{U}\boldsymbol{\Sigma}^{-1}\)</code>.</p>
 </li>
 </ol>
-<h2 id="implementation">Implementation</h2>
+<h2 id="implementation">Implementation<a class="headerlink" 
href="#implementation" title="Permanent link">&para;</a></h2>
 <p>Mahout <code>dssvd(...)</code> is implemented in the mahout 
<code>math-scala</code> algebraic optimizer which translates Mahout's R-like 
linear algebra operators into a physical plan for both Spark and H2O 
distributed engines.</p>
 <div class="codehilite"><pre>def dssvd<span class="p">[</span>K: ClassTag<span 
class="p">](</span>drmA: DrmLike<span class="p">[</span>K<span 
class="p">],</span> k: Int<span class="p">,</span> p: Int <span 
class="o">=</span> <span class="m">15</span><span class="p">,</span> q: Int 
<span class="o">=</span> <span class="m">0</span><span class="p">)</span>:
     <span class="p">(</span>DrmLike<span class="p">[</span>K<span 
class="p">],</span> DrmLike<span class="p">[</span>Int<span class="p">],</span> 
Vector<span class="p">)</span> <span class="o">=</span> <span class="p">{</span>
@@ -374,7 +386,7 @@ Another way is <code>\(\mathbf{V}=\mathb
 
 
 <p>Note: As a side effect of checkpointing, U and V values are returned as 
logical operators (i.e. they are neither checkpointed nor computed).  Therefore 
there is no physical work actually done to compute <code>\(\mathbf{U}\)</code> 
or <code>\(\mathbf{V}\)</code> until they are used in a subsequent 
expression.</p>
-<h2 id="usage">Usage</h2>
+<h2 id="usage">Usage<a class="headerlink" href="#usage" title="Permanent 
link">&para;</a></h2>
 <p>The Scala <code>dssvd(...)</code> method can easily be called in any Spark 
or H2O application built with the <code>math-scala</code> library and the 
corresponding <code>Spark</code> or <code>H2O</code> engine module as 
follows:</p>
 <div class="codehilite"><pre><span class="n">import</span> <span 
class="n">org</span><span class="p">.</span><span class="n">apache</span><span 
class="p">.</span><span class="n">mahout</span><span class="p">.</span><span 
class="n">math</span><span class="p">.</span><span class="n">_</span>
 <span class="n">import</span> <span class="n">decompositions</span><span 
class="p">.</span><span class="n">_</span>
@@ -385,7 +397,7 @@ Another way is <code>\(\mathbf{V}=\mathb
 </pre></div>
 
 
-<h2 id="references">References</h2>
+<h2 id="references">References<a class="headerlink" href="#references" 
title="Permanent link">&para;</a></h2>
 <p>[1]: <a 
href="http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf">Mahout
 Scala and Mahout Spark Bindings for Linear Algebra Subroutines</a></p>
 <p>[2]: <a 
href="http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf">Randomized
 methods for computing low-rank
 approximations of matrices</a></p>

Modified: 
websites/staging/mahout/trunk/content/users/algorithms/intro-cooccurrence-spark.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/algorithms/intro-cooccurrence-spark.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/algorithms/intro-cooccurrence-spark.html
 Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,7 +264,18 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="intro-to-cooccurrence-recommenders-with-spark">Intro to 
Cooccurrence Recommenders with Spark</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="intro-to-cooccurrence-recommenders-with-spark">Intro to Cooccurrence 
Recommenders with Spark<a class="headerlink" 
href="#intro-to-cooccurrence-recommenders-with-spark" title="Permanent 
link">&para;</a></h1>
 <p>Mahout provides several important building blocks for creating 
recommendations using Spark. <em>spark-itemsimilarity</em> can 
 be used to create "other people also liked these things" type recommendations 
and paired with a search engine can 
 personalize recommendations for individual users. <em>spark-rowsimilarity</em> 
can provide non-personalized content based 
@@ -272,7 +284,7 @@ recommendations and when paired with a s
 <p>This is a simplified Lambda architecture with Mahout's 
<em>spark-itemsimilarity</em> playing the batch model building role and a 
search engine playing the realtime serving role.</p>
 <p>You will create two collections, one for user history and one for item 
"indicators". Indicators are user interactions that lead to the wished-for 
interaction. For example, if you wish a user to purchase something and you 
collect all users' purchase interactions, <em>spark-itemsimilarity</em> will 
create a purchase indicator from them. You can also use other user 
interactions in a cross-cooccurrence calculation to create purchase 
indicators. </p>
 <p>User history is used as a query on the item collection with its 
cooccurrence and cross-cooccurrence indicators (there may be several 
indicators). The primary interaction or action is picked to be the thing you 
want to recommend; other actions are believed to be correlated but may not 
indicate exactly the same user intent. For instance, in an ecom recommender a 
purchase is a very good primary action, but you may also know product 
detail-views, or additions-to-wishlists. These can be considered secondary 
actions which may all be used to calculate cross-cooccurrence indicators. The 
user history that forms the recommendations query will contain recorded primary 
and secondary actions all targeted towards the correct indicator fields.</p>
-<h2 id="references">References</h2>
+<h2 id="references">References<a class="headerlink" href="#references" 
title="Permanent link">&para;</a></h2>
 <ol>
 <li>A free ebook, which talks about the general idea: <a 
href="https://www.mapr.com/practical-machine-learning">Practical Machine 
Learning</a></li>
 <li>A slide deck, which talks about mixing actions or other indicators: <a 
href="http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/">Creating
 a Unified Recommender</a></li>
@@ -281,7 +293,7 @@ and  <a href="http://occamsmachete.com/m
 <li>A post describing the log-likelihood ratio:  <a 
href="http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html">Surprise
 and Coincidence</a>.  LLR is used to reduce noise in the data while keeping the 
calculations at O(n) complexity.</li>
 </ol>
 <p>Below are the command line jobs but the drivers and associated code can 
also be customized and accessed from the Scala APIs.</p>
-<h2 id="1-spark-itemsimilarity">1. spark-itemsimilarity</h2>
+<h2 id="1-spark-itemsimilarity">1. spark-itemsimilarity<a class="headerlink" 
href="#1-spark-itemsimilarity" title="Permanent link">&para;</a></h2>
 <p><em>spark-itemsimilarity</em> is the Spark counterpart of the Mahout 
mapreduce job called <em>itemsimilarity</em>. It takes in elements of 
interactions, which have userID, itemID, and optionally a value. It will 
produce one or more indicator matrices created by comparing every user's 
interactions with every other user's. The indicator matrix is an item x item 
matrix where the values are log-likelihood ratio strengths. For the legacy 
mapreduce version, there were several possible similarity measures but these 
are being deprecated in favor of LLR because in practice it performs the 
best.</p>
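 <p>As a rough illustration of the log-likelihood ratio strengths mentioned 
above, the following Python sketch scores a 2x2 table of cooccurrence counts 
in the entropy form described in the "Surprise and Coincidence" post. The 
function names and the count layout (<code>k11</code> = both items seen 
together, <code>k12</code> and <code>k21</code> = one item without the other, 
<code>k22</code> = neither) are assumptions made for this example and are not 
Mahout's API.</p>

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))
```

 <p>A score near zero means the two items cooccur about as often as chance 
would predict; larger scores indicate a stronger association.</p>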
 <p>Mahout's mapreduce version of itemsimilarity takes a text file that is 
expected to have user and item IDs that conform to 
 Mahout's ID requirements--they are non-negative integers that can be viewed as 
row and column numbers in a matrix.</p>
@@ -360,7 +372,7 @@ to recommend.   </p>
 
 <p>This looks daunting, but the job defaults to simple, fairly sane values, takes 
exactly the same input as the legacy code, and is quite flexible. It allows the 
user to point to a single text file, a directory full of files, or a tree of 
directories to be traversed recursively. The files included can be specified 
with either a regex-style pattern or filename. The schema for the file is 
defined by column numbers, which map to the important bits of data including 
IDs and values. The files can even contain filters, which allow unneeded rows 
to be discarded or used for cross-cooccurrence calculations.</p>
 <p>See ItemSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. </p>
-<h3 id="defaults-in-the-spark-itemsimilarity-cli">Defaults in the 
<em><strong>spark-itemsimilarity</strong></em> CLI</h3>
+<h3 id="defaults-in-the-spark-itemsimilarity-cli">Defaults in the 
<em><strong>spark-itemsimilarity</strong></em> CLI<a class="headerlink" 
href="#defaults-in-the-spark-itemsimilarity-cli" title="Permanent 
link">&para;</a></h3>
 <p>If all defaults are used the input can be as simple as:</p>
 <div class="codehilite"><pre><span class="n">userID1</span><span 
class="p">,</span><span class="n">itemID1</span>
 <span class="n">userID2</span><span class="p">,</span><span 
class="n">itemID2</span>
@@ -378,7 +390,7 @@ to recommend.   </p>
 </pre></div>
 
 
-<h3 id="wzxhzdk18how-to-use-multiple-user-actionswzxhzdk19"><a 
name="multiple-actions">How To Use Multiple User Actions</a></h3>
+<h3 id="how-to-use-multiple-user-actions"><a name="multiple-actions">How To 
Use Multiple User Actions</a><a class="headerlink" 
href="#how-to-use-multiple-user-actions" title="Permanent link">&para;</a></h3>
 <p>Often we record various actions the user takes for later analytics. These 
can now be used to make recommendations. 
 The idea of a recommender is to recommend the action you want the user to 
make. For an ecom app this might be 
 a purchase action. It is usually not a good idea to just treat other actions 
the same as the action you want to recommend. 
@@ -412,7 +424,7 @@ action log of the form:</p>
 </pre></div>
 
 
-<h3 id="command-line">Command Line</h3>
+<h3 id="command-line">Command Line<a class="headerlink" href="#command-line" 
title="Permanent link">&para;</a></h3>
 <p>Use the following options:</p>
 <div class="codehilite"><pre><span class="n">bash</span>$ <span 
class="n">mahout</span> <span class="n">spark</span><span 
class="o">-</span><span class="n">itemsimilarity</span> <span class="o">\</span>
     <span class="o">--</span><span class="n">input</span> <span 
class="n">in</span><span class="o">-</span><span class="n">file</span> <span 
class="o">\</span>     # <span class="n">where</span> <span class="n">to</span> 
<span class="n">look</span> <span class="k">for</span> <span 
class="n">data</span>
@@ -426,7 +438,7 @@ action log of the form:</p>
 </pre></div>
 
 
-<h3 id="output">Output</h3>
+<h3 id="output">Output<a class="headerlink" href="#output" title="Permanent 
link">&para;</a></h3>
 <p>The output of the job will be the standard text version of two Mahout DRMs. 
This is a case where we are calculating 
 cross-cooccurrence, so a primary indicator matrix and a cross-cooccurrence 
indicator matrix will be created.</p>
 <div class="codehilite"><pre><span class="n">out</span><span 
class="o">-</span><span class="n">path</span>
@@ -455,7 +467,7 @@ cross-cooccurrence so a primary indicato
 
 <p><strong>Note:</strong> You can run this multiple times to use more than two 
actions or you can use the underlying 
 SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any 
number of cross-cooccurrence indicators.</p>
-<h3 id="log-file-input">Log File Input</h3>
+<h3 id="log-file-input">Log File Input<a class="headerlink" 
href="#log-file-input" title="Permanent link">&para;</a></h3>
 <p>A common method of storing data is in log files. If they are written using 
some delimiter they can be consumed directly by spark-itemsimilarity. For 
instance input of the form:</p>
 <div class="codehilite"><pre>2014<span class="o">-</span>06<span 
class="o">-</span>23 14<span class="p">:</span>46<span 
class="p">:</span>53<span class="p">.</span>115<span class="o">\</span><span 
class="n">tu1</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tiphone</span>
 2014<span class="o">-</span>06<span class="o">-</span>23 14<span 
class="p">:</span>46<span class="p">:</span>53<span class="p">.</span>115<span 
class="o">\</span><span class="n">tu1</span><span class="o">\</span><span 
class="n">tpurchase</span><span class="o">\</span><span 
class="n">trandom</span> <span class="n">text</span><span 
class="o">\</span><span class="n">tipad</span>
@@ -494,7 +506,7 @@ SimilarityAnalysis.cooccurrence API, whi
 </pre></div>
 
 
-<h2 id="2-spark-rowsimilarity">2. spark-rowsimilarity</h2>
+<h2 id="2-spark-rowsimilarity">2. spark-rowsimilarity<a class="headerlink" 
href="#2-spark-rowsimilarity" title="Permanent link">&para;</a></h2>
 <p><em>spark-rowsimilarity</em> is the companion to 
<em>spark-itemsimilarity</em>. The primary difference is that it takes a text 
file version of 
 a matrix of sparse vectors with optional application-specific IDs and finds 
similar rows rather than items (columns). Its use is
 not limited to collaborative filtering. The input is in text-delimited form 
where there are three delimiters used. By 
@@ -554,25 +566,25 @@ by a list of the most similar rows.</p>
 
 
 <p>See RowSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. </p>
-<h1 id="3-using-spark-rowsimilarity-with-text-data">3. Using 
<em>spark-rowsimilarity</em> with Text Data</h1>
+<h1 id="3-using-spark-rowsimilarity-with-text-data">3. Using 
<em>spark-rowsimilarity</em> with Text Data<a class="headerlink" 
href="#3-using-spark-rowsimilarity-with-text-data" title="Permanent 
link">&para;</a></h1>
 <p>Another use case for <em>spark-rowsimilarity</em> is in finding similar 
textual content. For instance, given the tags associated with 
 a blog post,
  it can find which other posts have similar tags. In this case the columns are tags and 
the rows are posts. Since LLR is 
 the only similarity method supported, this is not the optimal way to determine 
general "bag-of-words" document similarity. 
 LLR is used more as a quality filter than as a similarity measure. However, 
<em>spark-rowsimilarity</em> will produce 
 lists of similar docs for every doc if the input is docs with lists of terms. The 
Apache <a href="http://lucene.apache.org">Lucene</a> project provides several 
methods of <a 
href="http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description">analyzing
 and tokenizing</a> documents.</p>
-<h1 id="wzxhzdk244-creating-a-multimodal-recommenderwzxhzdk25"><a 
name="unified-recommender">4. Creating a Multimodal Recommender</a></h1>
+<h1 id="4-creating-a-multimodal-recommender"><a name="unified-recommender">4. 
Creating a Multimodal Recommender</a><a class="headerlink" 
href="#4-creating-a-multimodal-recommender" title="Permanent 
link">&para;</a></h1>
 <p>Using the output of <em>spark-itemsimilarity</em> and 
<em>spark-rowsimilarity</em> you can build a multimodal cooccurrence and 
content based
  recommender that can be used in both or either mode depending on indicators 
available and the history available at 
 runtime for a user. Some slides describing this method can be found <a 
href="http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/">here</a>.</p>
-<h2 id="requirements">Requirements</h2>
+<h2 id="requirements">Requirements<a class="headerlink" href="#requirements" 
title="Permanent link">&para;</a></h2>
 <ol>
 <li>Mahout SNAPSHOT-1.0 or later</li>
 <li>Hadoop</li>
 <li>Spark, the correct version for your version of Mahout and Hadoop</li>
 <li>A search engine like Solr or Elasticsearch</li>
 </ol>
-<h2 id="indicators">Indicators</h2>
+<h2 id="indicators">Indicators<a class="headerlink" href="#indicators" 
title="Permanent link">&para;</a></h2>
 <p>Indicators come in 3 types</p>
 <ol>
 <li><strong>Cooccurrence</strong>: calculated with 
<em>spark-itemsimilarity</em> from user actions</li>
@@ -588,7 +600,7 @@ while working well for items with lots o
 indicators developers can create a solution for the "cold-start" problem that 
gracefully improves with more user history
 and as items have more interactions. It is also possible to create a 
completely content-based recommender that personalizes 
 recommendations.</p>
-<h2 id="example-with-3-indicators">Example with 3 Indicators</h2>
+<h2 id="example-with-3-indicators">Example with 3 Indicators<a 
class="headerlink" href="#example-with-3-indicators" title="Permanent 
link">&para;</a></h2>
 <p>You will need to decide how you store user action data so they can be 
processed by the item and row similarity jobs and 
 this is most easily done by using text files as described above. The data that 
is processed by these jobs is considered the 
 training data. You will need some amount of user history in your recs query. 
It is typical to use the most recent user history 
@@ -606,7 +618,7 @@ but rather a "content" or "metadata" typ
 individual that you are making recs for. This means that this method will make 
recommendations for items that have 
 no collaborative filtering data, as happens with new items in a catalog. New 
items may have tags assigned but no one
  has purchased or viewed them yet. In the final query we will mix all 3 
indicators.</p>
-<h2 id="content-indicator">Content Indicator</h2>
+<h2 id="content-indicator">Content Indicator<a class="headerlink" 
href="#content-indicator" title="Permanent link">&para;</a></h2>
 <p>To create a content-indicator we'll make use of the fact that the user has 
purchased items with certain tags. We want to find 
 items with the most similar tags. Notice that other users' behavior is not 
considered, only other items' tags. This defines a 
 content or metadata indicator. They are used when you want to find items that 
are similar to other items by using their 
@@ -642,7 +654,7 @@ is finished we no longer need the streng
 
 
 <p>We now have three indicators, two collaborative filtering type and one 
content type.</p>
-<h2 id="multimodal-recommender-query">Multimodal Recommender Query</h2>
+<h2 id="multimodal-recommender-query">Multimodal Recommender Query<a 
class="headerlink" href="#multimodal-recommender-query" title="Permanent 
link">&para;</a></h2>
 <p>The actual form of the query for recommendations will vary depending on 
your search engine but the intent is the same. For a given user, map their 
history of an action or content to the correct indicator field and perform an 
OR'd query. </p>
 <p>We have 3 indicators, which the search engine indexes into 3 fields; 
we'll call them "purchase", "view", and "tags". 
 We take the user's history that corresponds to each indicator and create a 
query of the form:</p>
@@ -669,7 +681,7 @@ on the popularity field. If we use the e
 
 
 <p>This will return recommendations favoring ones that have the intrinsic 
indicator "hot".</p>
-<h2 id="notes">Notes</h2>
+<h2 id="notes">Notes<a class="headerlink" href="#notes" title="Permanent 
link">&para;</a></h2>
 <ol>
 <li>Use as much user action history as you can gather. Choose a primary action 
that is closest to what you want to recommend and the others will be used to 
create cross-cooccurrence indicators. Using more data in this fashion will 
almost always produce better recommendations.</li>
 <li>Content can be used where there is no recorded user behavior or when items 
change too quickly to get much interaction history. They can be used alone or 
mixed with other indicators.</li>
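
Reviewer note on the "Multimodal Recommender Query" text in the hunk above: the OR'd query it describes (user history mapped onto the "purchase", "view", and "tags" indicator fields) can be sketched in Python. This is a minimal sketch, assuming a search engine that accepts Lucene-style `field:(a OR b)` query strings; the function name `build_recs_query` and the sample item ids are hypothetical and are not taken from the Mahout documentation.

```python
from typing import Dict, List


def build_recs_query(history: Dict[str, List[str]]) -> str:
    """Build one OR'd query string from per-indicator user history.

    `history` maps an indicator field name (for example "purchase")
    to the item ids or tags the user has recorded for that indicator.
    Fields with no history are skipped, so a new user with only tag
    history still produces a usable query.
    """
    clauses = []
    for field, item_ids in history.items():
        if not item_ids:
            continue  # no history for this indicator, nothing to match
        # Quote every id so ids containing spaces stay single terms.
        terms = " OR ".join(f'"{item_id}"' for item_id in item_ids)
        clauses.append(f"{field}:({terms})")
    return " OR ".join(clauses)


if __name__ == "__main__":
    # Hypothetical history for one user.
    print(build_recs_query({
        "purchase": ["iphone", "ipad"],
        "view": [],
        "tags": ["phone"],
    }))
```

Skipping empty fields is what lets the same query shape serve both cold-start users (content history only) and users with full collaborative-filtering history.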

Modified: 
websites/staging/mahout/trunk/content/users/algorithms/recommender-overview.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/algorithms/recommender-overview.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/algorithms/recommender-overview.html
 Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,14 +264,25 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="recommender-overview">Recommender Overview</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="recommender-overview">Recommender Overview<a class="headerlink" 
href="#recommender-overview" title="Permanent link">&para;</a></h1>
 <p>Recommenders have changed over the years. Mahout contains a long list of 
them, which you can still use. But to get the best out of our more modern 
approach we'll need to think of the Recommender as a "model creation" 
component&mdash;supplied by Mahout's new spark-itemsimilarity job, and a 
"serving" component&mdash;supplied by a modern scalable search engine, like 
Solr.</p>
 <p><img alt="image" src="http://i.imgur.com/fliHMBo.png" /></p>
 <p>To integrate with your application you will collect user interactions, 
storing them in a DB and also in a form usable by Mahout. The simplest way to 
do this is to log user interactions to csv files (user-id, item-id). The DB 
should be set up to contain the last n user interactions, which will form part 
of the query for recommendations.</p>
 <p>Mahout's spark-itemsimilarity will create a table of (item-id, 
list-of-similar-items) in csv form. Think of this as an item collection with 
one field containing the item-ids of similar items. Index this with your search 
engine. </p>
 <p>When your application needs recommendations for a specific person, get the 
latest user history of interactions from the DB and query the indicator 
collection with this history. You will get back an ordered list of item-ids. 
These are your recommendations. You may wish to filter out any that the user 
has already seen but that will depend on your use case.</p>
 <p>All ids for users and items are preserved as string tokens and so work as 
an external key in DBs or as doc ids for search engines; they also work as 
tokens for search queries.</p>
-<h2 id="references">References</h2>
+<h2 id="references">References<a class="headerlink" href="#references" 
title="Permanent link">&para;</a></h2>
 <ol>
 <li>A free ebook, which talks about the general idea: <a 
href="https://www.mapr.com/practical-machine-learning">Practical Machine 
Learning</a></li>
 <li>A slide deck, which talks about mixing actions or other indicators: <a 
href="http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/">Creating
 a Multimodal Recommender with Mahout and a Search Engine</a></li>
@@ -278,7 +290,7 @@
 and  <a 
href="http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/">What's
 New in Recommenders: part #2</a></li>
 <li>A post describing the log-likelihood ratio:  <a 
href="http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html">Surprise
 and Coincidence</a>  LLR is used to reduce noise in the data while keeping the 
calculations O(n) complexity.</li>
 </ol>
-<h2 id="mahout-model-creation">Mahout Model Creation</h2>
+<h2 id="mahout-model-creation">Mahout Model Creation<a class="headerlink" 
href="#mahout-model-creation" title="Permanent link">&para;</a></h2>
 <p>See the page describing <a 
href="http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html"><em>spark-itemsimilarity</em></a>
 for more details.</p>
    </div>
   </div>     
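
Reviewer note on the Recommender Overview text above: the collection step it describes (log every interaction as a `user-id,item-id` csv row for Mahout, and keep the last n interactions per user for the recommendations query) can be sketched as follows. This is a minimal in-memory sketch; the class name `InteractionStore` and its methods are hypothetical, and a real deployment would presumably keep the recent history in the DB the page mentions rather than in a Python dict.

```python
import csv
import io
from collections import defaultdict, deque


class InteractionStore:
    """Log interactions as csv rows and remember the last n per user."""

    def __init__(self, log_file, history_size=10):
        # csv rows (user-id, item-id) are the training input for Mahout.
        self._writer = csv.writer(log_file)
        # A bounded deque per user drops the oldest interaction first.
        self._recent = defaultdict(lambda: deque(maxlen=history_size))

    def record(self, user_id, item_id):
        self._writer.writerow([user_id, item_id])
        self._recent[user_id].append(item_id)

    def recent_history(self, user_id):
        # Use .get so that looking up an unknown user creates no entry.
        return list(self._recent.get(user_id, ()))


if __name__ == "__main__":
    log = io.StringIO()  # stands in for a real log file
    store = InteractionStore(log, history_size=2)
    for item in ("iphone", "ipad", "nexus"):
        store.record("u1", item)
    print(store.recent_history("u1"))
    print(log.getvalue())
```

The list returned by `recent_history` is what would be sent to the search engine as the query against the indicator collection.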

Modified: 
websites/staging/mahout/trunk/content/users/algorithms/spark-naive-bayes.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/algorithms/spark-naive-bayes.html 
(original)
+++ 
websites/staging/mahout/trunk/content/users/algorithms/spark-naive-bayes.html 
Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>

Modified: websites/staging/mahout/trunk/content/users/basics/algorithms.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/algorithms.html 
(original)
+++ websites/staging/mahout/trunk/content/users/basics/algorithms.html Fri Apr  
8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,9 +264,20 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <hr />
-<h2 id="mahout-0101-features-by-engine"><em>Mahout 0.10.1 Features by 
Engine</em></h2>
-<table>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<hr />
+<h2 id="mahout-0101-features-by-engine"><em>Mahout 0.10.1 Features by 
Engine</em><a class="headerlink" href="#mahout-0101-features-by-engine" 
title="Permanent link">&para;</a></h2>
+<table class="table">
 <thead>
 <tr>
 <th></th>

Modified: websites/staging/mahout/trunk/content/users/basics/collections.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/collections.html 
(original)
+++ websites/staging/mahout/trunk/content/users/basics/collections.html Fri Apr 
 8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,9 +264,20 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <p>Organize by usage? (classification, recommendation etc.)</p>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>Organize by usage? (classification, recommendation etc.)</p>
 <p><a name="Collections-CollectionsofCollections"></a></p>
-<h2 id="collections-of-collections">Collections of Collections</h2>
+<h2 id="collections-of-collections">Collections of Collections<a 
class="headerlink" href="#collections-of-collections" title="Permanent 
link">&para;</a></h2>
 <ul>
 <li><a href="http://mldata.org/about/">ML Data</a>
  ... repository supported by Pascal 2.</li>
@@ -280,7 +292,7 @@
  LinkedIn discussion of lots of data sets</li>
 </ul>
 <p><a name="Collections-CategorizationData"></a></p>
-<h2 id="categorization-data">Categorization Data</h2>
+<h2 id="categorization-data">Categorization Data<a class="headerlink" 
href="#categorization-data" title="Permanent link">&para;</a></h2>
 <ul>
 <li><a 
href="http://people.csail.mit.edu/jrennie/20Newsgroups/">20Newsgroups</a></li>
 <li><a 
href="http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm">RCV1
 data set</a></li>
@@ -292,7 +304,7 @@ There is a newer beta verson here:<a hre
 <li>Lending Club loan data <a 
href="https://www.lendingclub.com/info/download-data.action">https://www.lendingclub.com/info/download-data.action</a></li>
 </ul>
 <p><a name="Collections-RecommendationData"></a></p>
-<h2 id="recommendation-data">Recommendation Data</h2>
+<h2 id="recommendation-data">Recommendation Data<a class="headerlink" 
href="#recommendation-data" title="Permanent link">&para;</a></h2>
 <ul>
 <li><a href="http://library.hud.ac.uk/data/usagedata/">Book usage and 
recommendation data from the University of Huddersfield</a></li>
 <li><a 
href="http://denoiserthebetter.posterous.com/music-recommendation-datasets">Last.fm</a>
@@ -302,7 +314,7 @@ There is a newer beta verson here:<a hre
 <li><a href="http://www.grouplens.org/node/73">GroupLens/MovieLens Movie 
Review Dataset</a></li>
 </ul>
 <p><a name="Collections-MultilingualData"></a></p>
-<h2 id="multilingual-data">Multilingual Data</h2>
+<h2 id="multilingual-data">Multilingual Data<a class="headerlink" 
href="#multilingual-data" title="Permanent link">&para;</a></h2>
 <ul>
 <li><a 
href="http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php">http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php</a>
  - 308,000 subtitle files covering about 18,900 movies in 59 languages
@@ -314,14 +326,14 @@ The original site, OpenSubtitles.org, is
 corpuses of European and Canadian legal tomes.</li>
 </ul>
 <p><a name="Collections-Geospatial"></a></p>
-<h2 id="geospatial">Geospatial</h2>
+<h2 id="geospatial">Geospatial<a class="headerlink" href="#geospatial" 
title="Permanent link">&para;</a></h2>
 <ul>
 <li><a href="http://www.naturalearthdata.com/">Natural Earth Data</a></li>
 <li><a href="http://wiki.openstreetmap.org/wiki/Main_Page">Open Street Maps</a>
 And other crowd-sourced mapping data sites.</li>
 </ul>
 <p><a name="Collections-Airline"></a></p>
-<h2 id="airline">Airline</h2>
+<h2 id="airline">Airline<a class="headerlink" href="#airline" title="Permanent 
link">&para;</a></h2>
 <ul>
 <li><a href="http://openflights.org/">Open Flights</a>
  - Crowd-sourced database of airlines, flights, airports, times, etc.</li>
@@ -329,7 +341,7 @@ And other crowd-sourced mapping data sit
  - 120m CSV records, 12G uncompressed</li>
 </ul>
 <p><a name="Collections-GeneralResources"></a></p>
-<h2 id="general-resources">General Resources</h2>
+<h2 id="general-resources">General Resources<a class="headerlink" 
href="#general-resources" title="Permanent link">&para;</a></h2>
 <ul>
 <li><a href="http://theinfo.org/">theinfo</a></li>
 <li><a href="http://wordnet.princeton.edu/obtain">WordNet</a></li>
@@ -337,7 +349,7 @@ And other crowd-sourced mapping data sit
  - freely available web crawl on EC2</li>
 </ul>
 <p><a name="Collections-Stuff"></a></p>
-<h2 id="stuff">Stuff</h2>
+<h2 id="stuff">Stuff<a class="headerlink" href="#stuff" title="Permanent 
link">&para;</a></h2>
 <ul>
 <li><a 
href="http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html">http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html</a></li>
 <li><a 
href="http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/">4 
Universities Data Set</a></li>

Modified: websites/staging/mahout/trunk/content/users/basics/collocations.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/collocations.html 
(original)
+++ websites/staging/mahout/trunk/content/users/basics/collocations.html Fri 
Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,8 +264,19 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <p><a name="Collocations-CollocationsinMahout"></a></p>
-<h1 id="collocations-in-mahout">Collocations in Mahout</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="Collocations-CollocationsinMahout"></a></p>
+<h1 id="collocations-in-mahout">Collocations in Mahout<a class="headerlink" 
href="#collocations-in-mahout" title="Permanent link">&para;</a></h1>
 <p>A collocation is defined as a sequence of words or terms which co-occur
 more often than would be expected by chance. Statistically relevant
 combinations of terms identify additional lexical units which can be
@@ -272,7 +284,7 @@ treated as features in a vector-based re
 discussion of collocations can be found on <a 
href="http://en.wikipedia.org/wiki/Collocation">Wikipedia</a>.</p>
 <p>See the <a 
href="http://comments.gmane.org/gmane.comp.apache.mahout.user/5685">Reuters 
example</a> for a more detailed discussion of collocations.</p>
 <p><a name="Collocations-Log-LikelihoodbasedCollocationIdentification"></a></p>
-<h2 
id="theory-behind-implementation-log-likelihood-based-collocation-identification">Theory
 behind implementation: Log-Likelihood based Collocation Identification</h2>
+<h2 
id="theory-behind-implementation-log-likelihood-based-collocation-identification">Theory
 behind implementation: Log-Likelihood based Collocation Identification<a 
class="headerlink" 
href="#theory-behind-implementation-log-likelihood-based-collocation-identification"
 title="Permanent link">&para;</a></h2>
 <p>Mahout provides an implementation of a collocation identification algorithm
 which scores collocations using log-likelihood ratio. The log-likelihood
 score indicates the relative usefulness of a collocation with regard to other
@@ -315,7 +327,7 @@ ngram are treated as the head and tail.<
 occur around other interesting features of the text such as sentence
 boundaries.</p>
 <p><a name="Collocations-GeneratingNGrams"></a></p>
-<h2 id="generating-ngrams">Generating NGrams</h2>
+<h2 id="generating-ngrams">Generating NGrams<a class="headerlink" 
href="#generating-ngrams" title="Permanent link">&para;</a></h2>
 <p>The tools that the collocation identification algorithm is embedded within
 either consume tokenized text as input or provide the ability to specify an
 implementation of the Lucene Analyzer class to perform tokenization in order
@@ -330,11 +342,11 @@ Note that both bigrams and trigrams are
 to the existing algorithm would involve limiting the output to a particular
 gram size as opposed to solely specifying a max ngram size.</p>
 <p><a 
name="Collocations-RunningtheCollocationIdentificationAlgorithm."></a></p>
-<h2 id="running-the-collocation-identification-algorithm">Running the 
Collocation Identification Algorithm.</h2>
+<h2 id="running-the-collocation-identification-algorithm">Running the 
Collocation Identification Algorithm.<a class="headerlink" 
href="#running-the-collocation-identification-algorithm" title="Permanent 
link">&para;</a></h2>
 <p>There are a couple of ways to run the LLR-based collocation algorithm in
 Mahout:</p>
 <p><a name="Collocations-Whencreatingvectorsfromasequencefile"></a></p>
-<h3 id="when-creating-vectors-from-a-sequence-file">When creating vectors from 
a sequence file</h3>
+<h3 id="when-creating-vectors-from-a-sequence-file">When creating vectors from 
a sequence file<a class="headerlink" 
href="#when-creating-vectors-from-a-sequence-file" title="Permanent 
link">&para;</a></h3>
 <p>The llr collocation identifier is integrated into the process that is used
 to create vectors from sequence files of text keys and values. Collocations
 are generated when the --maxNGramSize (-ng) option is not specified and
@@ -396,7 +408,7 @@ times. </p>
 
 
 <p><a name="Collocations-CollocDriver"></a></p>
-<h3 id="collocdriver">CollocDriver</h3>
+<h3 id="collocdriver">CollocDriver<a class="headerlink" href="#collocdriver" 
title="Permanent link">&para;</a></h3>
 <div class="codehilite"><pre><span class="n">bin</span><span 
class="o">/</span><span class="n">mahout</span> <span class="n">org</span><span 
class="p">.</span><span class="n">apache</span><span class="p">.</span><span 
class="n">mahout</span><span class="p">.</span><span 
class="n">vectorizer</span><span class="p">.</span><span 
class="n">collocations</span><span class="p">.</span><span 
class="n">llr</span><span class="p">.</span><span class="n">CollocDriver</span>
 
 <span class="n">Usage</span><span class="p">:</span>                           
           
@@ -440,7 +452,7 @@ times. </p>
 
 
 <p><a name="Collocations-Algorithmdetails"></a></p>
-<h2 id="algorithm-details">Algorithm details</h2>
+<h2 id="algorithm-details">Algorithm details<a class="headerlink" 
href="#algorithm-details" title="Permanent link">&para;</a></h2>
 <p>This section describes the implementation of the collocation identification
 algorithm in terms of the map-reduce phases that are used to generate
 ngrams and count the frequencies required to perform the log-likelihood
@@ -449,10 +461,10 @@ CamelCase can be found in the mahout-uti
 org.apache.mahout.utils.nlp.collocations.llr</p>
 <p>The algorithm is implemented in two map-reduce passes:</p>
 <p><a name="Collocations-Pass1:CollocDriver.generateCollocations(...)"></a></p>
-<h3 id="pass-1-collocdrivergeneratecollocations">Pass 1: 
CollocDriver.generateCollocations(...)</h3>
+<h3 id="pass-1-collocdrivergeneratecollocations">Pass 1: 
CollocDriver.generateCollocations(...)<a class="headerlink" 
href="#pass-1-collocdrivergeneratecollocations" title="Permanent 
link">&para;</a></h3>
 <p>Generates NGrams and counts frequencies for ngrams, head and tail 
subgrams.</p>
 <p><a name="Collocations-Map:CollocMapper"></a></p>
-<h4 id="map-collocmapper">Map: CollocMapper</h4>
+<h4 id="map-collocmapper">Map: CollocMapper<a class="headerlink" 
href="#map-collocmapper" title="Permanent link">&para;</a></h4>
 <p>Input k: Text (documentId), v: StringTuple (tokens) </p>
 <p>Each call to the mapper passes in the full set of tokens for the
 corresponding document using a StringTuple. The ShingleFilter is run across
@@ -477,7 +489,7 @@ encountered in the input which is used a
 <p>Output k: GramKey (head or tail subgram), v: Gram (head, tail or ngram with
 frequency)</p>
 <p><a name="Collocations-Combiner:CollocCombiner"></a></p>
-<h4 id="combiner-colloccombiner">Combiner: CollocCombiner</h4>
+<h4 id="combiner-colloccombiner">Combiner: CollocCombiner<a class="headerlink" 
href="#combiner-colloccombiner" title="Permanent link">&para;</a></h4>
 <p>Input k: GramKey, v:Gram (as above)</p>
 <p>This phase merges the counts for unique ngrams or ngram fragments across
 multiple documents. The combiner treats the entire GramKey as the key and
@@ -486,7 +498,7 @@ call to the combiner's reduce method, th
 single tuple is passed out via the collector.</p>
 <p>Output k: GramKey, v:Gram</p>
 <p><a name="Collocations-Reduce:CollocReducer"></a></p>
-<h4 id="reduce-collocreducer">Reduce: CollocReducer</h4>
+<h4 id="reduce-collocreducer">Reduce: CollocReducer<a class="headerlink" 
href="#reduce-collocreducer" title="Permanent link">&para;</a></h4>
 <p>Input k: GramKey, v: Gram (as above)</p>
 <p>The CollocReducer employs the Hadoop secondary sort strategy to avoid
 caching ngram tuples in memory in order to calculate total ngram and
@@ -546,15 +558,15 @@ be incremented.</p>
 <p>Output is in the format k:Gram (ngram, frequency), v:Gram (subgram,
 frequency)</p>
 <p><a 
name="Collocations-Pass2:CollocDriver.computeNGramsPruneByLLR(...)"></a></p>
-<h3 id="pass-2-collocdrivercomputengramsprunebyllr">Pass 2: 
CollocDriver.computeNGramsPruneByLLR(...)</h3>
+<h3 id="pass-2-collocdrivercomputengramsprunebyllr">Pass 2: 
CollocDriver.computeNGramsPruneByLLR(...)<a class="headerlink" 
href="#pass-2-collocdrivercomputengramsprunebyllr" title="Permanent 
link">&para;</a></h3>
 <p>Pass 1 has calculated full frequencies for ngrams and subgrams; Pass 2
 performs the LLR calculation.</p>
 <p><a 
name="Collocations-MapPhase:IdentityMapper(org.apache.hadoop.mapred.lib.IdentityMapper)"></a></p>
-<h4 id="map-phase-identitymapper-orgapachehadoopmapredlibidentitymapper">Map 
Phase: IdentityMapper (org.apache.hadoop.mapred.lib.IdentityMapper)</h4>
+<h4 id="map-phase-identitymapper-orgapachehadoopmapredlibidentitymapper">Map 
Phase: IdentityMapper (org.apache.hadoop.mapred.lib.IdentityMapper)<a 
class="headerlink" 
href="#map-phase-identitymapper-orgapachehadoopmapredlibidentitymapper" 
title="Permanent link">&para;</a></h4>
 <p>This phase is a no-op. The data is passed through unchanged. The rest of
 the work for llr calculation is done in the reduce phase.</p>
 <p><a name="Collocations-ReducePhase:LLRReducer"></a></p>
-<h4 id="reduce-phase-llrreducer">Reduce Phase: LLRReducer</h4>
+<h4 id="reduce-phase-llrreducer">Reduce Phase: LLRReducer<a class="headerlink" 
href="#reduce-phase-llrreducer" title="Permanent link">&para;</a></h4>
 <p>Input is k:Gram, v:Gram (as above)</p>
 <p>This phase receives the head and tail subgrams and their frequencies for
 each ngram (with frequency) produced for the input:</p>
@@ -578,7 +590,7 @@ tail and N is the total number of ngrams
 the Skipped.LESS_THAN_MIN_LLR counter is incremented.</p>
 <p>Output is k: Text (ngram), v: DoubleWritable (llr score)</p>
 <p><a name="Collocations-Unigrampass-through."></a></p>
-<h3 id="unigram-pass-through">Unigram pass-through.</h3>
+<h3 id="unigram-pass-through">Unigram pass-through.<a class="headerlink" 
href="#unigram-pass-through" title="Permanent link">&para;</a></h3>
 <p>By default in seq2sparse, or if the -u option is provided to the
 CollocDriver, unigrams (single tokens) will be passed through the job and
 each token's frequency will be calculated. As with ngrams, unigrams are

Modified: 
websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html
 Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,13 +264,24 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="creating-vectors-from-text">Creating vectors from text</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="creating-vectors-from-text">Creating vectors from text<a 
class="headerlink" href="#creating-vectors-from-text" title="Permanent 
link">&para;</a></h1>
 <p><a name="CreatingVectorsfromText-Introduction"></a></p>
-<h1 id="introduction">Introduction</h1>
+<h1 id="introduction">Introduction<a class="headerlink" href="#introduction" 
title="Permanent link">&para;</a></h1>
 <p>For clustering and classifying documents it is usually necessary to convert 
the raw text
 into vectors that can then be consumed by the clustering <a 
href="algorithms.html">Algorithms</a>.  These approaches are described 
below.</p>
 <p><a name="CreatingVectorsfromText-FromLucene"></a></p>
-<h1 id="from-lucene">From Lucene</h1>
+<h1 id="from-lucene">From Lucene<a class="headerlink" href="#from-lucene" 
title="Permanent link">&para;</a></h1>
 <p><em>NOTE: Your Lucene index must be created with the same version of Lucene
 used in Mahout.  As of Mahout 0.9 this is Lucene 4.6.1. If these versions don't 
match you will likely get "Exception in thread "main"
 org.apache.lucene.index.CorruptIndexException: Unknown format version: -11"
@@ -292,7 +304,7 @@ in the org.apache.mahout.utils.vectors p
 several input options, which can be displayed by specifying the --help
 option.  Examples of running the driver are included below:</p>
 <p><a 
name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a></p>
-<h4 id="generating-an-output-file-from-a-lucene-index">Generating an output 
file from a Lucene Index</h4>
+<h4 id="generating-an-output-file-from-a-lucene-index">Generating an output 
file from a Lucene Index<a class="headerlink" 
href="#generating-an-output-file-from-a-lucene-index" title="Permanent 
link">&para;</a></h4>
 <div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span 
class="o">/</span><span class="n">bin</span><span class="o">/</span><span 
class="n">mahout</span> <span class="n">lucene</span><span 
class="p">.</span><span class="n">vector</span> 
     <span class="o">--</span><span class="n">dir</span> <span 
class="p">(</span><span class="o">-</span><span class="n">d</span><span 
class="p">)</span> <span class="n">dir</span>                     <span 
class="n">The</span> <span class="n">Lucene</span> <span 
class="n">directory</span>      
     <span class="o">--</span><span class="n">idField</span> <span 
class="n">idField</span>                  <span class="n">The</span> <span 
class="n">field</span> <span class="n">in</span> <span class="n">the</span> 
<span class="n">index</span>    
@@ -348,7 +360,7 @@ option.  Examples of running the driver
 </pre></div>
 
 
-<h4 id="example-create-50-vectors-from-an-index">Example: Create 50 Vectors 
from an Index</h4>
+<h4 id="example-create-50-vectors-from-an-index">Example: Create 50 Vectors 
from an Index<a class="headerlink" 
href="#example-create-50-vectors-from-an-index" title="Permanent 
link">&para;</a></h4>
 <div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span 
class="o">/</span><span class="n">bin</span><span class="o">/</span><span 
class="n">mahout</span> <span class="n">lucene</span><span 
class="p">.</span><span class="n">vector</span>
     <span class="o">--</span><span class="n">dir</span> $<span 
class="n">WORK_DIR</span><span class="o">/</span><span 
class="n">wikipedia</span><span class="o">/</span><span 
class="n">solr</span><span class="o">/</span><span class="n">data</span><span 
class="o">/</span><span class="n">index</span> 
     <span class="o">--</span><span class="n">field</span> <span 
class="n">body</span> 
@@ -363,7 +375,7 @@ out the info to the output dir and the d
 outputs 50 vectors.  If you don't specify --max, then all the documents in
 the index are output.</p>
 <p><a name="CreatingVectorsfromText-50VectorsFromLuceneL2Norm"></a></p>
-<h4 
id="example-creating-50-normalized-vectors-from-a-lucene-index-using-the-l_2-norm">Example:
 Creating 50 Normalized Vectors from a Lucene Index using the <a 
href="http://en.wikipedia.org/wiki/Lp_space">L_2 Norm</a></h4>
+<h4 
id="example-creating-50-normalized-vectors-from-a-lucene-index-using-the-l_2-norm">Example:
 Creating 50 Normalized Vectors from a Lucene Index using the <a 
href="http://en.wikipedia.org/wiki/Lp_space">L_2 Norm</a><a class="headerlink" 
href="#example-creating-50-normalized-vectors-from-a-lucene-index-using-the-l_2-norm"
 title="Permanent link">&para;</a></h4>
 <div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span 
class="o">/</span><span class="n">bin</span><span class="o">/</span><span 
class="n">mahout</span> <span class="n">lucene</span><span 
class="p">.</span><span class="n">vector</span> 
     <span class="o">--</span><span class="n">dir</span> $<span 
class="n">WORK_DIR</span><span class="o">/</span><span 
class="n">wikipedia</span><span class="o">/</span><span 
class="n">solr</span><span class="o">/</span><span class="n">data</span><span 
class="o">/</span><span class="n">index</span> 
     <span class="o">--</span><span class="n">field</span> <span 
class="n">body</span> 
@@ -375,7 +387,7 @@ the index are output.</p>
 
 
 <p><a name="CreatingVectorsfromText-FromDirectoryofTextdocuments"></a></p>
-<h2 id="from-a-directory-of-text-documents">From A Directory of Text 
documents</h2>
+<h2 id="from-a-directory-of-text-documents">From A Directory of Text 
documents<a class="headerlink" href="#from-a-directory-of-text-documents" 
title="Permanent link">&para;</a></h2>
 <p>Mahout has utilities to generate Vectors from a directory of text
 documents. Before creating the vectors, you need to convert the documents
 to SequenceFile format. SequenceFile is a hadoop class which allows us to
@@ -385,7 +397,7 @@ content in UTF-8 format.</p>
 <p>You may find <a href="http://tika.apache.org/">Tika</a> helpful in 
converting
 binary documents to text.</p>
 <p><a 
name="CreatingVectorsfromText-ConvertingdirectoryofdocumentstoSequenceFileformat"></a></p>
-<h4 id="converting-directory-of-documents-to-sequencefile-format">Converting 
directory of documents to SequenceFile format</h4>
+<h4 id="converting-directory-of-documents-to-sequencefile-format">Converting 
directory of documents to SequenceFile format<a class="headerlink" 
href="#converting-directory-of-documents-to-sequencefile-format" 
title="Permanent link">&para;</a></h4>
 <p>Mahout has a nifty utility which reads a directory path including its
 sub-directories and creates the SequenceFile in a chunked manner for us.</p>
 <div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span 
class="o">/</span><span class="n">bin</span><span class="o">/</span><span 
class="n">mahout</span> <span class="n">seqdirectory</span> 
@@ -423,7 +435,7 @@ sub-directories and creates the Sequence
 
 <p>The output of seqDirectory will be a Sequence file &lt; Text, Text &gt; of 
all documents (/sub-directory-path/documentFileName, documentText).</p>
 <p><a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a></p>
-<h4 id="creating-vectors-from-sequencefile">Creating Vectors from 
SequenceFile</h4>
+<h4 id="creating-vectors-from-sequencefile">Creating Vectors from 
SequenceFile<a class="headerlink" href="#creating-vectors-from-sequencefile" 
title="Permanent link">&para;</a></h4>
 <p>From the sequence file generated from the above step run the following to
 generate vectors. </p>
 <div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span 
class="o">/</span><span class="n">bin</span><span class="o">/</span><span 
class="n">mahout</span> <span class="n">seq2sparse</span>
@@ -482,7 +494,7 @@ generate vectors. </p>
 <p>As well, seq2sparse will create SequenceFiles for: a dictionary (wordIndex, 
word), a word frequency count (wordIndex, count) and a document frequency count 
(wordIndex, DFCount) in the output directory. </p>
 <p>The --minSupport option is the min frequency for the word to be considered 
as a feature; --minDF is the min number of documents the word needs to be in; 
--maxDFPercent is the max value of the expression (document frequency of a 
word/total number of document) to be considered as good feature to be in the 
document. These options are helpful in removing high frequency features like 
stop words.</p>
 <p>The vectorized documents can then be used as input to many of Mahout's 
classification and clustering algorithms.</p>
-<h4 
id="example-creating-normalized-tf-idf-vectors-from-a-directory-of-text-documents-using-trigrams-and-the-l_2-norm">Example:
 Creating Normalized <a 
href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">TF-IDF</a> Vectors from a 
directory of text documents using <a 
href="http://en.wikipedia.org/wiki/N-gram">trigrams</a> and the <a 
href="http://en.wikipedia.org/wiki/Lp_space">L_2 Norm</a></h4>
+<h4 
id="example-creating-normalized-tf-idf-vectors-from-a-directory-of-text-documents-using-trigrams-and-the-l_2-norm">Example:
 Creating Normalized <a 
href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">TF-IDF</a> Vectors from a 
directory of text documents using <a 
href="http://en.wikipedia.org/wiki/N-gram">trigrams</a> and the <a 
href="http://en.wikipedia.org/wiki/Lp_space">L_2 Norm</a><a class="headerlink" 
href="#example-creating-normalized-tf-idf-vectors-from-a-directory-of-text-documents-using-trigrams-and-the-l_2-norm"
 title="Permanent link">&para;</a></h4>
 <p>Create sequence files from the directory of text documents:</p>
 <div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span 
class="o">/</span><span class="n">bin</span><span class="o">/</span><span 
class="n">mahout</span> <span class="n">seqdirectory</span> 
     <span class="o">-</span><span class="nb">i</span> $<span 
class="n">WORK_DIR</span><span class="o">/</span><span class="n">reuters</span> 
@@ -507,12 +519,12 @@ generate vectors. </p>
 
 <p>The sequence file in the 
$WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors directory can now be 
used as input to the Mahout <a 
href="http://mahout.apache.org/users/clustering/k-means-clustering.html">k-Means</a>
 clustering algorithm.</p>
 <p><a name="CreatingVectorsfromText-Background"></a></p>
-<h2 id="background">Background</h2>
+<h2 id="background">Background<a class="headerlink" href="#background" 
title="Permanent link">&para;</a></h2>
 <ul>
 <li><a href="http://markmail.org/thread/l5zi3yk446goll3o">Discussion on 
centroid calculations with sparse vectors</a></li>
 </ul>
 <p><a 
name="CreatingVectorsfromText-ConvertingexistingvectorstoMahout'sformat"></a></p>
-<h2 id="converting-existing-vectors-to-mahouts-format">Converting existing 
vectors to Mahout's format</h2>
+<h2 id="converting-existing-vectors-to-mahouts-format">Converting existing 
vectors to Mahout's format<a class="headerlink" 
href="#converting-existing-vectors-to-mahouts-format" title="Permanent 
link">&para;</a></h2>
 <p>If you are in the happy position to already own a document (as in: texts,
 images or whatever item you wish to treat) processing pipeline, the
 question arises of how to convert the vectors into the Mahout vector

Modified: 
websites/staging/mahout/trunk/content/users/basics/creating-vectors.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/creating-vectors.html 
(original)
+++ websites/staging/mahout/trunk/content/users/basics/creating-vectors.html 
Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,8 +264,19 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <p><a name="CreatingVectors-UtilitiesforCreatingVectors"></a></p>
-<h1 id="utilities-for-creating-vectors">Utilities for Creating Vectors</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="CreatingVectors-UtilitiesforCreatingVectors"></a></p>
+<h1 id="utilities-for-creating-vectors">Utilities for Creating Vectors<a 
class="headerlink" href="#utilities-for-creating-vectors" title="Permanent 
link">&para;</a></h1>
 <ol>
 <li>
 <p><a href="creating-vectors-from-text.html">Text</a> ... utilities to turn 
plain text into Mahout vectors.</p>

Modified: 
websites/staging/mahout/trunk/content/users/basics/gaussian-discriminative-analysis.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/basics/gaussian-discriminative-analysis.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/basics/gaussian-discriminative-analysis.html
 Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,16 +264,27 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <p><a 
name="GaussianDiscriminativeAnalysis-GaussianDiscriminativeAnalysis"></a></p>
-<h1 id="gaussian-discriminative-analysis">Gaussian Discriminative Analysis</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a 
name="GaussianDiscriminativeAnalysis-GaussianDiscriminativeAnalysis"></a></p>
+<h1 id="gaussian-discriminative-analysis">Gaussian Discriminative Analysis<a 
class="headerlink" href="#gaussian-discriminative-analysis" title="Permanent 
link">&para;</a></h1>
 <p>Gaussian Discriminative Analysis is a tool for multigroup classification
 based on extending linear discriminant analysis. The paper on the approach
 is located at http://citeseer.ist.psu.edu/4617.html (note, for some reason
 the paper is backwards, in that page 1 is at the end)</p>
 <p><a name="GaussianDiscriminativeAnalysis-Parallelizationstrategy"></a></p>
-<h2 id="parallelization-strategy">Parallelization strategy</h2>
+<h2 id="parallelization-strategy">Parallelization strategy<a 
class="headerlink" href="#parallelization-strategy" title="Permanent 
link">&para;</a></h2>
 <p><a name="GaussianDiscriminativeAnalysis-Designofpackages"></a></p>
-<h2 id="design-of-packages">Design of packages</h2>
+<h2 id="design-of-packages">Design of packages<a class="headerlink" 
href="#design-of-packages" title="Permanent link">&para;</a></h2>
    </div>
   </div>     
 </div> 

Modified: 
websites/staging/mahout/trunk/content/users/basics/independent-component-analysis.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/basics/independent-component-analysis.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/basics/independent-component-analysis.html
 Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,13 +264,24 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <p><a 
name="IndependentComponentAnalysis-IndependentComponentAnalysis"></a></p>
-<h1 id="independent-component-analysis">Independent Component Analysis</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><a name="IndependentComponentAnalysis-IndependentComponentAnalysis"></a></p>
+<h1 id="independent-component-analysis">Independent Component Analysis<a 
class="headerlink" href="#independent-component-analysis" title="Permanent 
link">&para;</a></h1>
 <p>See also: Principal Component Analysis.</p>
 <p><a name="IndependentComponentAnalysis-Parallelizationstrategy"></a></p>
-<h2 id="parallelization-strategy">Parallelization strategy</h2>
+<h2 id="parallelization-strategy">Parallelization strategy<a 
class="headerlink" href="#parallelization-strategy" title="Permanent 
link">&para;</a></h2>
 <p><a name="IndependentComponentAnalysis-Designofpackages"></a></p>
-<h2 id="design-of-packages">Design of packages</h2>
+<h2 id="design-of-packages">Design of packages<a class="headerlink" 
href="#design-of-packages" title="Permanent link">&para;</a></h2>
    </div>
   </div>     
 </div> 

Modified: 
websites/staging/mahout/trunk/content/users/basics/mahout-collections.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/mahout-collections.html 
(original)
+++ websites/staging/mahout/trunk/content/users/basics/mahout-collections.html 
Fri Apr  8 18:41:08 2016
@@ -146,6 +146,7 @@
                   <li class="nav-header">Engines</li>
                   <li><a href="/users/sparkbindings/home.html">Spark</a></li>
                   <li><a 
href="/users/environment/h2o-internals.html">H2O</a></li>
+                  <li><a href="/users/flinkbindings/home.html">Flink</a></li>
                   <li class="nav-header">References</li>
                   <li><a 
href="/users/environment/in-core-reference.html">In-Core Algebraic DSL 
Reference</a></li>
                   <li><a 
href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL 
Reference</a></li>
@@ -263,16 +264,27 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="mahout-collections">Mahout collections</h1>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="mahout-collections">Mahout collections<a class="headerlink" 
href="#mahout-collections" title="Permanent link">&para;</a></h1>
 <p><a name="mahout-collections-Introduction"></a></p>
-<h2 id="introduction">Introduction</h2>
+<h2 id="introduction">Introduction<a class="headerlink" href="#introduction" 
title="Permanent link">&para;</a></h2>
 <p>The Mahout Collections library is a set of container classes that address
 some limitations of the standard collections in Java. <a 
href="http://domino.research.ibm.com/comm/research_people.nsf/pages/sevitsky.pubs.html/$FILE/oopsla08%20memory-efficient%20java%20slides.pdf">This
 presentation</a>
  describes a number of performance problems with the standard collections. </p>
 <p>Mahout collections addresses two of the more glaring: the lack of support
 for primitive types and the lack of open hashing.</p>
 <p><a name="mahout-collections-PrimitiveTypes"></a></p>
-<h2 id="primitive-types">Primitive Types</h2>
+<h2 id="primitive-types">Primitive Types<a class="headerlink" 
href="#primitive-types" title="Permanent link">&para;</a></h2>
 <p>The most visible feature of Mahout Collections is the large collection of
 primitive type collections. Given Java's asymmetrical support for the
 primitive types, the only efficient way to handle them is with many
@@ -284,18 +296,18 @@ Even when the <em>java.util</em> interfa
 to include requirements that are not consistent with efficient use of
 primitive types.</p>
 <p><a name="mahout-collections-OpenAddressing"></a></p>
-<h1 id="open-addressing">Open Addressing</h1>
+<h1 id="open-addressing">Open Addressing<a class="headerlink" 
href="#open-addressing" title="Permanent link">&para;</a></h1>
 <p>All of the sets and maps in Mahout Collections are open-addressed hash
 tables. Open addressing has a much smaller memory footprint than chaining.
 Since the purpose of these collections is to avoid the memory cost of
 autoboxing, open addressing is a consistent design choice.</p>
 <p><a name="mahout-collections-Sets"></a></p>
-<h2 id="sets">Sets</h2>
+<h2 id="sets">Sets<a class="headerlink" href="#sets" title="Permanent 
link">&para;</a></h2>
 <p>Mahout Collections includes open hash sets. Unlike <em>java.util</em>, a 
set is
 not a recycled hash table; the sets are separately implemented and do not
 have any additional storage usage for unused keys.</p>
 <p><a name="mahout-collections-CreditwhereCreditisdue"></a></p>
-<h1 id="credit-where-credit-is-due">Credit where Credit is due</h1>
+<h1 id="credit-where-credit-is-due">Credit where Credit is due<a 
class="headerlink" href="#credit-where-credit-is-due" title="Permanent 
link">&para;</a></h1>
 <p>The implementation of Mahout Collections is derived from <a 
href="http://acs.lbl.gov/~hoschek/colt/">Cern Colt</a>
 .</p>
    </div>


