Repository: mahout Updated Branches: refs/heads/asf-site f4961201e -> d9686c8ba
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/environment/out-of-core-reference.html ---------------------------------------------------------------------- diff --git a/users/environment/out-of-core-reference.html b/users/environment/out-of-core-reference.html index 8db77d0..aed6f88 100644 --- a/users/environment/out-of-core-reference.html +++ b/users/environment/out-of-core-reference.html @@ -278,29 +278,29 @@ <p>The subjects of this reference are solely applicable to Mahout-Samsaraâs <strong>DRM</strong> (distributed row matrix).</p> -<p>In this reference, DRMs will be denoted as e.g. <code class="highlighter-rouge">A</code>, and in-core matrices as e.g. <code class="highlighter-rouge">inCoreA</code>.</p> +<p>In this reference, DRMs will be denoted as e.g. <code>A</code>, and in-core matrices as e.g. <code>inCoreA</code>.</p> <h4 id="imports">Imports</h4> <p>The following imports are used to enable seamless in-core and distributed algebraic DSL operations:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.mahout.math._ +<pre><code>import org.apache.mahout.math._ import scalabindings._ import RLikeOps._ import drm._ import RLikeDRMOps._ -</code></pre></div></div> +</code></pre> <p>If working with mixed scala/java code:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import collection._ +<pre><code>import collection._ import JavaConversions._ -</code></pre></div></div> +</code></pre> <p>If you are working with Mahout-Samsaraâs Spark-specific operations e.g. for context creation:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.mahout.sparkbindings._ -</code></pre></div></div> +<pre><code>import org.apache.mahout.sparkbindings._ +</code></pre> <p>The Mahout shell does all of these imports automatically.</p> @@ -310,38 +310,38 @@ import JavaConversions._ <p>Loading a DRM from (HD)FS:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>drmDfsRead(path = hdfsPath) -</code></pre></div></div> +<pre><code>drmDfsRead(path = hdfsPath) +</code></pre> <p>Parallelizing from an in-core matrix:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val inCoreA = (dense(1, 2, 3), (3, 4, 5)) +<pre><code>val inCoreA = (dense(1, 2, 3), (3, 4, 5)) val A = drmParallelize(inCoreA) -</code></pre></div></div> +</code></pre> <p>Creating an empty DRM:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val A = drmParallelizeEmpty(100, 50) -</code></pre></div></div> +<pre><code>val A = drmParallelizeEmpty(100, 50) +</code></pre> <p>Collecting to driverâs jvm in-core:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val inCoreA = A.collect -</code></pre></div></div> +<pre><code>val inCoreA = A.collect +</code></pre> <p><strong>Warning: The collection of distributed matrices happens implicitly whenever conversion to an in-core (o.a.m.math.Matrix) type is required. E.g.:</strong></p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val inCoreA: Matrix = ... +<pre><code>val inCoreA: Matrix = ... val drmB: DrmLike[Int] =... 
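// declaring the result as an in-core Matrix below forces an implicit collect of the distributed operand to the driver JVM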
val inCoreC: Matrix = inCoreA %*%: drmB -</code></pre></div></div> +</code></pre> <p><strong>implies (incoreA %*%: drmB).collect</strong></p> <p>Collecting to (HD)FS as a Mahoutâs DRM formatted file:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A.dfsWrite(path = hdfsPath) -</code></pre></div></div> +<pre><code>A.dfsWrite(path = hdfsPath) +</code></pre> <h4 id="logical-algebraic-operators-on-drm-matrices">Logical algebraic operators on DRM matrices:</h4> @@ -350,12 +350,12 @@ val inCoreC: Matrix = inCoreA %*%: drmB <p>Cache a DRM and trigger an optimized physical plan:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>drmA.checkpoint(CacheHint.MEMORY_AND_DISK) -</code></pre></div></div> +<pre><code>drmA.checkpoint(CacheHint.MEMORY_AND_DISK) +</code></pre> <p>Other valid caching Instructions:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>drmA.checkpoint(CacheHint.NONE) +<pre><code>drmA.checkpoint(CacheHint.NONE) drmA.checkpoint(CacheHint.DISK_ONLY) drmA.checkpoint(CacheHint.DISK_ONLY_2) drmA.checkpoint(CacheHint.MEMORY_ONLY) @@ -365,38 +365,38 @@ drmA.checkpoint(CacheHint.MEMORY_ONLY_SER_2) drmA.checkpoint(CacheHint.MEMORY_AND_DISK_2) drmA.checkpoint(CacheHint.MEMORY_AND_DISK_SER) drmA.checkpoint(CacheHint.MEMORY_AND_DISK_SER_2) -</code></pre></div></div> +</code></pre> <p><em>Note: Logical DRM operations are lazily computed. Currently the actual computations and optional caching will be triggered by dfsWrite(â¦), collect(â¦) and blockify(â¦).</em></p> <p>Transposition:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A.t -</code></pre></div></div> +<pre><code>A.t +</code></pre> <p>Elementwise addition <em>(Matrices of identical geometry and row key types)</em>:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A + B -</code></pre></div></div> +<pre><code>A + B +</code></pre> <p>Elementwise subtraction <em>(Matrices of identical geometry and row key types)</em>:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A - B -</code></pre></div></div> +<pre><code>A - B +</code></pre> <p>Elementwise multiplication (Hadamard) <em>(Matrices of identical geometry and row key types)</em>:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A * B -</code></pre></div></div> +<pre><code>A * B +</code></pre> <p>Elementwise division <em>(Matrices of identical geometry and row key types)</em>:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A / B -</code></pre></div></div> +<pre><code>A / B +</code></pre> <p><strong>Elementwise operations involving one in-core argument (int-keyed DRMs only)</strong>:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A + inCoreB +<pre><code>A + inCoreB A - inCoreB A * inCoreB A / inCoreB @@ -408,73 +408,73 @@ inCoreA +: B inCoreA -: B inCoreA *: B inCoreA /: B -</code></pre></div></div> +</code></pre> -<p>Note the Spark associativity change (e.g. <code class="highlighter-rouge">A *: inCoreB</code> means <code class="highlighter-rouge">B.leftMultiply(A</code>), same as when both arguments are in core). 
Whenever operator arguments include both in-core and out-of-core arguments, the operator can only be associated with the out-of-core (DRM) argument to support the distributed implementation.</p> +<p>Note the Spark associativity change (e.g. <code>A *: inCoreB</code> means <code>B.leftMultiply(A</code>), same as when both arguments are in core). Whenever operator arguments include both in-core and out-of-core arguments, the operator can only be associated with the out-of-core (DRM) argument to support the distributed implementation.</p> <p><strong>Matrix-matrix multiplication %*%</strong>:</p> -<p><code class="highlighter-rouge">\(\mathbf{M}=\mathbf{AB}\)</code></p> +<p><code>\(\mathbf{M}=\mathbf{AB}\)</code></p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A %*% B +<pre><code>A %*% B A %*% inCoreB A %*% inCoreDiagonal A %*%: B -</code></pre></div></div> +</code></pre> <p><em>Note: same as above, whenever operator arguments include both in-core and out-of-core arguments, the operator can only be associated with the out-of-core (DRM) argument to support the distributed implementation.</em></p> <p><strong>Matrix-vector multiplication %*%</strong> -Currently we support a right multiply product of a DRM and an in-core Vector(<code class="highlighter-rouge">\(\mathbf{Ax}\)</code>) resulting in a single column DRM, which then can be collected in front (usually the desired outcome):</p> +Currently we support a right multiply product of a DRM and an in-core Vector(<code>\(\mathbf{Ax}\)</code>) resulting in a single column DRM, which then can be collected in front (usually the desired outcome):</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val Ax = A %*% x +<pre><code>val Ax = A %*% x val inCoreX = Ax.collect(::, 0) -</code></pre></div></div> +</code></pre> <p><strong>Matrix-scalar +,-,*,/</strong> Elementwise operations of every matrix element and a scalar:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A + 5.0 +<pre><code>A + 5.0 A - 5.0 A :- 5.0 5.0 -: A A * 5.0 A / 5.0 5.0 /: a -</code></pre></div></div> +</code></pre> -<p>Note that <code class="highlighter-rouge">5.0 -: A</code> means <code class="highlighter-rouge">\(m_{ij} = 5 - a_{ij}\)</code> and <code class="highlighter-rouge">5.0 /: A</code> means <code class="highlighter-rouge">\(m_{ij} = \frac{5}{a{ij}}\)</code> for all elements of the result.</p> +<p>Note that <code>5.0 -: A</code> means <code>\(m_{ij} = 5 - a_{ij}\)</code> and <code>5.0 /: A</code> means <code>\(m_{ij} = \frac{5}{a{ij}}\)</code> for all elements of the result.</p> <h4 id="slicing">Slicing</h4> <p>General slice:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A(100 to 200, 100 to 200) -</code></pre></div></div> +<pre><code>A(100 to 200, 100 to 200) +</code></pre> <p>Horizontal Block:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A(::, 100 to 200) -</code></pre></div></div> +<pre><code>A(::, 100 to 200) +</code></pre> <p>Vertical Block:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A(100 to 200, ::) -</code></pre></div></div> +<pre><code>A(100 to 200, ::) +</code></pre> -<p><em>Note: if row range is not all-range (::) the the DRM must be <code class="highlighter-rouge">Int</code>-keyed. 
General case row slicing is not supported by DRMs with key types other than <code class="highlighter-rouge">Int</code></em>.</p> +<p><em>Note: if row range is not all-range (::) the the DRM must be <code>Int</code>-keyed. General case row slicing is not supported by DRMs with key types other than <code>Int</code></em>.</p> <h4 id="stitching">Stitching</h4> <p>Stitch side by side (cbind R semantics):</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val drmAnextToB = drmA cbind drmB -</code></pre></div></div> +<pre><code>val drmAnextToB = drmA cbind drmB +</code></pre> <p>Stitch side by side (Scala):</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val drmAnextToB = drmA.cbind(drmB) -</code></pre></div></div> +<pre><code>val drmAnextToB = drmA.cbind(drmB) +</code></pre> <p>Analogously, vertical concatenation is available via <strong>rbind</strong></p> @@ -483,126 +483,126 @@ A / 5.0 <p><strong>drm.mapBlock(â¦)</strong>:</p> -<p>The DRM operator <code class="highlighter-rouge">mapBlock</code> provides transformational access to the distributed vertical blockified tuples of a matrix (Row-Keys, Vertical-Matrix-Block).</p> +<p>The DRM operator <code>mapBlock</code> provides transformational access to the distributed vertical blockified tuples of a matrix (Row-Keys, Vertical-Matrix-Block).</p> -<p>Using <code class="highlighter-rouge">mapBlock</code> to add 1.0 to a DRM:</p> +<p>Using <code>mapBlock</code> to add 1.0 to a DRM:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val inCoreA = dense((1, 2, 3), (2, 3 , 4), (3, 4, 5)) +<pre><code>val inCoreA = dense((1, 2, 3), (2, 3 , 4), (3, 4, 5)) val drmA = drmParallelize(inCoreA) val B = A.mapBlock() { case (keys, block) => keys -> (block += 1.0) } -</code></pre></div></div> +</code></pre> <h4 id="broadcasting-vectors-and-matrices-to-closures">Broadcasting Vectors and matrices to closures</h4> <p>Generally we can create and use one-way closure attributes to be used on the back end.</p> <p>Scalar matrix multiplication:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val factor: Int = 15 +<pre><code>val factor: Int = 15 val drm2 = drm1.mapBlock() { case (keys, block) => block *= factor keys -> block } -</code></pre></div></div> +</code></pre> -<p><strong>Closure attributes must be java-serializable. Currently Mahoutâs in-core Vectors and Matrices are not java-serializable, and must be broadcast to the closure using <code class="highlighter-rouge">drmBroadcast(...)</code></strong>:</p> +<p><strong>Closure attributes must be java-serializable. Currently Mahoutâs in-core Vectors and Matrices are not java-serializable, and must be broadcast to the closure using <code>drmBroadcast(...)</code></strong>:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val v: Vector ... +<pre><code>val v: Vector ... 
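// in-core Vectors and Matrices are not Java-serializable, so hand them to worker closures via drmBroadcast(...) instead of capturing them directly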
val bcastV = drmBroadcast(v) val drm2 = drm1.mapBlock() { case (keys, block) => for(row <- 0 until block.nrow) block(row, ::) -= bcastV keys -> block } -</code></pre></div></div> +</code></pre> <h4 id="computations-providing-ad-hoc-summaries">Computations providing ad-hoc summaries</h4> <p>Matrix cardinality:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>drmA.nrow +<pre><code>drmA.nrow drmA.ncol -</code></pre></div></div> +</code></pre> -<p><em>Note: depending on the stage of optimization, these may trigger a computational action. I.e. if one calls <code class="highlighter-rouge">nrow()</code> n times, then the back end will actually recompute <code class="highlighter-rouge">nrow</code> n times.</em></p> +<p><em>Note: depending on the stage of optimization, these may trigger a computational action. I.e. if one calls <code>nrow()</code> n times, then the back end will actually recompute <code>nrow</code> n times.</em></p> <p>Means and sums:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>drmA.colSums +<pre><code>drmA.colSums drmA.colMeans drmA.rowSums drmA.rowMeans -</code></pre></div></div> +</code></pre> -<p><em>Note: These will always trigger a computational action. I.e. if one calls <code class="highlighter-rouge">colSums()</code> n times, then the back end will actually recompute <code class="highlighter-rouge">colSums</code> n times.</em></p> +<p><em>Note: These will always trigger a computational action. I.e. if one calls <code>colSums()</code> n times, then the back end will actually recompute <code>colSums</code> n times.</em></p> <h4 id="distributed-matrix-decompositions">Distributed Matrix Decompositions</h4> <p>To import the decomposition package:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.mahout.math._ +<pre><code>import org.apache.mahout.math._ import decompositions._ -</code></pre></div></div> +</code></pre> <p>Distributed thin QR:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val (drmQ, incoreR) = dqrThin(drmA) -</code></pre></div></div> +<pre><code>val (drmQ, incoreR) = dqrThin(drmA) +</code></pre> <p>Distributed SSVD:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val (drmU, drmV, s) = dssvd(drmA, k = 40, q = 1) -</code></pre></div></div> +<pre><code>val (drmU, drmV, s) = dssvd(drmA, k = 40, q = 1) +</code></pre> <p>Distributed SPCA:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val (drmU, drmV, s) = dspca(drmA, k = 30, q = 1) -</code></pre></div></div> +<pre><code>val (drmU, drmV, s) = dspca(drmA, k = 30, q = 1) +</code></pre> <p>Distributed regularized ALS:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val (drmU, drmV, i) = dals(drmA, +<pre><code>val (drmU, drmV, i) = dals(drmA, k = 50, lambda = 0.0, maxIterations = 10, convergenceThreshold = 0.10)) -</code></pre></div></div> +</code></pre> <h4 id="adjusting-parallelism-of-computations">Adjusting parallelism of computations</h4> -<p>Set the minimum parallelism to 100 for computations on <code class="highlighter-rouge">drmA</code>:</p> +<p>Set the minimum parallelism to 100 for computations on <code>drmA</code>:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>drmA.par(min = 100) -</code></pre></div></div> +<pre><code>drmA.par(min = 100) +</code></pre> -<p>Set the exact 
parallelism to 100 for computations on <code class="highlighter-rouge">drmA</code>:</p> +<p>Set the exact parallelism to 100 for computations on <code>drmA</code>:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>drmA.par(exact = 100) -</code></pre></div></div> +<pre><code>drmA.par(exact = 100) +</code></pre> -<p>Set the engine specific automatic parallelism adjustment for computations on <code class="highlighter-rouge">drmA</code>:</p> +<p>Set the engine specific automatic parallelism adjustment for computations on <code>drmA</code>:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>drmA.par(auto = true) -</code></pre></div></div> +<pre><code>drmA.par(auto = true) +</code></pre> <h4 id="retrieving-the-engine-specific-data-structure-backing-the-drm">Retrieving the engine specific data structure backing the DRM:</h4> <p><strong>A Spark RDD:</strong></p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val myRDD = drmA.checkpoint().rdd -</code></pre></div></div> +<pre><code>val myRDD = drmA.checkpoint().rdd +</code></pre> <p><strong>An H2O Frame and Key Vec:</strong></p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val myFrame = drmA.frame +<pre><code>val myFrame = drmA.frame val myKeys = drmA.keys -</code></pre></div></div> +</code></pre> <p><strong>A Flink DataSet:</strong></p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>val myDataSet = drmA.ds -</code></pre></div></div> +<pre><code>val myDataSet = drmA.ds +</code></pre> <p>For more information including information on Mahout-Samsaraâs Algebraic Optimizer and in-core Linear algebra bindings see: <a href="http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf">Mahout Scala Bindings and Mahout Spark Bindings for Linear Algebra Subroutines</a></p> http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/flinkbindings/playing-with-samsara-flink.html ---------------------------------------------------------------------- diff --git a/users/flinkbindings/playing-with-samsara-flink.html b/users/flinkbindings/playing-with-samsara-flink.html index 8e45d9e..000d052 100644 --- a/users/flinkbindings/playing-with-samsara-flink.html +++ b/users/flinkbindings/playing-with-samsara-flink.html @@ -276,16 +276,16 @@ <p>To get started, add the following dependency to the pom:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><dependency> +<pre><code><dependency> <groupId>org.apache.mahout</groupId> <artifactId>mahout-flink_2.10</artifactId> <version>0.12.0</version> </dependency> -</code></pre></div></div> +</code></pre> <p>Here is how to use the Flink backend:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.flink.api.scala._ +<pre><code>import org.apache.flink.api.scala._ import org.apache.mahout.math.drm._ import org.apache.mahout.math.drm.RLikeDrmOps._ import org.apache.mahout.flinkbindings._ @@ -304,7 +304,7 @@ object ReadCsvExample { } } -</code></pre></div></div> +</code></pre> <h2 id="current-status">Current Status</h2> @@ -314,18 +314,18 @@ object ReadCsvExample { <ul> <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1701">MAHOUT-1701</a> Mahout DSL for Flink: implement AtB ABt and AtA operators</li> - <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1702">MAHOUT-1702</a> implement element-wise operators (like <code 
class="highlighter-rouge">A + 2</code> or <code class="highlighter-rouge">A + B</code>)</li> - <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1703">MAHOUT-1703</a> implement <code class="highlighter-rouge">cbind</code> and <code class="highlighter-rouge">rbind</code></li> - <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1709">MAHOUT-1709</a> implement slicing (like <code class="highlighter-rouge">A(1 to 10, ::)</code>)</li> - <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1710">MAHOUT-1710</a> implement right in-core matrix multiplication (<code class="highlighter-rouge">A %*% B</code> when <code class="highlighter-rouge">B</code> is in-core)</li> + <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1702">MAHOUT-1702</a> implement element-wise operators (like <code>A + 2</code> or <code>A + B</code>)</li> + <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1703">MAHOUT-1703</a> implement <code>cbind</code> and <code>rbind</code></li> + <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1709">MAHOUT-1709</a> implement slicing (like <code>A(1 to 10, ::)</code>)</li> + <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1710">MAHOUT-1710</a> implement right in-core matrix multiplication (<code>A %*% B</code> when <code>B</code> is in-core)</li> <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1711">MAHOUT-1711</a> implement broadcasting</li> - <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1712">MAHOUT-1712</a> implement operators <code class="highlighter-rouge">At</code>, <code class="highlighter-rouge">Ax</code>, <code class="highlighter-rouge">Atx</code> - <code class="highlighter-rouge">Ax</code> and <code class="highlighter-rouge">At</code> are implemented</li> + <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1712">MAHOUT-1712</a> implement operators <code>At</code>, <code>Ax</code>, <code>Atx</code> - <code>Ax</code> and <code>At</code> are implemented</li> <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1734">MAHOUT-1734</a> implement I/O - should be able to read results of Flink bindings</li> - <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1747">MAHOUT-1747</a> add support for different types of indexes (String, long, etc) - now supports <code class="highlighter-rouge">Int</code>, <code class="highlighter-rouge">Long</code> and <code class="highlighter-rouge">String</code></li> + <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1747">MAHOUT-1747</a> add support for different types of indexes (String, long, etc) - now supports <code>Int</code>, <code>Long</code> and <code>String</code></li> <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1748">MAHOUT-1748</a> switch to Flink Scala API</li> - <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1749">MAHOUT-1749</a> Implement <code class="highlighter-rouge">Atx</code></li> - <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1750">MAHOUT-1750</a> Implement <code class="highlighter-rouge">ABt</code></li> - <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1751">MAHOUT-1751</a> Implement <code class="highlighter-rouge">AtA</code></li> + <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1749">MAHOUT-1749</a> Implement <code>Atx</code></li> + <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1750">MAHOUT-1750</a> Implement <code>ABt</code></li> + <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1751">MAHOUT-1751</a> Implement <code>AtA</code></li> <li><a 
href="https://issues.apache.org/jira/browse/MAHOUT-1755">MAHOUT-1755</a> Flush intermediate results to FS - Flink, unlike Spark, does not store intermediate results in memory.</li> <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1764">MAHOUT-1764</a> Add standard backend tests for Flink</li> <li><a href="https://issues.apache.org/jira/browse/MAHOUT-1765">MAHOUT-1765</a> Add documentation about Flink backend</li> @@ -355,19 +355,19 @@ object ReadCsvExample { <p>There is a set of standard tests that all engines should pass (see <a href="https://issues.apache.org/jira/browse/MAHOUT-1764">MAHOUT-1764</a>).</p> <ul> - <li><code class="highlighter-rouge">DistributedDecompositionsSuite</code></li> - <li><code class="highlighter-rouge">DrmLikeOpsSuite</code></li> - <li><code class="highlighter-rouge">DrmLikeSuite</code></li> - <li><code class="highlighter-rouge">RLikeDrmOpsSuite</code></li> + <li><code>DistributedDecompositionsSuite</code></li> + <li><code>DrmLikeOpsSuite</code></li> + <li><code>DrmLikeSuite</code></li> + <li><code>RLikeDrmOpsSuite</code></li> </ul> <p>These are Flink-backend specific tests, e.g.</p> <ul> - <li><code class="highlighter-rouge">DrmLikeOpsSuite</code> for operations like <code class="highlighter-rouge">norm</code>, <code class="highlighter-rouge">rowSums</code>, <code class="highlighter-rouge">rowMeans</code></li> - <li><code class="highlighter-rouge">RLikeOpsSuite</code> for basic LA like <code class="highlighter-rouge">A.t %*% A</code>, <code class="highlighter-rouge">A.t %*% x</code>, etc</li> - <li><code class="highlighter-rouge">LATestSuite</code> tests for specific operators like <code class="highlighter-rouge">AtB</code>, <code class="highlighter-rouge">Ax</code>, etc</li> - <li><code class="highlighter-rouge">UseCasesSuite</code> has more complex examples, like power iteration, ridge regression, etc</li> + <li><code>DrmLikeOpsSuite</code> for operations like <code>norm</code>, <code>rowSums</code>, <code>rowMeans</code></li> + <li><code>RLikeOpsSuite</code> for basic LA like <code>A.t %*% A</code>, <code>A.t %*% x</code>, etc</li> + <li><code>LATestSuite</code> tests for specific operators like <code>AtB</code>, <code>Ax</code>, etc</li> + <li><code>UseCasesSuite</code> has more complex examples, like power iteration, ridge regression, etc</li> </ul> <h2 id="environment">Environment</h2> @@ -382,9 +382,9 @@ object ReadCsvExample { <p>When using mahout, please import the following modules:</p> <ul> - <li><code class="highlighter-rouge">mahout-math</code></li> - <li><code class="highlighter-rouge">mahout-math-scala</code></li> - <li><code class="highlighter-rouge">mahout-flink_2.10</code> + <li><code>mahout-math</code></li> + <li><code>mahout-math-scala</code></li> + <li><code>mahout-flink_2.10</code> *</li> </ul> http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/misc/parallel-frequent-pattern-mining.html ---------------------------------------------------------------------- diff --git a/users/misc/parallel-frequent-pattern-mining.html b/users/misc/parallel-frequent-pattern-mining.html index c02586a..1ec3df8 100644 --- a/users/misc/parallel-frequent-pattern-mining.html +++ b/users/misc/parallel-frequent-pattern-mining.html @@ -286,7 +286,7 @@ creating Iterators, Convertors and TopKPatternWritable for that particular object. 
For more information please refer the package org.apache.mahout.fpm.pfpgrowth.convertors.string</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>e.g: +<pre><code>e.g: FPGrowth<String> fp = new FPGrowth<String>(); Set<String> features = new HashSet<String>(); fp.generateTopKStringFrequentPatterns( @@ -298,7 +298,7 @@ org.apache.mahout.fpm.pfpgrowth.convertors.string</p> features, new StringOutputConvertor(new SequenceFileOutputCollector<Text, TopKStringPatterns>(writer)) ); -</code></pre></div></div> +</code></pre> <ul> <li>The first argument is the iterator of transaction in this case its http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/misc/using-mahout-with-python-via-jpype.html ---------------------------------------------------------------------- diff --git a/users/misc/using-mahout-with-python-via-jpype.html b/users/misc/using-mahout-with-python-via-jpype.html index 5290f63..d9e72c6 100644 --- a/users/misc/using-mahout-with-python-via-jpype.html +++ b/users/misc/using-mahout-with-python-via-jpype.html @@ -305,14 +305,14 @@ python script. The result for me looks like the following</p> <p>Now we can create a function to start the jvm in python using jype</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jvm=None +<pre><code>jvm=None def start_jpype(): global jvm if (jvm is None): cpopt="-Djava.class.path={cp}".format(cp=classpath) startJVM(jvmlib,"-ea",cpopt) jvm="started" -</code></pre></div></div> +</code></pre> <p><a name="UsingMahoutwithPythonviaJPype-WritingNamedVectorstoSequenceFilesfromPython"></a></p> <h1 id="writing-named-vectors-to-sequence-files-from-python">Writing Named Vectors to Sequence Files from Python</h1> @@ -320,7 +320,7 @@ jvm="started" be used by Mahout for kmeans. The example below is a function which creates vectors from two Gaussian distributions with unit variance.</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def create_inputs(ifile,*args,**param): +<pre><code>def create_inputs(ifile,*args,**param): """Create a sequence file containing some normally distributed ifile - path to the sequence file to create """ @@ -383,14 +383,14 @@ vectors from two Gaussian distributions with unit variance.</p> writer.append(wrapkey,wrapval) writer.close() -</code></pre></div></div> +</code></pre> <p><a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansClusteredPointsfromPython"></a></p> <h1 id="reading-the-kmeans-clustered-points-from-python">Reading the KMeans Clustered Points from Python</h1> <p>Similarly we can use JPype to easily read the clustered points outputted by mahout.</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def read_clustered_pts(ifile,*args,**param): +<pre><code>def read_clustered_pts(ifile,*args,**param): """Read the clustered points ifile - path to the sequence file containing the clustered points """ @@ -433,14 +433,14 @@ mahout.</p> print "cluster={key} Name={name} x={x}y={y}".format(key=key.toString(),name=nvec.getName(),x=nvec.get(0),y=nvec.get(1)) else: raise NotImplementedError("Vector isn't a NamedVector. 
Need tomodify/test the code to handle this case.") -</code></pre></div></div> +</code></pre> <p><a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansCentroids"></a></p> <h1 id="reading-the-kmeans-centroids">Reading the KMeans Centroids</h1> <p>Finally we can create a function to print out the actual cluster centers found by mahout,</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def getClusters(ifile,*args,**param): +<pre><code>def getClusters(ifile,*args,**param): """Read the centroids from the clusters outputted by kmenas ifile - Path to the sequence file containing the centroids """ @@ -479,7 +479,7 @@ found by mahout,</p> print "id={cid}center={center}".format(cid=vecwritable.getId(),center=center.values) pass -</code></pre></div></div> +</code></pre> </div> http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/recommender/intro-als-hadoop.html ---------------------------------------------------------------------- diff --git a/users/recommender/intro-als-hadoop.html b/users/recommender/intro-als-hadoop.html index e2920c2..db3700a 100644 --- a/users/recommender/intro-als-hadoop.html +++ b/users/recommender/intro-als-hadoop.html @@ -350,8 +350,8 @@ for item IDs. Then after recommendations are calculated you will have to transla <p>Assuming your <em>JAVA_HOME</em> is appropriately set and Mahout was installed properly weâre ready to configure our syntax. Enter the following command:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mahout parallelALS --input $als_input --output $als_output --lambda 0.1 --implicitFeedback true --alpha 0.8 --numFeatures 2 --numIterations 5 --numThreadsPerSolver 1 --tempDir tmp -</code></pre></div></div> +<pre><code>$ mahout parallelALS --input $als_input --output $als_output --lambda 0.1 --implicitFeedback true --alpha 0.8 --numFeatures 2 --numIterations 5 --numThreadsPerSolver 1 --tempDir tmp +</code></pre> <p>Running the command will execute a series of jobs the final product of which will be an output file deposited to the output directory specified in the command syntax. The output directory contains 3 sub-directories: <em>M</em> stores the item to feature matrix, <em>U</em> stores the user to feature matrix and userRatings stores the userâs ratings on the items. The <em>tempDir</em> parameter specifies the directory to store the intermediate output of the job, such as the matrix output in each iteration and each itemâs average rating. Using the <em>tempDir</em> will help on debugging.</p> @@ -359,8 +359,8 @@ for item IDs. Then after recommendations are calculated you will have to transla <p>Based on the output feature matrices from step 3, we could make recommendations for users. Enter the following command:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ mahout recommendfactorized --input $als_recommender_input --userFeatures $als_output/U/ --itemFeatures $als_output/M/ --numRecommendations 1 --output recommendations --maxRating 1 -</code></pre></div></div> +<pre><code> $ mahout recommendfactorized --input $als_recommender_input --userFeatures $als_output/U/ --itemFeatures $als_output/M/ --numRecommendations 1 --output recommendations --maxRating 1 +</code></pre> <p>The input user file is a sequence file, the sequence record key is user id and value is the userâs rated item ids which will be removed from recommendation. 
The output file generated in our simple example will be a text file giving the recommended item ids for each user. Remember to translate the Mahout ids back into your application specific ids.</p> http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/recommender/intro-cooccurrence-spark.html ---------------------------------------------------------------------- diff --git a/users/recommender/intro-cooccurrence-spark.html b/users/recommender/intro-cooccurrence-spark.html index 6aaccd5..fbeb1cd 100644 --- a/users/recommender/intro-cooccurrence-spark.html +++ b/users/recommender/intro-cooccurrence-spark.html @@ -304,7 +304,7 @@ For instance they might say an item-view is 0.2 of an item purchase. In practice cross-cooccurrence is a more principled way to handle this case. In effect it scrubs secondary actions with the action you want to recommend.</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark-itemsimilarity Mahout 1.0 +<pre><code>spark-itemsimilarity Mahout 1.0 Usage: spark-itemsimilarity [options] Disconnected from the target VM, address: '127.0.0.1:64676', transport: 'socket' @@ -368,7 +368,7 @@ Spark config options: -h | --help prints this usage text -</code></pre></div></div> +</code></pre> <p>This looks daunting but defaults to simple fairly sane values to take exactly the same input as legacy code and is pretty flexible. It allows the user to point to a single text file, a directory full of files, or a tree of directories to be traversed recursively. The files included can be specified with either a regex-style pattern or filename. The schema for the file is defined by column numbers, which map to the important bits of data including IDs and values. The files can even contain filters, which allow unneeded rows to be discarded or used for cross-cooccurrence calculations.</p> @@ -378,20 +378,20 @@ Spark config options: <p>If all defaults are used the input can be as simple as:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>userID1,itemID1 +<pre><code>userID1,itemID1 userID2,itemID2 ... -</code></pre></div></div> +</code></pre> <p>With the command line:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bash$ mahout spark-itemsimilarity --input in-file --output out-dir -</code></pre></div></div> +<pre><code>bash$ mahout spark-itemsimilarity --input in-file --output out-dir +</code></pre> <p>This will use the âlocalâ Spark context and will output the standard text version of a DRM</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>itemID1<tab>itemID2:value2<space>itemID10:value10... -</code></pre></div></div> +<pre><code>itemID1<tab>itemID2:value2<space>itemID10:value10... +</code></pre> <p>###<a name="multiple-actions">How To Use Multiple User Actions</a></p> @@ -407,7 +407,7 @@ to calculate the cross-cooccurrence indicator matrix.</p> <p><em>spark-itemsimilarity</em> can read separate actions from separate files or from a mixed action log by filtering certain lines. 
For a mixed action log of the form:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>u1,purchase,iphone +<pre><code>u1,purchase,iphone u1,purchase,ipad u2,purchase,nexus u2,purchase,galaxy @@ -427,13 +427,13 @@ u3,view,nexus u4,view,iphone u4,view,ipad u4,view,galaxy -</code></pre></div></div> +</code></pre> <p>###Command Line</p> <p>Use the following options:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bash$ mahout spark-itemsimilarity \ +<pre><code>bash$ mahout spark-itemsimilarity \ --input in-file \ # where to look for data --output out-path \ # root dir for output --master masterUrl \ # URL of the Spark master server @@ -442,35 +442,35 @@ u4,view,galaxy --itemIDPosition 2 \ # column that has the item ID --rowIDPosition 0 \ # column that has the user ID --filterPosition 1 # column that has the filter word -</code></pre></div></div> +</code></pre> <p>###Output</p> <p>The output of the job will be the standard text version of two Mahout DRMs. This is a case where we are calculating cross-cooccurrence so a primary indicator matrix and cross-cooccurrence indicator matrix will be created</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>out-path +<pre><code>out-path |-- similarity-matrix - TDF part files \-- cross-similarity-matrix - TDF part-files -</code></pre></div></div> +</code></pre> <p>The indicator matrix will contain the lines:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>galaxy\tnexus:1.7260924347106847 +<pre><code>galaxy\tnexus:1.7260924347106847 ipad\tiphone:1.7260924347106847 nexus\tgalaxy:1.7260924347106847 iphone\tipad:1.7260924347106847 surface -</code></pre></div></div> +</code></pre> <p>The cross-cooccurrence indicator matrix will contain:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847 +<pre><code>iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847 ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897 nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897 galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847 surface\tsurface:4.498681156950466 nexus:0.6795961471815897 -</code></pre></div></div> +</code></pre> <p><strong>Note:</strong> You can run this multiple times to use more than two actions or you can use the underlying SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any number of cross-cooccurrence indicators.</p> @@ -479,7 +479,7 @@ SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any n <p>A common method of storing data is in log files. If they are written using some delimiter they can be consumed directly by spark-itemsimilarity. 
For instance input of the form:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone +<pre><code>2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone 2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad 2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus 2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy @@ -499,11 +499,11 @@ SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any n 2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone 2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad 2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy -</code></pre></div></div> +</code></pre> <p>Can be parsed with the following CLI and run on the cluster producing the same output as the above example.</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bash$ mahout spark-itemsimilarity \ +<pre><code>bash$ mahout spark-itemsimilarity \ --input in-file \ --output out-path \ --master spark://sparkmaster:4044 \ @@ -513,7 +513,7 @@ SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any n --itemIDPosition 4 \ --rowIDPosition 1 \ --filterPosition 2 -</code></pre></div></div> +</code></pre> <p>##2. spark-rowsimilarity</p> @@ -528,7 +528,7 @@ by a list of the most similar rows.</p> <p>The command line interface is:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark-rowsimilarity Mahout 1.0 +<pre><code>spark-rowsimilarity Mahout 1.0 Usage: spark-rowsimilarity [options] Input, output options @@ -574,7 +574,7 @@ Spark config options: -h | --help prints this usage text -</code></pre></div></div> +</code></pre> <p>See RowSimilarityDriver.scala in Mahoutâs spark module if you want to customize the code.</p> @@ -654,32 +654,32 @@ content or metadata, not by which users interacted with them.</p> <p>For this we need input of the form:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>itemID<tab>list-of-tags +<pre><code>itemID<tab>list-of-tags ... -</code></pre></div></div> +</code></pre> <p>The full collection will look like the tags column from a catalog DB. For our ecom example it might be:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>3459860b<tab>men long-sleeve chambray clothing casual +<pre><code>3459860b<tab>men long-sleeve chambray clothing casual 9446577d<tab>women tops chambray clothing casual ... -</code></pre></div></div> +</code></pre> <p>Weâll use <em>spark-rowimilairity</em> because we are looking for similar rows, which encode items in this case. As with the collaborative filtering indicator and cross-cooccurrence indicator we use the âomitStrength option. The strengths created are probabilistic log-likelihood ratios and so are used to filter unimportant similarities. Once the filtering or downsampling is finished we no longer need the strengths. We will get an indicator matrix of the form:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>itemID<tab>list-of-item IDs +<pre><code>itemID<tab>list-of-item IDs ... 
-</code></pre></div></div> +</code></pre> <p>This is a content indicator since it has found other items with similar content or metadata.</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>3459860b<tab>3459860b 3459860b 6749860c 5959860a 3434860a 3477860a +<pre><code>3459860b<tab>3459860b 3459860b 6749860c 5959860a 3434860a 3477860a 9446577d<tab>9446577d 9496577d 0943577d 8346577d 9442277d 9446577e ... -</code></pre></div></div> +</code></pre> <p>We now have three indicators, two collaborative filtering type and one content type.</p> @@ -691,11 +691,11 @@ For a given user, map their history of an action or content to the correct indic <p>We have 3 indicators, these are indexed by the search engine into 3 fields, weâll call them âpurchaseâ, âviewâ, and âtagsâ. We take the userâs history that corresponds to each indicator and create a query of the form:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Query: +<pre><code>Query: field: purchase; q:user's-purchase-history field: view; q:user's view-history field: tags; q:user's-tags-associated-with-purchases -</code></pre></div></div> +</code></pre> <p>The query will result in an ordered list of items recommended for purchase but skewed towards items with similar tags to the ones the user has already purchased.</p> @@ -707,11 +707,11 @@ by tagging items with some category of popularity (hot, warm, cold for instance) index that as a new indicator field and include the corresponding value in a query on the popularity field. If we use the ecom example but use the query to get âhotâ recommendations it might look like this:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Query: +<pre><code>Query: field: purchase; q:user's-purchase-history field: view; q:user's view-history field: popularity; q:"hot" -</code></pre></div></div> +</code></pre> <p>This will return recommendations favoring ones that have the intrinsic indicator âhotâ.</p> http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/recommender/intro-itembased-hadoop.html ---------------------------------------------------------------------- diff --git a/users/recommender/intro-itembased-hadoop.html b/users/recommender/intro-itembased-hadoop.html index 887b260..bb9b618 100644 --- a/users/recommender/intro-itembased-hadoop.html +++ b/users/recommender/intro-itembased-hadoop.html @@ -317,8 +317,8 @@ <p>Assuming your <em>JAVA_HOME</em> is appropriately set and Mahout was installed properly weâre ready to configure our syntax. Enter the following command:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /path/to/input/file -o /path/to/desired/output --numRecommendations 25 -</code></pre></div></div> +<pre><code>$ mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /path/to/input/file -o /path/to/desired/output --numRecommendations 25 +</code></pre> <p>Running the command will execute a series of jobs the final product of which will be an output file deposited to the directory specified in the command syntax. 
The output file will contain two columns: the <em>userID</em> and an array of <em>itemIDs</em> and scores.</p> http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/recommender/matrix-factorization.html ---------------------------------------------------------------------- diff --git a/users/recommender/matrix-factorization.html b/users/recommender/matrix-factorization.html index f15fa8d..a1c28d6 100644 --- a/users/recommender/matrix-factorization.html +++ b/users/recommender/matrix-factorization.html @@ -282,39 +282,39 @@ There are many different matrix decompositions, each finds use among a particula <p>In Mahout, the SVDRecommender provides an interface to build a recommender based on matrix factorization. The idea behind it is to project the users and items onto a feature space and try to optimize U and M so that U * (M^t) is as close to R as possible:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> U is n * p user feature matrix, +<pre><code> U is n * p user feature matrix, M is m * p item feature matrix, M^t is the conjugate transpose of M, R is n * m rating matrix, n is the number of users, m is the number of items, p is the number of features -</code></pre></div></div> +</code></pre> <p>We usually use RMSE to represent the deviations between predictions and actual ratings. RMSE is defined as the square root of the sum of squared errors over all known user-item ratings. So our matrix factorization target could be mathematically defined as:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> find U and M, (U, M) = argmin(RMSE) = argmin(pow(SSE / K, 0.5)) +<pre><code> find U and M, (U, M) = argmin(RMSE) = argmin(pow(SSE / K, 0.5)) SSE = sum(e(u,i)^2) e(u,i) = r(u, i) - U[u,] * (M[i,]^t) = r(u,i) - sum(U[u,f] * M[i,f]), f = 0, 1, .. p - 1 K is the number of known user item ratings. -</code></pre></div></div> +</code></pre> <p><a name="MatrixFactorization-Factorizers"></a></p> <p>Mahout has implemented matrix factorization based on</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(1) SGD(Stochastic Gradient Descent) +<pre><code>(1) SGD(Stochastic Gradient Descent) (2) ALSWR(Alternating-Least-Squares with Weighted-λ-Regularization). -</code></pre></div></div> +</code></pre> <h2 id="sgd">SGD</h2> <p>Stochastic gradient descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Q(w) = sum(Q_i(w)), -</code></pre></div></div> +<pre><code> Q(w) = sum(Q_i(w)), +</code></pre> <p>where w is the parameter vector to be estimated, Q(w) is the objective function that could be expressed as a sum of differentiable functions, @@ -322,14 +322,14 @@ So our matrix factorization target could be mathematically defined as:</p> <p>In practice, w is estimated using an iterative method, one sample at a time, until an approximate minimum is obtained,</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> w = w - alpha * (d(Q_i(w))/dw), where alpha is the learning rate, +<pre><code> w = w - alpha * (d(Q_i(w))/dw), where alpha is the learning rate, (d(Q_i(w))/dw) is the first derivative of Q_i(w) with respect to w. 
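     e.g. (an illustrative case only): if Q_i(w) = (r_i - w * x_i)^2 / 2 for a single observed value r_i, then d(Q_i(w))/dw = -(r_i - w * x_i) * x_i and the update becomes w = w + alpha * (r_i - w * x_i) * x_i.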
-</code></pre></div></div> +</code></pre> <p>In matrix factorization, the RatingSGDFactorizer class implements the SGD with w = (U, M) and objective function Q(w) = sum(Q(u,i)),</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Q(u,i) = sum(e(u,i) * e(u,i)) / 2 + lambda * [(U[u,] * (U[u,]^t)) + (M[i,] * (M[i,]^t))] / 2 -</code></pre></div></div> +<pre><code> Q(u,i) = sum(e(u,i) * e(u,i)) / 2 + lambda * [(U[u,] * (U[u,]^t)) + (M[i,] * (M[i,]^t))] / 2 +</code></pre> <p>where Q(u, i) is the objecive function for user u and item i, e(u, i) is the error between predicted rating and actual rating, @@ -339,7 +339,7 @@ So our matrix factorization target could be mathmatically defined as:</p> <p>The algorithm is sketched as follows:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> init U and M with randomized value between 0.0 and 1.0 with standard Gaussian distribution +<pre><code> init U and M with randomized value between 0.0 and 1.0 with standard Gaussian distribution for(iter = 0; iter < numIterations; iter++) { @@ -360,7 +360,7 @@ So our matrix factorization target could be mathmatically defined as:</p> U[u,] = NU[u,] } } -</code></pre></div></div> +</code></pre> <h2 id="svd">SVD++</h2> @@ -370,18 +370,18 @@ So our matrix factorization target could be mathmatically defined as:</p> <p>The complete model is a sum of 3 sub-models with complete prediction formula as follows:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pr(u,i) = b[u,i] + fm + nm //user u and item i +<pre><code>pr(u,i) = b[u,i] + fm + nm //user u and item i pr(u,i) is the predicted rating of user u on item i, b[u,i] = U + b(u) + b(i) fm = (q[i,]) * (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])), j is an item in N(u) nm = pow(|R(i;u;k)|, -0.5) * sum((r[u,j0] - b[u,j0]) * w[i,j0]) + pow(|N(i;u;k)|, -0.5) * sum(c[i,j1]), j0 is an item in R(i;u;k), j1 is an item in N(i;u;k) -</code></pre></div></div> +</code></pre> <p>The associated regularized squared error function to be minimized is:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{sum((r[u,i] - pr[u,i]) * (r[u,i] - pr[u,i])) - lambda * (b(u) * b(u) + b(i) * b(i) + ||q[i,]||^2 + ||p[u,]||^2 + sum(||y[j,]||^2) + sum(w[i,j0] * w[i,j0]) + sum(c[i,j1] * c[i,j1]))} -</code></pre></div></div> +<pre><code>{sum((r[u,i] - pr[u,i]) * (r[u,i] - pr[u,i])) - lambda * (b(u) * b(u) + b(i) * b(i) + ||q[i,]||^2 + ||p[u,]||^2 + sum(||y[j,]||^2) + sum(w[i,j0] * w[i,j0]) + sum(c[i,j1] * c[i,j1]))} +</code></pre> <p>b[u,i] is the baseline estimate of user uâs predicted rating on item i. 
U is usersâ overall average rating and b(u) and b(i) indicate the observed deviations of user u and item iâs ratings from average.</p> @@ -420,12 +420,12 @@ please refer to the paper <a href="http://research.yahoo.com/files/kdd08koren.pd <p>The update to q, p, y in each gradient descent step is:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> err(u,i) = r[u,i] - pr[u,i] +<pre><code> err(u,i) = r[u,i] - pr[u,i] q[i,] = q[i,] + alpha * (err(u,i) * (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])) - lamda * q[i,]) p[u,] = p[u,] + alpha * (err(u,i) * q[i,] - lambda * p[u,]) for j that is an item in N(u): y[j,] = y[j,] + alpha * (err(u,i) * pow(|N(u)|, -0.5) * q[i,] - lambda * y[j,]) -</code></pre></div></div> +</code></pre> <p>where alpha is the learning rate of gradient descent, N(u) is the items that user u has expressed preference.</p> @@ -443,8 +443,8 @@ vanilla SGD.</p> <p>ALSWR is an iterative algorithm to solve the low rank factorization of user feature matrix U and item feature matrix M.<br /> The loss function to be minimized is formulated as the sum of squared errors plus <a href="http://en.wikipedia.org/wiki/Tikhonov_regularization">Tikhonov regularization</a>:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> L(R, U, M) = sum(pow((R[u,i] - U[u,]* (M[i,]^t)), 2)) + lambda * (sum(n(u) * ||U[u,]||^2) + sum(n(i) * ||M[i,]||^2)) -</code></pre></div></div> +<pre><code> L(R, U, M) = sum(pow((R[u,i] - U[u,]* (M[i,]^t)), 2)) + lambda * (sum(n(u) * ||U[u,]||^2) + sum(n(i) * ||M[i,]||^2)) +</code></pre> <p>At the beginning of the algorithm, M is initialized with the average item ratings as its first row and random numbers for the rest row.</p> @@ -454,14 +454,14 @@ the cost function similarly. The iteration stops until a certain stopping criter <p>To solve the matrix U when M is given, each userâs feature vector is calculated by resolving a regularized linear least square error function using the items the user has rated and their feature vectors:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1/2 * d(L(R,U,M)) / d(U[u,f]) = 0 -</code></pre></div></div> +<pre><code> 1/2 * d(L(R,U,M)) / d(U[u,f]) = 0 +</code></pre> <p>Similary, when M is updated, we resolve a regularized linear least square error function using feature vectors of the users that have rated the item and their feature vectors:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1/2 * d(L(R,U,M)) / d(M[i,f]) = 0 -</code></pre></div></div> +<pre><code> 1/2 * d(L(R,U,M)) / d(M[i,f]) = 0 +</code></pre> <p>The ALSWRFactorizer class is a non-distributed implementation of ALSWR using multi-threading to dispatch the computation among several threads. Mahout also offers a <a href="https://mahout.apache.org/users/recommender/intro-als-hadoop.html">parallel map-reduce implementation</a>.</p> http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/recommender/recommender-documentation.html ---------------------------------------------------------------------- diff --git a/users/recommender/recommender-documentation.html b/users/recommender/recommender-documentation.html index d1a4b44..55a508b 100644 --- a/users/recommender/recommender-documentation.html +++ b/users/recommender/recommender-documentation.html @@ -375,34 +375,34 @@ Weâll start with an example of this.</p> on data in a file. The file should be in CSV format, with lines of the form âuserID,itemID,prefValueâ (e.g. 
â39505,290002,3.5â):</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DataModel model = new FileDataModel(new File("data.txt")); -</code></pre></div></div> +<pre><code>DataModel model = new FileDataModel(new File("data.txt")); +</code></pre> <p>Weâll use the <strong>PearsonCorrelationSimilarity</strong> implementation of <strong>UserSimilarity</strong> as our user correlation algorithm, and add an optional preference inference algorithm:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model); -</code></pre></div></div> +<pre><code>UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model); +</code></pre> <p>Now we create a <strong>UserNeighborhood</strong> algorithm. Here we use nearest-3:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>UserNeighborhood neighborhood = +<pre><code>UserNeighborhood neighborhood = new NearestNUserNeighborhood(3, userSimilarity, model);{code} -</code></pre></div></div> +</code></pre> <p>Now we can create our <strong>Recommender</strong>, and add a caching decorator:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Recommender recommender = +<pre><code>Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, userSimilarity); Recommender cachingRecommender = new CachingRecommender(recommender); -</code></pre></div></div> +</code></pre> <p>Now we can get 10 recommendations for user ID â1234â â done!</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>List<RecommendedItem> recommendations = +<pre><code>List<RecommendedItem> recommendations = cachingRecommender.recommend(1234, 10); -</code></pre></div></div> +</code></pre> <h2 id="item-based-recommender">Item-based Recommender</h2> @@ -417,8 +417,8 @@ are more appropriate.</p> <p>Letâs start over, again with a <strong>FileDataModel</strong> to start:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DataModel model = new FileDataModel(new File("data.txt")); -</code></pre></div></div> +<pre><code>DataModel model = new FileDataModel(new File("data.txt")); +</code></pre> <p>Weâll also need an <strong>ItemSimilarity</strong>. We could use <strong>PearsonCorrelationSimilarity</strong>, which computes item similarity in realtime, @@ -426,22 +426,22 @@ but, this is generally too slow to be useful. Instead, in a real application, you would feed a list of pre-computed correlations to a <strong>GenericItemSimilarity</strong>:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Construct the list of pre-computed correlations +<pre><code>// Construct the list of pre-computed correlations Collection<GenericItemSimilarity.ItemItemSimilarity> correlations = ...; ItemSimilarity itemSimilarity = new GenericItemSimilarity(correlations); -</code></pre></div></div> +</code></pre> <p>Then we can finish as before to produce recommendations:</p> -<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Recommender recommender = +<pre><code>Recommender recommender = new GenericItemBasedRecommender(model, itemSimilarity); Recommender cachingRecommender = new CachingRecommender(recommender); ... 
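// the CachingRecommender decorator reuses previously computed recommendations for a user across repeated calls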
List<RecommendedItem> recommendations = cachingRecommender.recommend(1234, 10);
-</code></pre></div></div>
+</code></pre>
<p><a name="RecommenderDocumentation-Integrationwithyourapplication"></a></p>
<h2 id="integration-with-your-application">Integration with your application</h2>
@@ -499,7 +499,7 @@ provides a good starting point.</p>
<p>Fortunately, Mahout provides a way to evaluate the accuracy of your Recommender on your own data, in org.apache.mahout.cf.taste.eval.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DataModel myModel = ...;
+<pre><code>DataModel myModel = ...;
RecommenderBuilder builder = new RecommenderBuilder() {
  public Recommender buildRecommender(DataModel model) {
    // build and return the Recommender to evaluate here
@@ -508,7 +508,7 @@ RecommenderBuilder builder = new RecommenderBuilder() {
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
double evaluation = evaluator.evaluate(builder, myModel, 0.9, 1.0);
-</code></pre></div></div>
+</code></pre>
<p>For "boolean" data model situations, where there are no notions of preference value, the above evaluation based on estimated preference does
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/sparkbindings/faq.html
----------------------------------------------------------------------
diff --git a/users/sparkbindings/faq.html b/users/sparkbindings/faq.html
index 6f3627c..4665f3a 100644
--- a/users/sparkbindings/faq.html
+++ b/users/sparkbindings/faq.html
@@ -289,10 +289,10 @@ the classpath is sane and is made available to Mahout:</p>
<ol>
  <li>Check that Spark is the correct version (same as in Mahout's poms), is compiled and SPARK_HOME is set.</li>
  <li>Check that Mahout is compiled and MAHOUT_HOME is set.</li>
-  <li>Run <code class="highlighter-rouge">$SPARK_HOME/bin/compute-classpath.sh</code> and make sure it produces a sane result with no errors.
If it outputs something other than a straightforward classpath string, most likely Spark is not compiled/set correctly (later Spark versions require
-<code class="highlighter-rouge">sbt/sbt assembly</code> to be run; simply running <code class="highlighter-rouge">sbt/sbt publish-local</code> is not enough any longer).</li>
-  <li>Run <code class="highlighter-rouge">$MAHOUT_HOME/bin/mahout -spark classpath</code> and check that the path reported in step (3) is included.</li>
+  <li>Run <code>$SPARK_HOME/bin/compute-classpath.sh</code> and make sure it produces a sane result with no errors. If it outputs something other than a straightforward classpath string, most likely Spark is not compiled/set correctly (later Spark versions require
+<code>sbt/sbt assembly</code> to be run; simply running <code>sbt/sbt publish-local</code> is not enough any longer).</li>
+  <li>Run <code>$MAHOUT_HOME/bin/mahout -spark classpath</code> and check that the path reported in step (3) is included.</li>
</ol>
<p><strong>Q: I am using the command line Mahout jobs that run on Spark or am writing my own application that uses
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/sparkbindings/home.html
----------------------------------------------------------------------
diff --git a/users/sparkbindings/home.html b/users/sparkbindings/home.html
index 234f4c2..c85c2ec 100644
--- a/users/sparkbindings/home.html
+++ b/users/sparkbindings/home.html
@@ -279,14 +279,14 @@
<p>In short, Scala & Spark Bindings for Mahout is a Scala DSL and algebraic optimizer for something like this (an actual formula from <strong>(d)spca</strong>)</p>
-<p><code class="highlighter-rouge">\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]</code></p>
+<p><code>\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]</code></p>
<p>bound to in-core and distributed computations (currently, on Apache Spark).</p>
<p>The Mahout Scala & Spark Bindings expression of the above:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
-</code></pre></div></div>
+<pre><code> val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
+</code></pre>
<p>The main idea is that a scientist writing algebraic expressions could not care less about distributed operation plans and works <strong>entirely on the logical level</strong> just like he or she would do with R.</p>
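<p>As a side note (an illustrative sketch, not taken from this page), the same R-like expression can first be prototyped on small in-core matrices in the driver JVM and later be pointed at DRMs without rewriting it:</p>
<pre><code>import org.apache.mahout.math._
import scalabindings._
import RLikeOps._

// small, made-up in-core stand-ins for the distributed operands
val bt  = dense((1.0, 2.0), (3.0, 4.0))
val c   = dense((0.5, 0.5), (0.5, 0.5))
val s_q = dvec(1.0, 2.0)
val xi  = dvec(0.25, 0.75)

// the very same formula as above, evaluated entirely in-core
val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
</code></pre>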
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/sparkbindings/play-with-shell.html
----------------------------------------------------------------------
diff --git a/users/sparkbindings/play-with-shell.html b/users/sparkbindings/play-with-shell.html
index 694e3e1..a05e764 100644
--- a/users/sparkbindings/play-with-shell.html
+++ b/users/sparkbindings/play-with-shell.html
@@ -375,15 +375,15 @@
<ol>
  <li>Download <a href="http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz">Apache Spark 1.6.2</a> and unpack the archive file</li>
-  <li>Change to the directory where you unpacked Spark and type <code class="highlighter-rouge">sbt/sbt assembly</code> to build it</li>
-  <li>Create a directory for Mahout somewhere on your machine, change into it and check out the master branch of Apache Mahout from GitHub <code class="highlighter-rouge">git clone https://github.com/apache/mahout mahout</code></li>
-  <li>Change to the <code class="highlighter-rouge">mahout</code> directory and build Mahout using <code class="highlighter-rouge">mvn -DskipTests clean install</code></li>
+  <li>Change to the directory where you unpacked Spark and type <code>sbt/sbt assembly</code> to build it</li>
+  <li>Create a directory for Mahout somewhere on your machine, change into it and check out the master branch of Apache Mahout from GitHub <code>git clone https://github.com/apache/mahout mahout</code></li>
+  <li>Change to the <code>mahout</code> directory and build Mahout using <code>mvn -DskipTests clean install</code></li>
</ol>
<h2 id="starting-mahouts-spark-shell">Starting Mahout's Spark shell</h2>
<ol>
-  <li>Go to the directory where you unpacked Spark and type <code class="highlighter-rouge">sbin/start-all.sh</code> to locally start Spark</li>
+  <li>Go to the directory where you unpacked Spark and type <code>sbin/start-all.sh</code> to locally start Spark</li>
  <li>Open a browser, point it to <a href="http://localhost:8080/">http://localhost:8080/</a> to check whether Spark successfully started. Copy the URL of the Spark master at the top of the page (it starts with <strong>spark://</strong>)</li>
  <li>Define the following environment variables: <pre class="codehilite">export MAHOUT_HOME=[directory into which you checked out Mahout]
export SPARK_HOME=[directory where you unpacked Spark]
@@ -391,8 +391,8 @@ export MASTER=[url of the Spark master]</li>
</ol>
<p></pre></p>
<ol>
-  <li>Finally, change to the directory where you unpacked Mahout and type <code class="highlighter-rouge">bin/mahout spark-shell</code>;
-you should see the shell starting and get the prompt <code class="highlighter-rouge">mahout></code>. Check
+  <li>Finally, change to the directory where you unpacked Mahout and type <code>bin/mahout spark-shell</code>;
+you should see the shell starting and get the prompt <code>mahout></code>. Check
<a href="http://mahout.apache.org/users/sparkbindings/faq.html">FAQ</a> for further troubleshooting.</li>
</ol>
@@ -402,7 +402,7 @@ you should see the shell starting and get the prompt <code class="highlighter-ro
<p><em>Note: You can incrementally follow the example by copy-and-pasting the code into your running Mahout shell.</em></p>
-<p>Mahout's linear algebra DSL has an abstraction called <em>DistributedRowMatrix (DRM)</em> which models a matrix that is partitioned by rows and stored in the memory of a cluster of machines. We use <code class="highlighter-rouge">dense()</code> to create a dense in-memory matrix from our toy dataset and use <code class="highlighter-rouge">drmParallelize</code> to load it into the cluster, "mimicking" a large, partitioned dataset.</p>
+<p>Mahout's linear algebra DSL has an abstraction called <em>DistributedRowMatrix (DRM)</em> which models a matrix that is partitioned by rows and stored in the memory of a cluster of machines. We use <code>dense()</code> to create a dense in-memory matrix from our toy dataset and use <code>drmParallelize</code> to load it into the cluster, "mimicking" a large, partitioned dataset.</p>
<div class="codehilite"><pre>
val drmData = drmParallelize(dense(
@@ -421,47 +421,47 @@ val drmData = drmParallelize(dense(
<p>Have a look at this matrix. The first four columns represent the ingredients (our features) and the last column (the rating) is the target variable for our regression.
<a href="https://en.wikipedia.org/wiki/Linear_regression">Linear regression</a>
-assumes that the <strong>target variable</strong> <code class="highlighter-rouge">\(\mathbf{y}\)</code> is generated by the
-linear combination of <strong>the feature matrix</strong> <code class="highlighter-rouge">\(\mathbf{X}\)</code> with the
-<strong>parameter vector</strong> <code class="highlighter-rouge">\(\boldsymbol{\beta}\)</code> plus the
- <strong>noise</strong> <code class="highlighter-rouge">\(\boldsymbol{\varepsilon}\)</code>, summarized in the formula
-<code class="highlighter-rouge">\(\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\)</code>.
+assumes that the <strong>target variable</strong> <code>\(\mathbf{y}\)</code> is generated by the
+linear combination of <strong>the feature matrix</strong> <code>\(\mathbf{X}\)</code> with the
+<strong>parameter vector</strong> <code>\(\boldsymbol{\beta}\)</code> plus the
+ <strong>noise</strong> <code>\(\boldsymbol{\varepsilon}\)</code>, summarized in the formula
+<code>\(\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\)</code>.
Our goal is to find an estimate of the parameter vector
-<code class="highlighter-rouge">\(\boldsymbol{\beta}\)</code> that explains the data very well.</p>
+<code>\(\boldsymbol{\beta}\)</code> that explains the data very well.</p>
-<p>As a first step, we extract <code class="highlighter-rouge">\(\mathbf{X}\)</code> and <code class="highlighter-rouge">\(\mathbf{y}\)</code> from our data matrix. We get <em>X</em> by slicing: we take all rows (denoted by <code class="highlighter-rouge">::</code>) and the first four columns, which have the ingredients in milligrams as content. Note that the result is again a DRM. The shell will not execute this code yet; it saves the history of operations and defers the execution until we really access a result. <strong>Mahout's DSL automatically optimizes and parallelizes all operations on DRMs and runs them on Apache Spark.</strong></p>
+<p>As a first step, we extract <code>\(\mathbf{X}\)</code> and <code>\(\mathbf{y}\)</code> from our data matrix. We get <em>X</em> by slicing: we take all rows (denoted by <code>::</code>) and the first four columns, which have the ingredients in milligrams as content. Note that the result is again a DRM. The shell will not execute this code yet; it saves the history of operations and defers the execution until we really access a result. <strong>Mahout's DSL automatically optimizes and parallelizes all operations on DRMs and runs them on Apache Spark.</strong></p>
<div class="codehilite"><pre>
val drmX = drmData(::, 0 until 4)
</pre></div>
-<p>Next, we extract the target variable vector <em>y</em>, the fifth column of the data matrix. We assume this one fits into our driver machine, so we fetch it into memory using <code class="highlighter-rouge">collect</code>:</p>
+<p>Next, we extract the target variable vector <em>y</em>, the fifth column of the data matrix. We assume this one fits into our driver machine, so we fetch it into memory using <code>collect</code>:</p>
<div class="codehilite"><pre>
val y = drmData.collect(::, 4)
</pre></div>
-<p>Now we are ready to think about a mathematical way to estimate the parameter vector <em>β</em>. A simple textbook approach is <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">ordinary least squares (OLS)</a>, which minimizes the sum of residual squares between the true target variable and the prediction of the target variable.
In OLS, there is even a closed form expression for estimating <code class="highlighter-rouge">\(\boldsymbol{\beta}\)</code> as
-<code class="highlighter-rouge">\(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}\)</code>.</p>
+<p>Now we are ready to think about a mathematical way to estimate the parameter vector <em>β</em>. A simple textbook approach is <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">ordinary least squares (OLS)</a>, which minimizes the sum of residual squares between the true target variable and the prediction of the target variable. In OLS, there is even a closed form expression for estimating <code>\(\boldsymbol{\beta}\)</code> as
+<code>\(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}\)</code>.</p>
-<p>The first thing we compute for this is <code class="highlighter-rouge">\(\mathbf{X}^{\top}\mathbf{X}\)</code>. The code for doing this in Mahout's Scala DSL maps directly to the mathematical formula. The operation <code class="highlighter-rouge">.t()</code> transposes a matrix and, analogous to R, <code class="highlighter-rouge">%*%</code> denotes matrix multiplication.</p>
+<p>The first thing we compute for this is <code>\(\mathbf{X}^{\top}\mathbf{X}\)</code>. The code for doing this in Mahout's Scala DSL maps directly to the mathematical formula. The operation <code>.t()</code> transposes a matrix and, analogous to R, <code>%*%</code> denotes matrix multiplication.</p>
<div class="codehilite"><pre>
val drmXtX = drmX.t %*% drmX
</pre></div>
-<p>The same is true for computing <code class="highlighter-rouge">\(\mathbf{X}^{\top}\mathbf{y}\)</code>. We can simply type the math as Scala expressions into the shell. Here, <em>X</em> lives in the cluster, while <em>y</em> is in the memory of the driver, and the result is a DRM again.</p>
+<p>The same is true for computing <code>\(\mathbf{X}^{\top}\mathbf{y}\)</code>. We can simply type the math as Scala expressions into the shell. Here, <em>X</em> lives in the cluster, while <em>y</em> is in the memory of the driver, and the result is a DRM again.</p>
<div class="codehilite"><pre>
val drmXty = drmX.t %*% y
</pre></div>
-<p>We're nearly done. The next step we take is to fetch <code class="highlighter-rouge">\(\mathbf{X}^{\top}\mathbf{X}\)</code> and
-<code class="highlighter-rouge">\(\mathbf{X}^{\top}\mathbf{y}\)</code> into the memory of our driver machine (we are targeting
+<p>We're nearly done. The next step we take is to fetch <code>\(\mathbf{X}^{\top}\mathbf{X}\)</code> and
+<code>\(\mathbf{X}^{\top}\mathbf{y}\)</code> into the memory of our driver machine (we are targeting
feature matrices that are tall and skinny,
-so we can assume that <code class="highlighter-rouge">\(\mathbf{X}^{\top}\mathbf{X}\)</code> is small enough
+so we can assume that <code>\(\mathbf{X}^{\top}\mathbf{X}\)</code> is small enough
to fit in).
Then, we provide them to an in-memory solver (Mahout provides
-an analog to R's <code class="highlighter-rouge">solve()</code> for that) which computes <code class="highlighter-rouge">beta</code>, our
-OLS estimate of the parameter vector <code class="highlighter-rouge">\(\boldsymbol{\beta}\)</code>.</p>
+an analog to R's <code>solve()</code> for that) which computes <code>beta</code>, our
+OLS estimate of the parameter vector <code>\(\boldsymbol{\beta}\)</code>.</p>
<div class="codehilite"><pre>
val XtX = drmXtX.collect
@@ -478,9 +478,9 @@ as much as possible, while still retaining decent performance and
scalability.</p>
<p>We can now check how well our model fits its training data.
-First, we multiply the feature matrix <code class="highlighter-rouge">\(\mathbf{X}\)</code> by our estimate of
-<code class="highlighter-rouge">\(\boldsymbol{\beta}\)</code>. Then, we look at the difference (via L2-norm) of
-the target variable <code class="highlighter-rouge">\(\mathbf{y}\)</code> to the fitted target variable:</p>
+First, we multiply the feature matrix <code>\(\mathbf{X}\)</code> by our estimate of
+<code>\(\boldsymbol{\beta}\)</code>. Then, we look at the difference (via L2-norm) of
+the target variable <code>\(\mathbf{y}\)</code> to the fitted target variable:</p>
<div class="codehilite"><pre>
val yFitted = (drmX %*% beta).collect(::, 0)
@@ -489,7 +489,7 @@ val yFitted = (drmX %*% beta).collect(::, 0)
<p>We hope that we could show that Mahout's shell allows people to interactively and incrementally write algorithms. We have entered a lot of individual commands, one-by-one, until we got the desired results. We can now refactor a little by wrapping our statements into easy-to-use functions. The definition of functions follows standard Scala syntax.</p>
-<p>We put all the commands for ordinary least squares into a function <code class="highlighter-rouge">ols</code>.</p>
+<p>We put all the commands for ordinary least squares into a function <code>ols</code>.</p>
<div class="codehilite"><pre>
def ols(drmX: DrmLike[Int], y: Vector) =
</pre></div>
-<p>Note that the DSL declares an implicit <code class="highlighter-rouge">collect</code> if coercion rules require an in-core argument. Hence, we can simply
-skip explicit <code class="highlighter-rouge">collect</code>s.</p>
+<p>Note that the DSL declares an implicit <code>collect</code> if coercion rules require an in-core argument. Hence, we can simply
+skip explicit <code>collect</code>s.</p>
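<p>To illustrate that implicit collection (a hypothetical snippet, not part of the page; <code>beta2</code> is just an illustrative name), the distributed products from above can be handed directly to the in-core solver:</p>
<div class="codehilite"><pre>
// drmXtX and drmXty are DRMs; the implicit coercion collects them into
// in-core matrices on the driver before solve() runs
val beta2 = solve(drmXtX, drmXty)(::, 0)
</pre></div>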
-<p>Next, we define a function <code class="highlighter-rouge">goodnessOfFit</code> that tells how well a model fits the target variable:</p>
+<p>Next, we define a function <code>goodnessOfFit</code> that tells how well a model fits the target variable:</p>
<div class="codehilite"><pre>
def goodnessOfFit(drmX: DrmLike[Int], beta: Vector, y: Vector) = {
@@ -513,7 +513,7 @@ def goodnessOfFit(drmX: DrmLike[Int], beta: Vector, y: Vector) = {
model. Usually there is a constant bias term added to the model. Without that, our model always crosses through the origin and we only learn the right angle. An easy way to add such a bias term to our model is to add a
-column of ones to the feature matrix <code class="highlighter-rouge">\(\mathbf{X}\)</code>.
+column of ones to the feature matrix <code>\(\mathbf{X}\)</code>.
The corresponding weight in the parameter vector will then be the bias term.</p>
<p>Here is how we add a bias column:</p>
@@ -522,14 +522,14 @@
val drmXwithBiasColumn = drmX cbind 1
</pre></div>
-<p>Now we can give the newly created DRM <code class="highlighter-rouge">drmXwithBiasColumn</code> to our model fitting method <code class="highlighter-rouge">ols</code> and see how well the resulting model fits the training data with <code class="highlighter-rouge">goodnessOfFit</code>. You should see a large improvement in the result.</p>
+<p>Now we can give the newly created DRM <code>drmXwithBiasColumn</code> to our model fitting method <code>ols</code> and see how well the resulting model fits the training data with <code>goodnessOfFit</code>. You should see a large improvement in the result.</p>
<div class="codehilite"><pre>
val betaWithBiasTerm = ols(drmXwithBiasColumn, y)
goodnessOfFit(drmXwithBiasColumn, betaWithBiasTerm, y)
</pre></div>
-<p>As a further optimization, we can make use of the DSL's caching functionality. We use <code class="highlighter-rouge">drmXwithBiasColumn</code> repeatedly as input to a computation, so it might be beneficial to cache it in memory. This is achieved by calling <code class="highlighter-rouge">checkpoint()</code>. In the end, we remove it from the cache with uncache:</p>
+<p>As a further optimization, we can make use of the DSL's caching functionality. We use <code>drmXwithBiasColumn</code> repeatedly as input to a computation, so it might be beneficial to cache it in memory. This is achieved by calling <code>checkpoint()</code>. In the end, we remove it from the cache with uncache:</p>
<div class="codehilite"><pre>
val cachedDrmX = drmXwithBiasColumn.checkpoint()