http://git-wip-us.apache.org/repos/asf/mahout/blob/0e718ec9/website/oldsite/_site/users/algorithms/d-spca.html ---------------------------------------------------------------------- diff --git a/website/oldsite/_site/users/algorithms/d-spca.html b/website/oldsite/_site/users/algorithms/d-spca.html new file mode 100644 index 0000000..140f174 --- /dev/null +++ b/website/oldsite/_site/users/algorithms/d-spca.html @@ -0,0 +1,482 @@ + + +<!DOCTYPE html> +<!-- + + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> + <title>Apache Mahout: Scalable machine learning and data mining</title> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + <meta name="Distribution" content="Global"> + <meta name="Robots" content="index,follow"> + <meta name="keywords" content="apache, apache hadoop, apache lucene, + business data mining, cluster analysis, + collaborative filtering, data extraction, data filtering, data framework, data integration, + data matching, data mining, data mining algorithms, data mining analysis, data mining data, + data mining introduction, data mining software, + data mining techniques, data representation, data set, datamining, + feature extraction, fuzzy k means, genetic algorithm, hadoop, + hierarchical clustering, high dimensional, introduction to data mining, kmeans, + knowledge discovery, learning approach, learning approaches, learning methods, + learning techniques, lucene, machine learning, machine translation, mahout apache, + mahout taste, map reduce hadoop, mining data, mining methods, naive bayes, + natural language processing, + supervised, text mining, time series data, unsupervised, web data mining"> + <link rel="shortcut icon" type="image/x-icon" href="https://mahout.apache.org/images/favicon.ico"> + <!--<script type="text/javascript" src="/js/prototype.js"></script>--> + <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/prototype/1.7.2.0/prototype.js"></script> + <script type="text/javascript" src="/assets/themes/mahout-retro/js/effects.js"></script> + <script type="text/javascript" src="/assets/themes/mahout-retro/js/search.js"></script> + <script type="text/javascript" src="/assets/themes/mahout-retro/js/slides.js"></script> + + <link href="/assets/themes/mahout-retro/css/bootstrap.min.css" rel="stylesheet" media="screen"> + <link href="/assets/themes/mahout-retro/css/bootstrap-responsive.css" rel="stylesheet"> + <link rel="stylesheet" href="/assets/themes/mahout-retro/css/global.css" type="text/css"> + + <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown --> + <script type="text/x-mathjax-config"> + MathJax.Hub.Config({ + tex2jax: { + skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] + } + }); + MathJax.Hub.Queue(function() { + var all = MathJax.Hub.getAllJax(), i; + for(i = 0; i < all.length; i += 1) { + all[i].SourceElement().parentNode.className += ' has-jax'; + } + }); + </script> + <script type="text/javascript"> + var mathjax = document.createElement('script'); + mathjax.type = 'text/javascript'; + mathjax.async = true; + + mathjax.src = ('https:' == document.location.protocol) ? + 'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' : + 'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; + + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(mathjax, s); + </script> +</head> + +<body id="home" data-twttr-rendered="true"> + <div id="wrap"> + <div id="header"> + <div id="logo"><a href="/"><img src="/assets/img/mahout-logo-brudman.png" alt="Logos for Mahout and Apache Software Foundation" /></a></div> + <div id="search"> + <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right"> + <input value="http://mahout.apache.org" name="sitesearch" type="hidden"> + <input class="search-query" name="q" id="query" type="text"> + <input id="submission" type="image" src="/assets/img/mahout-lupe.png" alt="Search" /> + </form> + </div> + + <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;"> + <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <!-- <a class="brand" href="#">Apache Community Development Project</a> --> + <!--<div class="nav-collapse collapse">--> +<div class="collapse navbar-collapse" id="main-navbar"> + <ul class="nav navbar-nav"> + <!-- <li><a href="/">Home</a></li> --> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">General<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/general/downloads.html">Downloads</a> + <li><a href="/general/who-we-are.html">Who we are</a> + <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a> + <li><a href="/general/release-notes.html">Release Notes</a> + <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li> + <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a> + <li><a href="/general/professional-support.html">Professional Support</a> + <li class="divider"></li> + <li class="nav-header">Resources</li> + <li><a href="/general/reference-reading.html">Reference Reading</a> + <li><a href="/general/faq.html">FAQ</a> + <li class="divider"></li> + <li class="nav-header">Legal</li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + <li><a href="/general/privacy-policy.html">Privacy Policy</a> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/developers/developer-resources.html">Developer resources</a></li> + <li><a href="/developers/version-control.html">Version control</a></li> + <li><a href="/developers/buildingmahout.html">Build from source</a></li> + <li><a href="/developers/issue-tracker.html">Issue tracker</a></li> + <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li> + <li class="divider"></li> + <li class="nav-header">Contributions</li> + <li><a href="/developers/how-to-contribute.html">How to contribute</a></li> + <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li> + <li><a href="/developers/gsoc.html">GSoC</a></li> + <li class="divider"></li> + <li class="nav-header">For committers</li> + <li><a href="/developers/how-to-update-the-website.html">How to update the website</a></li> + <li><a href="/developers/patch-check-list.html">Patch check list</a></li> + <li><a href="/developers/github.html">Handling Github PRs</a></li> + <li><a href="/developers/how-to-release.html">How to release</a></li> + <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Mahout-Samsara<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/sparkbindings/home.html">Scala & Spark Bindings Overview</a></li> + <li><a href="/users/sparkbindings/faq.html">FAQ</a></li> + <li><a href="/users/flinkbindings/playing-with-samsara-flink.html">Flink Bindings Overview</a></li> + <li class="nav-header">Engines</li> + <li><a href="/users/sparkbindings/home.html">Spark</a></li> + <li><a href="/users/environment/h2o-internals.html">H2O</a></li> + <li><a href="/users/flinkbindings/flink-internals.html">Flink</a></li> + <li class="nav-header">References</li> + <li><a href="/users/environment/in-core-reference.html">In-Core Algebraic DSL Reference</a></li> + <li><a href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL Reference</a></li> + <li class="nav-header">Tutorials</li> + <li><a href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark Shell</a></li> + <li><a href="/users/environment/how-to-build-an-app.html">How to build an app</a></li> + <li><a href="/users/environment/classify-a-doc-from-the-shell.html">Building a text classifier in Mahout's Spark Shell</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Algorithms<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/basics/algorithms.html">List of algorithms</a> + <li class="nav-header">Distributed Matrix Decomposition</li> + <li><a href="/users/algorithms/d-qr.html">Cholesky QR</a></li> + <li><a href="/users/algorithms/d-ssvd.html">SSVD</a></li> + <li><a href="/users/algorithms/d-als.html">Distributed ALS</a></li> + <li><a href="/users/algorithms/d-spca.html">SPCA</a></li> + <li class="nav-header">Recommendations</li> + <li><a href="/users/algorithms/recommender-overview.html">Recommender Overview</a></li> + <li><a href="/users/algorithms/intro-cooccurrence-spark.html">Intro to cooccurrence-based<br/> recommendations with Spark</a></li> + <li class="nav-header">Classification</li> + <li><a href="/users/algorithms/spark-naive-bayes.html">Spark Naive Bayes</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">MapReduce Basics<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/basics/algorithms.html">List of algorithms</a> + <li><a href="/users/basics/quickstart.html">Overview</a> + <li class="divider"></li> + <li class="nav-header">Working with text</li> + <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a> + <li><a href="/users/basics/collocations.html">Collocations</a> + <li class="divider"></li> + <li class="nav-header">Dimensionality reduction</li> + <li><a href="/users/dim-reduction/dimensional-reduction.html">Singular Value Decomposition</a></li> + <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li> + <li class="divider"></li> + <li class="nav-header">Topic Models</li> + <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Mahout MapReduce<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li class="nav-header">Classification</li> + <li><a href="/users/classification/bayesian.html">Naive Bayes</a></li> + <li><a href="/users/classification/hidden-markov-models.html">Hidden Markov Models</a></li> + <li><a href="/users/classification/logistic-regression.html">Logistic Regression (Single Machine)</a></li> + <li><a href="/users/classification/partial-implementation.html">Random Forest</a></li> + <li class="nav-header">Classification Examples</li> + <li><a href="/users/classification/breiman-example.html">Breiman example</a></li> + <li><a href="/users/classification/twenty-newsgroups.html">20 newsgroups example</a></li> + <li><a href="/users/classification/bankmarketing-example.html">SGD classifier bank marketing</a></li> + <li><a href="/users/classification/wikipedia-classifier-example.html">Wikipedia XML parser and classifier</a></li> + <li class="nav-header">Clustering</li> + <li><a href="/users/clustering/k-means-clustering.html">k-Means</a></li> + <li><a href="/users/clustering/canopy-clustering.html">Canopy</a></li> + <li><a href="/users/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li> + <li><a href="/users/clustering/streaming-k-means.html">Streaming KMeans</a></li> + <li><a href="/users/clustering/spectral-clustering.html">Spectral Clustering</a></li> + <li class="nav-header">Clustering Commandline usage</li> + <li><a href="/users/clustering/k-means-commandline.html">Options for k-Means</a></li> + <li><a href="/users/clustering/canopy-commandline.html">Options for Canopy</a></li> + <li><a href="/users/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li> + <li class="nav-header">Clustering Examples</li> + <li><a href="/users/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li> + <li class="nav-header">Cluster Post processing</li> + <li><a href="/users/clustering/cluster-dumper.html">Cluster Dumper tool</a></li> + <li><a href="/users/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li> + <li class="nav-header">Recommendations</li> + <li><a href="/users/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li> + <li><a href="/users/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 minutes</a></li> + <li><a href="/users/recommender/matrix-factorization.html">Matrix factorization-based<br/> recommenders</a></li> + <li><a href="/users/recommender/recommender-documentation.html">Overview</a></li> + <li><a href="/users/recommender/intro-itembased-hadoop.html">Intro to item-based recommendations<br/> with Hadoop</a></li> + <li><a href="/users/recommender/intro-als-hadoop.html">Intro to ALS recommendations<br/> with Hadoop</a></li> + </ul> + </li> + <!-- <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a> + <ul class="dropdown-menu"> + + </ul> --> + </li> + </ul> +</div><!--/.nav-collapse --> + </div> + </div> + </div> + +</div> + + <div id="sidebar"> + <div id="sidebar-wrap"> + <h2>Twitter</h2> + <ul class="sidemenu"> + <li> +<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a> +<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script> +</li> + </ul> + <h2>Apache Software Foundation</h2> + <ul class="sidemenu"> + <li><a href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li> + <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li> + <li><a href="http://www.apache.org/dev/">Developer Resources</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + </ul> + <h2>Related Projects</h2> + <ul class="sidemenu"> + <li><a href="http://lucene.apache.org/">Apache Lucene</a></li> + <li><a href="http://hadoop.apache.org/">Apache Hadoop</a></li> + <li><a href="http://bigtop.apache.org/">Apache Bigtop</a></li> + <li><a href="http://spark.apache.org/">Apache Spark</a></li> + <li><a href="http://flink.apache.org/">Apache Flink</a></li> + </ul> + </div> +</div> + + <div id="content-wrap" class="clearfix"> + <div id="main"> + + <h1 id="distributed-stochastic-pca">Distributed Stochastic PCA</h1> + +<h2 id="intro">Intro</h2> + +<p>Mahout has a distributed implementation of Stochastic PCA<a href="Lyubimov and Palumbo, ["Apache Mahout: Beyond MapReduce; Distributed Algorithm Design"](https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785)">1</a>. This algorithm computes the exact equivalent of Mahoutâs dssvd(<code class="highlighter-rouge">\(\mathbf{A-1\mu^\top}\)</code>) by modifying the <code class="highlighter-rouge">dssvd</code> algorithm so as to avoid forming <code class="highlighter-rouge">\(\mathbf{A-1\mu^\top}\)</code>, which would densify a sparse input. Thus, it is suitable for work with both dense and sparse inputs.</p> + +<h2 id="algorithm">Algorithm</h2> + +<p>Given an <em>m</em> <code class="highlighter-rouge">\(\times\)</code> <em>n</em> matrix <code class="highlighter-rouge">\(\mathbf{A}\)</code>, a target rank <em>k</em>, and an oversampling parameter <em>p</em>, this procedure computes a <em>k</em>-rank PCA by finding the unknowns in <code class="highlighter-rouge">\(\mathbf{Aâ1\mu^\top \approx U\Sigma V^\top}\)</code>:</p> + +<ol> + <li>Create seed for random <em>n</em> <code class="highlighter-rouge">\(\times\)</code> <em>(k+p)</em> matrix <code class="highlighter-rouge">\(\Omega\)</code>.</li> + <li><code class="highlighter-rouge">\(\mathbf{s_\Omega \leftarrow \Omega^\top \mu}\)</code>.</li> + <li><code class="highlighter-rouge">\(\mathbf{Y_0 \leftarrow A\Omega â 1 {s_\Omega}^\top, Y \in \mathbb{R}^{m\times(k+p)}}\)</code>.</li> + <li>Column-orthonormalize <code class="highlighter-rouge">\(\mathbf{Y_0} \rightarrow \mathbf{Q}\)</code> by computing thin decomposition <code class="highlighter-rouge">\(\mathbf{Y_0} = \mathbf{QR}\)</code>. Also, <code class="highlighter-rouge">\(\mathbf{Q}\in\mathbb{R}^{m\times(k+p)}, \mathbf{R}\in\mathbb{R}^{(k+p)\times(k+p)}\)</code>.</li> + <li><code class="highlighter-rouge">\(\mathbf{s_Q \leftarrow Q^\top 1}\)</code>.</li> + <li><code class="highlighter-rouge">\(\mathbf{B_0 \leftarrow Q^\top A: B \in \mathbb{R}^{(k+p)\times n}}\)</code>.</li> + <li><code class="highlighter-rouge">\(\mathbf{s_B \leftarrow {B_0}^\top \mu}\)</code>.</li> + <li>For <em>i</em> in 1..<em>q</em> repeat (power iterations): + <ul> + <li>For <em>j</em> in 1..<em>n</em> apply <code class="highlighter-rouge">\(\mathbf{(B_{iâ1})_{âj} \leftarrow (B_{iâ1})_{âj}â\mu_j s_Q}\)</code>.</li> + <li><code class="highlighter-rouge">\(\mathbf{Y_i \leftarrow A{B_{iâ1}}^\topâ1(s_Bâ\mu^\top \mu s_Q)^\top}\)</code>.</li> + <li>Column-orthonormalize <code class="highlighter-rouge">\(\mathbf{Y_i} \rightarrow \mathbf{Q}\)</code> by computing thin decomposition <code class="highlighter-rouge">\(\mathbf{Y_i = QR}\)</code>.</li> + <li><code class="highlighter-rouge">\(\mathbf{s_Q \leftarrow Q^\top 1}\)</code>.</li> + <li><code class="highlighter-rouge">\(\mathbf{B_i \leftarrow Q^\top A}\)</code>.</li> + <li><code class="highlighter-rouge">\(\mathbf{s_B \leftarrow {B_i}^\top \mu}\)</code>.</li> + </ul> + </li> + <li>Let <code class="highlighter-rouge">\(\mathbf{C \triangleq s_Q {s_B}^\top}\)</code>. <code class="highlighter-rouge">\(\mathbf{M \leftarrow B_q {B_q}^\top â C â C^\top + \mu^\top \mu s_Q {s_Q}^\top}\)</code>.</li> + <li>Compute an eigensolution of the small symmetric <code class="highlighter-rouge">\(\mathbf{M = \hat{U} \Lambda \hat{U}^\top: M \in \mathbb{R}^{(k+p)\times(k+p)}}\)</code>.</li> + <li>The singular values <code class="highlighter-rouge">\(\Sigma = \Lambda^{\circ 0.5}\)</code>, or, in other words, <code class="highlighter-rouge">\(\mathbf{\sigma_i= \sqrt{\lambda_i}}\)</code>.</li> + <li>If needed, compute <code class="highlighter-rouge">\(\mathbf{U = Q\hat{U}}\)</code>.</li> + <li>If needed, compute <code class="highlighter-rouge">\(\mathbf{V = B^\top \hat{U} \Sigma^{â1}}\)</code>.</li> + <li>If needed, items converted to the PCA space can be computed as <code class="highlighter-rouge">\(\mathbf{U\Sigma}\)</code>.</li> +</ol> + +<h2 id="implementation">Implementation</h2> + +<p>Mahout <code class="highlighter-rouge">dspca(...)</code> is implemented in the mahout <code class="highlighter-rouge">math-scala</code> algebraic optimizer which translates Mahoutâs R-like linear algebra operators into a physical plan for both Spark and H2O distributed engines.</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>def dspca[K](drmA: DrmLike[K], k: Int, p: Int = 15, q: Int = 0): +(DrmLike[K], DrmLike[Int], Vector) = { + + // Some mapBlock() calls need it + implicit val ktag = drmA.keyClassTag + + val drmAcp = drmA.checkpoint() + implicit val ctx = drmAcp.context + + val m = drmAcp.nrow + val n = drmAcp.ncol + assert(k <= (m min n), "k cannot be greater than smaller of m, n.") + val pfxed = safeToNonNegInt((m min n) - k min p) + + // Actual decomposition rank + val r = k + pfxed + + // Dataset mean + val mu = drmAcp.colMeans + + val mtm = mu dot mu + + // We represent Omega by its seed. + val omegaSeed = RandomUtils.getRandom().nextInt() + val omega = Matrices.symmetricUniformView(n, r, omegaSeed) + + // This done in front in a single-threaded fashion for now. Even though it doesn't require any + // memory beyond that is required to keep xi around, it still might be parallelized to backs + // for significantly big n and r. TODO + val s_o = omega.t %*% mu + + val bcastS_o = drmBroadcast(s_o) + val bcastMu = drmBroadcast(mu) + + var drmY = drmAcp.mapBlock(ncol = r) { + case (keys, blockA) â + val s_o:Vector = bcastS_o + val blockY = blockA %*% Matrices.symmetricUniformView(n, r, omegaSeed) + for (row â 0 until blockY.nrow) blockY(row, ::) -= s_o + keys â blockY + } + // Checkpoint Y + .checkpoint() + + var drmQ = dqrThin(drmY, checkRankDeficiency = false)._1.checkpoint() + + var s_q = drmQ.colSums() + var bcastVarS_q = drmBroadcast(s_q) + + // This actually should be optimized as identically partitioned map-side A'B since A and Q should + // still be identically partitioned. + var drmBt = (drmAcp.t %*% drmQ).checkpoint() + + var s_b = (drmBt.t %*% mu).collect(::, 0) + var bcastVarS_b = drmBroadcast(s_b) + + for (i â 0 until q) { + + // These closures don't seem to live well with outside-scope vars. This doesn't record closure + // attributes correctly. So we create additional set of vals for broadcast vars to properly + // create readonly closure attributes in this very scope. + val bcastS_q = bcastVarS_q + val bcastMuInner = bcastMu + + // Fix Bt as B' -= xi cross s_q + drmBt = drmBt.mapBlock() { + case (keys, block) â + val s_q: Vector = bcastS_q + val mu: Vector = bcastMuInner + keys.zipWithIndex.foreach { + case (key, idx) â block(idx, ::) -= s_q * mu(key) + } + keys â block + } + + drmY.uncache() + drmQ.uncache() + + val bCastSt_b = drmBroadcast(s_b -=: mtm * s_q) + + drmY = (drmAcp %*% drmBt) + // Fix Y by subtracting st_b from each row of the AB' + .mapBlock() { + case (keys, block) â + val st_b: Vector = bCastSt_b + block := { (_, c, v) â v - st_b(c) } + keys â block + } + // Checkpoint Y + .checkpoint() + + drmQ = dqrThin(drmY, checkRankDeficiency = false)._1.checkpoint() + + s_q = drmQ.colSums() + bcastVarS_q = drmBroadcast(s_q) + + // This on the other hand should be inner-join-and-map A'B optimization since A and Q_i are not + // identically partitioned anymore. + drmBt = (drmAcp.t %*% drmQ).checkpoint() + + s_b = (drmBt.t %*% mu).collect(::, 0) + bcastVarS_b = drmBroadcast(s_b) + } + + val c = s_q cross s_b + val inCoreBBt = (drmBt.t %*% drmBt).checkpoint(CacheHint.NONE).collect -=: + c -=: c.t +=: mtm *=: (s_q cross s_q) + val (inCoreUHat, d) = eigen(inCoreBBt) + val s = d.sqrt + + // Since neither drmU nor drmV are actually computed until actually used, we don't need the flags + // instructing compute (or not compute) either of the U,V outputs anymore. Neat, isn't it? + val drmU = drmQ %*% inCoreUHat + val drmV = drmBt %*% (inCoreUHat %*% diagv(1 / s)) + + (drmU(::, 0 until k), drmV(::, 0 until k), s(0 until k)) +} +</code></pre> +</div> + +<h2 id="usage">Usage</h2> + +<p>The scala <code class="highlighter-rouge">dspca(...)</code> method can easily be called in any Spark, Flink, or H2O application built with the <code class="highlighter-rouge">math-scala</code> library and the corresponding <code class="highlighter-rouge">Spark</code>, <code class="highlighter-rouge">Flink</code>, or <code class="highlighter-rouge">H2O</code> engine module as follows:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>import org.apache.mahout.math._ +import decompositions._ +import drm._ + +val (drmU, drmV, s) = dspca(drmA, k=200, q=1) +</code></pre> +</div> + +<p>Note the parameter is optional and its default value is zero.</p> + +<h2 id="references">References</h2> + + + </div> + </div> +</div> + <footer class="footer" align="center"> + <div class="container"> + <p> + Copyright © 2014-2016 The Apache Software Foundation, Licensed under + the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br /> + Apache Mahout, Mahout, Apache, the Apache feather logo, and the elephant rider logo are either registered trademarks or trademarks of <a href="http://www.apache.org/foundation/marks/">The Apache Software Foundation</a> in the United States and other countries. + </p> + </div> + </footer> + + <script src="/assets/themes/mahout-retro/js/jquery-1.9.1.min.js"></script> + <script src="/assets/themes/mahout-retro/js/bootstrap.min.js"></script> + <script> + (function() { + var cx = '012254517474945470291:vhsfv7eokdc'; + var gcse = document.createElement('script'); + gcse.type = 'text/javascript'; + gcse.async = true; + gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') + + '//www.google.com/cse/cse.js?cx=' + cx; + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(gcse, s); + })(); + </script> +</body> +</html> +
http://git-wip-us.apache.org/repos/asf/mahout/blob/0e718ec9/website/oldsite/_site/users/algorithms/d-ssvd.html ---------------------------------------------------------------------- diff --git a/website/oldsite/_site/users/algorithms/d-ssvd.html b/website/oldsite/_site/users/algorithms/d-ssvd.html new file mode 100644 index 0000000..e0daa43 --- /dev/null +++ b/website/oldsite/_site/users/algorithms/d-ssvd.html @@ -0,0 +1,443 @@ + + +<!DOCTYPE html> +<!-- + + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> + <title>Apache Mahout: Scalable machine learning and data mining</title> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + <meta name="Distribution" content="Global"> + <meta name="Robots" content="index,follow"> + <meta name="keywords" content="apache, apache hadoop, apache lucene, + business data mining, cluster analysis, + collaborative filtering, data extraction, data filtering, data framework, data integration, + data matching, data mining, data mining algorithms, data mining analysis, data mining data, + data mining introduction, data mining software, + data mining techniques, data representation, data set, datamining, + feature extraction, fuzzy k means, genetic algorithm, hadoop, + hierarchical clustering, high dimensional, introduction to data mining, kmeans, + knowledge discovery, learning approach, learning approaches, learning methods, + learning techniques, lucene, machine learning, machine translation, mahout apache, + mahout taste, map reduce hadoop, mining data, mining methods, naive bayes, + natural language processing, + supervised, text mining, time series data, unsupervised, web data mining"> + <link rel="shortcut icon" type="image/x-icon" href="https://mahout.apache.org/images/favicon.ico"> + <!--<script type="text/javascript" src="/js/prototype.js"></script>--> + <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/prototype/1.7.2.0/prototype.js"></script> + <script type="text/javascript" src="/assets/themes/mahout-retro/js/effects.js"></script> + <script type="text/javascript" src="/assets/themes/mahout-retro/js/search.js"></script> + <script type="text/javascript" src="/assets/themes/mahout-retro/js/slides.js"></script> + + <link href="/assets/themes/mahout-retro/css/bootstrap.min.css" rel="stylesheet" media="screen"> + <link href="/assets/themes/mahout-retro/css/bootstrap-responsive.css" rel="stylesheet"> + <link rel="stylesheet" href="/assets/themes/mahout-retro/css/global.css" type="text/css"> + + <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown --> + <script type="text/x-mathjax-config"> + MathJax.Hub.Config({ + tex2jax: { + skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] + } + }); + MathJax.Hub.Queue(function() { + var all = MathJax.Hub.getAllJax(), i; + for(i = 0; i < all.length; i += 1) { + all[i].SourceElement().parentNode.className += ' has-jax'; + } + }); + </script> + <script type="text/javascript"> + var mathjax = document.createElement('script'); + mathjax.type = 'text/javascript'; + mathjax.async = true; + + mathjax.src = ('https:' == document.location.protocol) ? + 'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' : + 'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; + + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(mathjax, s); + </script> +</head> + +<body id="home" data-twttr-rendered="true"> + <div id="wrap"> + <div id="header"> + <div id="logo"><a href="/"><img src="/assets/img/mahout-logo-brudman.png" alt="Logos for Mahout and Apache Software Foundation" /></a></div> + <div id="search"> + <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right"> + <input value="http://mahout.apache.org" name="sitesearch" type="hidden"> + <input class="search-query" name="q" id="query" type="text"> + <input id="submission" type="image" src="/assets/img/mahout-lupe.png" alt="Search" /> + </form> + </div> + + <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;"> + <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <!-- <a class="brand" href="#">Apache Community Development Project</a> --> + <!--<div class="nav-collapse collapse">--> +<div class="collapse navbar-collapse" id="main-navbar"> + <ul class="nav navbar-nav"> + <!-- <li><a href="/">Home</a></li> --> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">General<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/general/downloads.html">Downloads</a> + <li><a href="/general/who-we-are.html">Who we are</a> + <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a> + <li><a href="/general/release-notes.html">Release Notes</a> + <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li> + <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a> + <li><a href="/general/professional-support.html">Professional Support</a> + <li class="divider"></li> + <li class="nav-header">Resources</li> + <li><a href="/general/reference-reading.html">Reference Reading</a> + <li><a href="/general/faq.html">FAQ</a> + <li class="divider"></li> + <li class="nav-header">Legal</li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + <li><a href="/general/privacy-policy.html">Privacy Policy</a> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/developers/developer-resources.html">Developer resources</a></li> + <li><a href="/developers/version-control.html">Version control</a></li> + <li><a href="/developers/buildingmahout.html">Build from source</a></li> + <li><a href="/developers/issue-tracker.html">Issue tracker</a></li> + <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li> + <li class="divider"></li> + <li class="nav-header">Contributions</li> + <li><a href="/developers/how-to-contribute.html">How to contribute</a></li> + <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li> + <li><a href="/developers/gsoc.html">GSoC</a></li> + <li class="divider"></li> + <li class="nav-header">For committers</li> + <li><a href="/developers/how-to-update-the-website.html">How to update the website</a></li> + <li><a href="/developers/patch-check-list.html">Patch check list</a></li> + <li><a href="/developers/github.html">Handling Github PRs</a></li> + <li><a href="/developers/how-to-release.html">How to release</a></li> + <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Mahout-Samsara<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/sparkbindings/home.html">Scala & Spark Bindings Overview</a></li> + <li><a href="/users/sparkbindings/faq.html">FAQ</a></li> + <li><a href="/users/flinkbindings/playing-with-samsara-flink.html">Flink Bindings Overview</a></li> + <li class="nav-header">Engines</li> + <li><a href="/users/sparkbindings/home.html">Spark</a></li> + <li><a href="/users/environment/h2o-internals.html">H2O</a></li> + <li><a href="/users/flinkbindings/flink-internals.html">Flink</a></li> + <li class="nav-header">References</li> + <li><a href="/users/environment/in-core-reference.html">In-Core Algebraic DSL Reference</a></li> + <li><a href="/users/environment/out-of-core-reference.html">Distributed Algebraic DSL Reference</a></li> + <li class="nav-header">Tutorials</li> + <li><a href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark Shell</a></li> + <li><a href="/users/environment/how-to-build-an-app.html">How to build an app</a></li> + <li><a href="/users/environment/classify-a-doc-from-the-shell.html">Building a text classifier in Mahout's Spark Shell</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Algorithms<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/basics/algorithms.html">List of algorithms</a> + <li class="nav-header">Distributed Matrix Decomposition</li> + <li><a href="/users/algorithms/d-qr.html">Cholesky QR</a></li> + <li><a href="/users/algorithms/d-ssvd.html">SSVD</a></li> + <li><a href="/users/algorithms/d-als.html">Distributed ALS</a></li> + <li><a href="/users/algorithms/d-spca.html">SPCA</a></li> + <li class="nav-header">Recommendations</li> + <li><a href="/users/algorithms/recommender-overview.html">Recommender Overview</a></li> + <li><a href="/users/algorithms/intro-cooccurrence-spark.html">Intro to cooccurrence-based<br/> recommendations with Spark</a></li> + <li class="nav-header">Classification</li> + <li><a href="/users/algorithms/spark-naive-bayes.html">Spark Naive Bayes</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">MapReduce Basics<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/basics/algorithms.html">List of algorithms</a> + <li><a href="/users/basics/quickstart.html">Overview</a> + <li class="divider"></li> + <li class="nav-header">Working with text</li> + <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a> + <li><a href="/users/basics/collocations.html">Collocations</a> + <li class="divider"></li> + <li class="nav-header">Dimensionality reduction</li> + <li><a href="/users/dim-reduction/dimensional-reduction.html">Singular Value Decomposition</a></li> + <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li> + <li class="divider"></li> + <li class="nav-header">Topic Models</li> + <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Mahout MapReduce<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li class="nav-header">Classification</li> + <li><a href="/users/classification/bayesian.html">Naive Bayes</a></li> + <li><a href="/users/classification/hidden-markov-models.html">Hidden Markov Models</a></li> + <li><a href="/users/classification/logistic-regression.html">Logistic Regression (Single Machine)</a></li> + <li><a href="/users/classification/partial-implementation.html">Random Forest</a></li> + <li class="nav-header">Classification Examples</li> + <li><a href="/users/classification/breiman-example.html">Breiman example</a></li> + <li><a href="/users/classification/twenty-newsgroups.html">20 newsgroups example</a></li> + <li><a href="/users/classification/bankmarketing-example.html">SGD classifier bank marketing</a></li> + <li><a href="/users/classification/wikipedia-classifier-example.html">Wikipedia XML parser and classifier</a></li> + <li class="nav-header">Clustering</li> + <li><a href="/users/clustering/k-means-clustering.html">k-Means</a></li> + <li><a href="/users/clustering/canopy-clustering.html">Canopy</a></li> + <li><a href="/users/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li> + <li><a href="/users/clustering/streaming-k-means.html">Streaming KMeans</a></li> + <li><a href="/users/clustering/spectral-clustering.html">Spectral Clustering</a></li> + <li class="nav-header">Clustering Commandline usage</li> + <li><a href="/users/clustering/k-means-commandline.html">Options for k-Means</a></li> + <li><a href="/users/clustering/canopy-commandline.html">Options for Canopy</a></li> + <li><a href="/users/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li> + <li class="nav-header">Clustering Examples</li> + <li><a href="/users/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li> + <li class="nav-header">Cluster Post processing</li> + <li><a href="/users/clustering/cluster-dumper.html">Cluster Dumper tool</a></li> + <li><a href="/users/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li> + <li class="nav-header">Recommendations</li> + <li><a href="/users/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li> + <li><a href="/users/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 minutes</a></li> + <li><a href="/users/recommender/matrix-factorization.html">Matrix factorization-based<br/> recommenders</a></li> + <li><a href="/users/recommender/recommender-documentation.html">Overview</a></li> + <li><a href="/users/recommender/intro-itembased-hadoop.html">Intro to item-based recommendations<br/> with Hadoop</a></li> + <li><a href="/users/recommender/intro-als-hadoop.html">Intro to ALS recommendations<br/> with Hadoop</a></li> + </ul> + </li> + <!-- <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a> + <ul class="dropdown-menu"> + + </ul> --> + </li> + </ul> +</div><!--/.nav-collapse --> + </div> + </div> + </div> + +</div> + + <div id="sidebar"> + <div id="sidebar-wrap"> + <h2>Twitter</h2> + <ul class="sidemenu"> + <li> +<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a> +<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script> +</li> + </ul> + <h2>Apache Software Foundation</h2> + <ul class="sidemenu"> + <li><a href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li> + <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li> + <li><a href="http://www.apache.org/dev/">Developer Resources</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + </ul> + <h2>Related Projects</h2> + <ul class="sidemenu"> + <li><a href="http://lucene.apache.org/">Apache Lucene</a></li> + <li><a href="http://hadoop.apache.org/">Apache Hadoop</a></li> + <li><a href="http://bigtop.apache.org/">Apache Bigtop</a></li> + <li><a href="http://spark.apache.org/">Apache Spark</a></li> + <li><a href="http://flink.apache.org/">Apache Flink</a></li> + </ul> + </div> +</div> + + <div id="content-wrap" class="clearfix"> + <div id="main"> + + <h1 id="distributed-stochastic-singular-value-decomposition">Distributed Stochastic Singular Value Decomposition</h1> + +<h2 id="intro">Intro</h2> + +<p>Mahout has a distributed implementation of Stochastic Singular Value Decomposition <a href="[Mahout Scala and Mahout Spark Bindings for Linear Algebra Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf)">1</a> using the parallelization strategy comprehensively defined in Nathan Halkoâs dissertation <a href="http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf">âRandomized methods for computing low-rank approximations of matricesâ</a> <a href="[Halko, Martinsson, Tropp](http://arxiv.org/abs/0909.4061)">2</a>.</p> + +<h2 id="modified-ssvd-algorithm">Modified SSVD Algorithm</h2> + +<p>Given an <code class="highlighter-rouge">\(m\times n\)</code> +matrix <code class="highlighter-rouge">\(\mathbf{A}\)</code>, a target rank <code class="highlighter-rouge">\(k\in\mathbb{N}_{1}\)</code> +, an oversampling parameter <code class="highlighter-rouge">\(p\in\mathbb{N}_{1}\)</code>, +and the number of additional power iterations <code class="highlighter-rouge">\(q\in\mathbb{N}_{0}\)</code>, +this procedure computes an <code class="highlighter-rouge">\(m\times\left(k+p\right)\)</code> +SVD <code class="highlighter-rouge">\(\mathbf{A\approx U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\)</code>:</p> + +<ol> + <li> + <p>Create seed for random <code class="highlighter-rouge">\(n\times\left(k+p\right)\)</code> + matrix <code class="highlighter-rouge">\(\boldsymbol{\Omega}\)</code>. The seed defines matrix <code class="highlighter-rouge">\(\mathbf{\Omega}\)</code> + using Gaussian unit vectors per one of suggestions in [Halko, Martinsson, Tropp].</p> + </li> + <li> + <p><code class="highlighter-rouge">\(\mathbf{Y=A\boldsymbol{\Omega}},\,\mathbf{Y}\in\mathbb{R}^{m\times\left(k+p\right)}\)</code></p> + </li> + <li> + <p>Column-orthonormalize <code class="highlighter-rouge">\(\mathbf{Y}\rightarrow\mathbf{Q}\)</code> + by computing thin decomposition <code class="highlighter-rouge">\(\mathbf{Y}=\mathbf{Q}\mathbf{R}\)</code>. + Also, <code class="highlighter-rouge">\(\mathbf{Q}\in\mathbb{R}^{m\times\left(k+p\right)},\,\mathbf{R}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\)</code>; denoted as <code class="highlighter-rouge">\(\mathbf{Q}=\mbox{qr}\left(\mathbf{Y}\right).\mathbf{Q}\)</code></p> + </li> + <li> + <p><code class="highlighter-rouge">\(\mathbf{B}_{0}=\mathbf{Q}^{\top}\mathbf{A}:\,\,\mathbf{B}\in\mathbb{R}^{\left(k+p\right)\times n}\)</code>.</p> + </li> + <li> + <p>If <code class="highlighter-rouge">\(q>0\)</code> + repeat: for <code class="highlighter-rouge">\(i=1..q\)</code>: + <code class="highlighter-rouge">\(\mathbf{B}_{i}^{\top}=\mathbf{A}^{\top}\mbox{qr}\left(\mathbf{A}\mathbf{B}_{i-1}^{\top}\right).\mathbf{Q}\)</code> + (power iterations step).</p> + </li> + <li> + <p>Compute Eigensolution of a small Hermitian <code class="highlighter-rouge">\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}=\mathbf{\hat{U}}\boldsymbol{\Lambda}\mathbf{\hat{U}}^{\top}\)</code>, + <code class="highlighter-rouge">\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\)</code>.</p> + </li> + <li> + <p>Singular values <code class="highlighter-rouge">\(\mathbf{\boldsymbol{\Sigma}}=\boldsymbol{\Lambda}^{0.5}\)</code>, + or, in other words, <code class="highlighter-rouge">\(s_{i}=\sqrt{\sigma_{i}}\)</code>.</p> + </li> + <li> + <p>If needed, compute <code class="highlighter-rouge">\(\mathbf{U}=\mathbf{Q}\hat{\mathbf{U}}\)</code>.</p> + </li> + <li> + <p>If needed, compute <code class="highlighter-rouge">\(\mathbf{V}=\mathbf{B}_{q}^{\top}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1}\)</code>. +Another way is <code class="highlighter-rouge">\(\mathbf{V}=\mathbf{A}^{\top}\mathbf{U}\boldsymbol{\Sigma}^{-1}\)</code>.</p> + </li> +</ol> + +<h2 id="implementation">Implementation</h2> + +<p>Mahout <code class="highlighter-rouge">dssvd(...)</code> is implemented in the mahout <code class="highlighter-rouge">math-scala</code> algebraic optimizer which translates Mahoutâs R-like linear algebra operators into a physical plan for both Spark and H2O distributed engines.</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>def dssvd[K: ClassTag](drmA: DrmLike[K], k: Int, p: Int = 15, q: Int = 0): + (DrmLike[K], DrmLike[Int], Vector) = { + + val drmAcp = drmA.checkpoint() + + val m = drmAcp.nrow + val n = drmAcp.ncol + assert(k <= (m min n), "k cannot be greater than smaller of m, n.") + val pfxed = safeToNonNegInt((m min n) - k min p) + + // Actual decomposition rank + val r = k + pfxed + + // We represent Omega by its seed. + val omegaSeed = RandomUtils.getRandom().nextInt() + + // Compute Y = A*Omega. + var drmY = drmAcp.mapBlock(ncol = r) { + case (keys, blockA) => + val blockY = blockA %*% Matrices.symmetricUniformView(n, r, omegaSeed) + keys -> blockY + } + + var drmQ = dqrThin(drmY.checkpoint())._1 + + // Checkpoint Q if last iteration + if (q == 0) drmQ = drmQ.checkpoint() + + var drmBt = drmAcp.t %*% drmQ + + // Checkpoint B' if last iteration + if (q == 0) drmBt = drmBt.checkpoint() + + for (i <- 0 until q) { + drmY = drmAcp %*% drmBt + drmQ = dqrThin(drmY.checkpoint())._1 + + // Checkpoint Q if last iteration + if (i == q - 1) drmQ = drmQ.checkpoint() + + drmBt = drmAcp.t %*% drmQ + + // Checkpoint B' if last iteration + if (i == q - 1) drmBt = drmBt.checkpoint() + } + + val (inCoreUHat, d) = eigen(drmBt.t %*% drmBt) + val s = d.sqrt + + // Since neither drmU nor drmV are actually computed until actually used + // we don't need the flags instructing compute (or not compute) either of the U,V outputs + val drmU = drmQ %*% inCoreUHat + val drmV = drmBt %*% (inCoreUHat %*%: diagv(1 /: s)) + + (drmU(::, 0 until k), drmV(::, 0 until k), s(0 until k)) +} +</code></pre> +</div> + +<p>Note: As a side effect of checkpointing, U and V values are returned as logical operators (i.e. they are neither checkpointed nor computed). Therefore there is no physical work actually done to compute <code class="highlighter-rouge">\(\mathbf{U}\)</code> or <code class="highlighter-rouge">\(\mathbf{V}\)</code> until they are used in a subsequent expression.</p> + +<h2 id="usage">Usage</h2> + +<p>The scala <code class="highlighter-rouge">dssvd(...)</code> method can easily be called in any Spark or H2O application built with the <code class="highlighter-rouge">math-scala</code> library and the corresponding <code class="highlighter-rouge">Spark</code> or <code class="highlighter-rouge">H2O</code> engine module as follows:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>import org.apache.mahout.math._ +import decompositions._ +import drm._ + + +val(drmU, drmV, s) = dssvd(drma, k = 40, q = 1) +</code></pre> +</div> + +<h2 id="references">References</h2> + +<p>approximations of matrices](http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf)</p> + + + </div> + </div> +</div> + <footer class="footer" align="center"> + <div class="container"> + <p> + Copyright © 2014-2016 The Apache Software Foundation, Licensed under + the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br /> + Apache Mahout, Mahout, Apache, the Apache feather logo, and the elephant rider logo are either registered trademarks or trademarks of <a href="http://www.apache.org/foundation/marks/">The Apache Software Foundation</a> in the United States and other countries. + </p> + </div> + </footer> + + <script src="/assets/themes/mahout-retro/js/jquery-1.9.1.min.js"></script> + <script src="/assets/themes/mahout-retro/js/bootstrap.min.js"></script> + <script> + (function() { + var cx = '012254517474945470291:vhsfv7eokdc'; + var gcse = document.createElement('script'); + gcse.type = 'text/javascript'; + gcse.async = true; + gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') + + '//www.google.com/cse/cse.js?cx=' + cx; + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(gcse, s); + })(); + </script> +</body> +</html> +
