Added: websites/staging/mahout/trunk/content/users/mapreduce/clustering/k-means-clustering.html ============================================================================== --- websites/staging/mahout/trunk/content/users/mapreduce/clustering/k-means-clustering.html (added) +++ websites/staging/mahout/trunk/content/users/mapreduce/clustering/k-means-clustering.html Thu Mar 19 21:21:45 2015 @@ -0,0 +1,425 @@ +<!DOCTYPE html> +<!-- + + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+--> + +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> + <title>Apache Mahout: Scalable machine learning and data mining</title> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + <meta name="Distribution" content="Global"> + <meta name="Robots" content="index,follow"> + <meta name="keywords" content="apache, apache hadoop, apache lucene, + business data mining, cluster analysis, + collaborative filtering, data extraction, data filtering, data framework, data integration, + data matching, data mining, data mining algorithms, data mining analysis, data mining data, + data mining introduction, data mining software, + data mining techniques, data representation, data set, datamining, + feature extraction, fuzzy k means, genetic algorithm, hadoop, + hierarchical clustering, high dimensional, introduction to data mining, kmeans, + knowledge discovery, learning approach, learning approaches, learning methods, + learning techniques, lucene, machine learning, machine translation, mahout apache, + mahout taste, map reduce hadoop, mining data, mining methods, naive bayes, + natural language processing, + supervised, text mining, time series data, unsupervised, web data mining"> + <link rel="shortcut icon" type="image/x-icon" href="http://mahout.apache.org/images/favicon.ico"> + <script type="text/javascript" src="/js/prototype.js"></script> + <script type="text/javascript" src="/js/effects.js"></script> + <script type="text/javascript" src="/js/search.js"></script> + <script type="text/javascript" src="/js/slides.js"></script> + + <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen"> + <link href="/css/bootstrap-responsive.css" rel="stylesheet"> + <link rel="stylesheet" href="/css/global.css" type="text/css"> + + <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown --> + <script type="text/x-mathjax-config"> + MathJax.Hub.Config({ 
+ tex2jax: { + skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] + } + }); + MathJax.Hub.Queue(function() { + var all = MathJax.Hub.getAllJax(), i; + for(i = 0; i < all.length; i += 1) { + all[i].SourceElement().parentNode.className += ' has-jax'; + } + }); + </script> + <script type="text/javascript"> + var mathjax = document.createElement('script'); + mathjax.type = 'text/javascript'; + mathjax.async = true; + + mathjax.src = ('https:' == document.location.protocol) ? + 'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' : + 'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; + + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(mathjax, s); + </script> +</head> + +<body id="home" data-twttr-rendered="true"> + <div id="wrap"> + <div id="header"> + <div id="logo"><a href="/overview.html"></a></div> + <div id="search"> + <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right"> + <input value="http://mahout.apache.org" name="sitesearch" type="hidden"> + <input class="search-query" name="q" id="query" type="text"> + <input id="submission" type="image" src="/images/mahout-lupe.png" alt="Search" /> + </form> + </div> + + <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;"> + <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <!-- <a class="brand" href="#">Apache Community Development Project</a> --> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="/">Home</a></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" 
data-toggle="dropdown">General<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/general/downloads.html">Downloads</a> + <li><a href="/general/who-we-are.html">Who we are</a> + <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a> + <li><a href="/general/release-notes.html">Release Notes</a> + <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li> + <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a> + <li><a href="/general/professional-support.html">Professional Support</a> + <li class="divider"></li> + <li class="nav-header">Resources</li> + <li><a href="/general/reference-reading.html">Reference Reading</a> + <li><a href="/general/faq.html">FAQ</a> + <li class="divider"></li> + <li class="nav-header">Legal</li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + <li><a href="/general/privacy-policy.html">Privacy Policy</a> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/developers/developer-resources.html">Developer resources</a></li> + <li><a href="/developers/version-control.html">Version control</a></li> + <li><a href="/developers/buildingmahout.html">Build from source</a></li> + <li><a href="/developers/issue-tracker.html">Issue tracker</a></li> + <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li> + <li class="divider"></li> + <li class="nav-header">Contributions</li> + <li><a href="/developers/how-to-contribute.html">How to contribute</a></li> + <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li> + <li><a href="/developers/gsoc.html">GSoC</a></li> + <li class="divider"></li> + <li class="nav-header">For committers</li> + <li><a 
href="/developers/how-to-update-the-website.html">How to update the website</a></li> + <li><a href="/developers/patch-check-list.html">Patch check list</a></li> + <li><a href="/developers/github.html">Handling Github PRs</a></li> + <li><a href="/developers/how-to-release.html">How to release</a></li> + <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Basics<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/basics/algorithms.html">List of algorithms</a> + <li><a href="/users/basics/quickstart.html">Quickstart</a> + <li class="divider"></li> + <li class="nav-header">Working with text</li> + <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a> + <li><a href="/users/basics/collocations.html">Collocations</a> + <li class="divider"></li> + <li class="nav-header">Dimensionality reduction</li> + <li><a href="/users/dim-reduction/dimensional-reduction.html">Singular Value Decomposition</a></li> + <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li> + <li class="divider"></li> + <li class="nav-header">Topic Models</li> + <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Spark<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/sparkbindings/home.html">Scala & Spark Bindings Overview</a></li> + <li><a href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark Shell</a></li> + <li class="divider"></li> + <li><a href="/users/sparkbindings/faq.html">FAQ</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Classification<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a 
href="/users/mapreduce/classification/bayesian.html">Naive Bayes</a></li> + <li><a href="/users/mapreduce/classification/hidden-markov-models.html">Hidden Markov Models</a></li> + <li><a href="/users/mapreduce/classification/logistic-regression.html">Logistic Regression</a></li> + <li><a href="/users/mapreduce/classification/partial-implementation.html">Random Forest</a></li> + + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/classification/breiman-example.html">Breiman example</a></li> + <li><a href="/users/mapreduce/classification/twenty-newsgroups.html">20 newsgroups example</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Clustering<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/clustering/k-means-clustering.html">k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-clustering.html">Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li> + <li><a href="/users/mapreduce/clustering/streaming-k-means.html">Streaming KMeans</a></li> + <li><a href="/users/mapreduce/clustering/spectral-clustering.html">Spectral Clustering</a></li> + <li class="divider"></li> + <li class="nav-header">Commandline usage</li> + <li><a href="/users/mapreduce/clustering/k-means-commandline.html">Options for k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-commandline.html">Options for Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li> + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li> + <li class="divider"></li> + <li class="nav-header">Post processing</li> + <li><a href="/users/mapreduce/clustering/cluster-dumper.html">Cluster Dumper tool</a></li> + <li><a 
href="/users/mapreduce/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/recommender/quickstart.html">Quickstart</a></li> + <li><a href="/users/mapreduce/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li> + <li><a href="/users/mapreduce/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 minutes</a></li> + <li><a href="/users/mapreduce/recommender/matrix-factorization.html">Matrix factorization-based<br/> recommenders</a></li> + <li><a href="/users/mapreduce/recommender/recommender-documentation.html">Overview</a></li> + <li class="divider"></li> + <li class="nav-header">Hadoop</li> + <li><a href="/users/mapreduce/recommender/intro-itembased-hadoop.html">Intro to item-based recommendations<br/> with Hadoop</a></li> + <li><a href="/users/mapreduce/recommender/intro-als-hadoop.html">Intro to ALS recommendations<br/> with Hadoop</a></li> + <li class="nav-header">Spark</li> + <li><a href="/users/mapreduce/recommender/intro-cooccurrence-spark.html">Intro to cooccurrence-based<br/> recommendations with Spark</a></li> + </ul> + </li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + +</div> + + <div id="sidebar"> + <div id="sidebar-wrap"> + <h2>Twitter</h2> + <ul class="sidemenu"> + <li> +<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a> +<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script> +</li> + </ul> + <h2>Apache Software Foundation</h2> + <ul class="sidemenu"> + <li><a 
href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li> + <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li> + <li><a href="http://www.apache.org/dev/">Developer Resources</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + </ul> + <h2>Related Projects</h2> + <ul class="sidemenu"> + <li><a href="http://lucene.apache.org/">Lucene</a></li> + <li><a href="http://hadoop.apache.org/">Hadoop</a></li> + </ul> + </div> +</div> + + <div id="content-wrap" class="clearfix"> + <div id="main"> + <h1 id="k-means-clustering-basics">k-Means clustering - basics</h1> +<p><a href="http://en.wikipedia.org/wiki/Kmeans">k-Means</a> is a simple but well-known algorithm for grouping objects (clustering). All objects need to be represented +as a set of numerical features. In addition, the user has to specify the +number of groups (referred to as <em>k</em>) she wishes to identify.</p> +<p>Each object can be thought of as a feature vector +in an <em>n</em>-dimensional space, <em>n</em> being the number of features used to +describe the objects to cluster. The algorithm then randomly chooses <em>k</em> +points in that vector space; these points serve as the initial centers of +the clusters. Afterwards, each object is assigned to the center it +is closest to. The distance measure is usually chosen by the user and +determined by the learning task.</p> +<p>After that, for each cluster a new center is computed by averaging the +feature vectors of all objects assigned to it. The process of assigning +objects and recomputing centers is repeated until it converges. 
+The algorithm can be proven to converge after a finite number of +iterations.</p> +<p>Several variations concerning the distance measure, the choice of initial centers and the +computation of new centers have been explored, as well as ways to +estimate the number of clusters <em>k</em>. The main principle, however, always +remains the same.</p> +<p><a name="K-MeansClustering-Quickstart"></a></p> +<h2 id="quickstart">Quickstart</h2> +<p><a href="https://github.com/apache/mahout/blob/master/examples/bin/cluster-reuters.sh">Here</a> + is a short shell script that will get you started quickly with +k-means. It does the following:</p> +<ul> +<li>Accepts clustering type: <em>kmeans</em>, <em>fuzzykmeans</em>, <em>lda</em>, or <em>streamingkmeans</em></li> +<li>Gets the Reuters dataset</li> +<li>Runs org.apache.lucene.benchmark.utils.ExtractReuters to generate +reuters-out from reuters-sgm (the downloaded archive)</li> +<li>Runs seqdirectory to convert reuters-out to SequenceFile format</li> +<li>Runs seq2sparse to convert SequenceFiles to sparse vector format</li> +<li>Runs k-means with 20 clusters</li> +<li>Runs clusterdump to show results</li> +</ul> +<p>After watching the output scroll past, reading the script itself will +give you a better understanding.</p> +<p><a name="K-MeansClustering-Designofimplementation"></a></p> +<h2 id="implementation">Implementation</h2> +<p>The implementation accepts two input directories: one for the data points +and one for the initial clusters. The data directory contains multiple +input files of SequenceFile(Key, VectorWritable), while the clusters +directory contains one or more SequenceFiles(Text, Cluster) +containing <em>k</em> initial clusters or canopies. 
Neither input directory +is modified by the implementation, which allows experimentation with different initial +clusterings and convergence values.</p> +<p>Canopy clustering can be used to compute the initial clusters for k-Means:</p> +<div class="codehilite"><pre><span class="c1">// run the CanopyDriver job</span> +<span class="n">CanopyDriver</span><span class="p">.</span><span class="n">runJob</span><span class="p">(</span><span class="s">"testdata"</span><span class="p">,</span> <span class="s">"output"</span><span class="p">,</span> +<span class="n">ManhattanDistanceMeasure</span><span class="p">.</span><span class="k">class</span><span class="p">.</span><span class="n">getName</span><span class="p">(),</span> <span class="p">(</span><span class="n">float</span><span class="p">)</span> <span class="mf">3.1</span><span class="p">,</span> <span class="p">(</span><span class="n">float</span><span class="p">)</span> <span class="mf">2.1</span><span class="p">,</span> <span class="n">false</span><span class="p">);</span> + +<span class="c1">// now run the KMeansDriver job</span> +<span class="n">KMeansDriver</span><span class="p">.</span><span class="n">runJob</span><span class="p">(</span><span class="s">"testdata"</span><span class="p">,</span> <span class="s">"output/clusters-0"</span><span class="p">,</span> <span class="s">"output"</span><span class="p">,</span> +<span class="n">EuclideanDistanceMeasure</span><span class="p">.</span><span class="k">class</span><span class="p">.</span><span class="n">getName</span><span class="p">(),</span> <span class="s">"0.001"</span><span class="p">,</span> <span class="s">"10"</span><span class="p">,</span> <span class="n">true</span><span class="p">);</span> +</pre></div> + + +<p>In the above example, the input data points are stored in 'testdata' and +the CanopyDriver is configured to output to the 'output/clusters-0' +directory. Once the driver has executed, that directory will contain the canopy definition +files. 
After the KMeansDriver runs, the output directory will have two or +more new directories: 'clusters-N', containing the clusters for each +iteration, and 'clusteredPoints', containing the clustered data points.</p> +<p>This diagram shows the dataflow of the k-Means example +implementation provided by Mahout: +<img src="../../images/Example implementation of k-Means provided with Mahout.png"></p> +<p><a name="K-MeansClustering-Runningk-MeansClustering"></a></p> +<h2 id="running-k-means-clustering">Running k-Means Clustering</h2> +<p>The k-Means clustering algorithm may be run using a command-line invocation +of KMeansDriver.main or by making a Java call to KMeansDriver.runJob().</p> +<p>Invocation using the command line takes the form:</p> +<div class="codehilite"><pre><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">kmeans</span> <span class="o">\</span> + <span class="o">-</span><span class="nb">i</span> <span class="o"><</span><span class="n">input</span> <span class="n">vectors</span> <span class="n">directory</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">c</span> <span class="o"><</span><span class="n">input</span> <span class="n">clusters</span> <span class="n">directory</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">o</span> <span class="o"><</span><span class="n">output</span> <span class="n">working</span> <span class="n">directory</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">k</span> <span class="o"><</span><span class="n">optional</span> <span class="n">number</span> <span class="n">of</span> <span class="n">initial</span> <span class="n">clusters</span> <span class="n">to</span> <span class="n">sample</span> <span class="n">from</span> <span class="n">input</span> <span class="n">vectors</span><span class="o">></span> <span class="o">\</span> 
+ <span class="o">-</span><span class="n">dm</span> <span class="o"><</span><span class="n">DistanceMeasure</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">x</span> <span class="o"><</span><span class="n">maximum</span> <span class="n">number</span> <span class="n">of</span> <span class="n">iterations</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">cd</span> <span class="o"><</span><span class="n">optional</span> <span class="n">convergence</span> <span class="n">delta</span><span class="p">.</span> <span class="n">Default</span> <span class="n">is</span> 0<span class="p">.</span>5<span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">ow</span> <span class="o"><</span><span class="n">overwrite</span> <span class="n">output</span> <span class="n">directory</span> <span class="k">if</span> <span class="n">present</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">cl</span> <span class="o"><</span><span class="n">run</span> <span class="n">input</span> <span class="n">vector</span> <span class="n">clustering</span> <span class="n">after</span> <span class="n">computing</span> <span class="n">Canopies</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">xm</span> <span class="o"><</span><span class="n">execution</span> <span class="n">method</span><span class="p">:</span> <span class="n">sequential</span> <span class="n">or</span> <span class="n">mapreduce</span><span class="o">></span> +</pre></div> + + +<p>Note: if the -k argument is supplied, any clusters in the -c directory +will be overwritten, and the specified number of random points will be sampled from the input +vectors to become the initial cluster centers.</p> +<p>Invocation using Java involves supplying the following arguments:</p> +<ol> +<li>input: a file path string to a directory containing the input data set, a +SequenceFile(WritableComparable, 
VectorWritable). The sequence file <em>key</em> +is not used.</li> +<li>clusters: a file path string to a directory containing the initial +clusters, a SequenceFile(key, Cluster | Canopy). Both KMeans clusters and +Canopy canopies may be used for the initial clusters.</li> +<li>output: a file path string to an empty directory which is used for all +output from the algorithm.</li> +<li>distanceMeasure: the fully-qualified class name of the +DistanceMeasure implementation which will be used for the clustering.</li> +<li>convergenceDelta: a double value used to determine if the algorithm has +converged (clusters have not moved more than this value in the last +iteration).</li> +<li>maxIter: the maximum number of iterations to run, independent of the +convergence specified.</li> +<li>runClustering: a boolean indicating, if true, that the clustering step is +to be executed after clusters have been determined.</li> +<li>runSequential: a boolean indicating, if true, that the k-means sequential +implementation is to be used to process the input data.</li> +</ol> +<p>After running the algorithm, the output directory will contain:</p> +<ol> +<li>clusters-N: directories containing SequenceFiles(Text, Cluster) produced +by the algorithm for each iteration. The Text <em>key</em> is a cluster identifier +string.</li> +<li>clusteredPoints: (if --clustering enabled) a directory containing +SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable <em>key</em> is +the clusterId. The WeightedVectorWritable <em>value</em> is a bean containing a +double <em>weight</em> and a VectorWritable <em>vector</em>, where the weight indicates +the probability that the vector is a member of the cluster. 
For k-Means +clustering, the weights are computed as 1/(1+distance), where the distance +is measured between the cluster center and the vector using the chosen +DistanceMeasure.</li> +</ol> +<p><a name="K-MeansClustering-Examples"></a></p> +<h1 id="examples">Examples</h1> +<p>The following images illustrate k-Means clustering applied to a set of +randomly-generated 2-d data points. The points are generated using a normal +distribution centered at a mean location and with a constant standard +deviation. See the README file at <a href="https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/clustering/display/README.txt">/examples/src/main/java/org/apache/mahout/clustering/display/README.txt</a> + for details on running similar examples.</p> +<p>The points are generated as follows:</p> +<ul> +<li>500 samples m=[1.0, 1.0] + sd=3.0</li> +<li>300 samples m=[1.0, 0.0] + sd=0.5</li> +<li>300 samples m=[0.0, 2.0] + sd=0.1</li> +</ul> +<p>In the first image, the points are plotted and the 3-sigma boundaries of +their generator are superimposed.</p> +<p><img alt="Sample data graph" src="../../images/SampleData.png" /></p> +<p>In the second image, the resulting clusters (k=3) are shown superimposed upon the sample data. As k-Means is an iterative algorithm, the centers of the clusters in each of the most recent iterations are shown using different colors. Bold red is the final clustering and previous iterations are shown in orange, yellow, green, blue, violet and gray. +Although it misses a lot of the points and cannot capture the original, +superimposed cluster centers, it does a decent job of clustering this data.</p> +<p><img alt="kmeans" src="../../images/KMeans.png" /></p> +<p>The third image shows the results of running k-Means on a different dataset, which is generated using asymmetrical standard deviations. 
+K-Means does a fair job handling this data set as well.</p> +<p><img alt="2d kmeans" src="../../images/2dKMeans.png" /></p> + </div> + </div> +</div> + <footer class="footer" align="center"> + <div class="container"> + <p> + Copyright © 2014 The Apache Software Foundation, Licensed under + the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br /> + Apache and the Apache feather logos are trademarks of The Apache Software Foundation. + </p> + </div> + </footer> + + <script src="/js/jquery-1.9.1.min.js"></script> + <script src="/js/bootstrap.min.js"></script> + <script> + (function() { + var cx = '012254517474945470291:vhsfv7eokdc'; + var gcse = document.createElement('script'); + gcse.type = 'text/javascript'; + gcse.async = true; + gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') + + '//www.google.com/cse/cse.js?cx=' + cx; + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(gcse, s); + })(); + </script> +</body> +</html>
Added: websites/staging/mahout/trunk/content/users/mapreduce/clustering/k-means-commandline.html ============================================================================== --- websites/staging/mahout/trunk/content/users/mapreduce/clustering/k-means-commandline.html (added) +++ websites/staging/mahout/trunk/content/users/mapreduce/clustering/k-means-commandline.html Thu Mar 19 21:21:45 2015 @@ -0,0 +1,359 @@ +<!DOCTYPE html> +<!-- + + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+--> + +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> + <title>Apache Mahout: Scalable machine learning and data mining</title> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + <meta name="Distribution" content="Global"> + <meta name="Robots" content="index,follow"> + <meta name="keywords" content="apache, apache hadoop, apache lucene, + business data mining, cluster analysis, + collaborative filtering, data extraction, data filtering, data framework, data integration, + data matching, data mining, data mining algorithms, data mining analysis, data mining data, + data mining introduction, data mining software, + data mining techniques, data representation, data set, datamining, + feature extraction, fuzzy k means, genetic algorithm, hadoop, + hierarchical clustering, high dimensional, introduction to data mining, kmeans, + knowledge discovery, learning approach, learning approaches, learning methods, + learning techniques, lucene, machine learning, machine translation, mahout apache, + mahout taste, map reduce hadoop, mining data, mining methods, naive bayes, + natural language processing, + supervised, text mining, time series data, unsupervised, web data mining"> + <link rel="shortcut icon" type="image/x-icon" href="http://mahout.apache.org/images/favicon.ico"> + <script type="text/javascript" src="/js/prototype.js"></script> + <script type="text/javascript" src="/js/effects.js"></script> + <script type="text/javascript" src="/js/search.js"></script> + <script type="text/javascript" src="/js/slides.js"></script> + + <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen"> + <link href="/css/bootstrap-responsive.css" rel="stylesheet"> + <link rel="stylesheet" href="/css/global.css" type="text/css"> + + <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown --> + <script type="text/x-mathjax-config"> + MathJax.Hub.Config({ 
+ tex2jax: { + skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] + } + }); + MathJax.Hub.Queue(function() { + var all = MathJax.Hub.getAllJax(), i; + for(i = 0; i < all.length; i += 1) { + all[i].SourceElement().parentNode.className += ' has-jax'; + } + }); + </script> + <script type="text/javascript"> + var mathjax = document.createElement('script'); + mathjax.type = 'text/javascript'; + mathjax.async = true; + + mathjax.src = ('https:' == document.location.protocol) ? + 'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' : + 'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; + + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(mathjax, s); + </script> +</head> + +<body id="home" data-twttr-rendered="true"> + <div id="wrap"> + <div id="header"> + <div id="logo"><a href="/overview.html"></a></div> + <div id="search"> + <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right"> + <input value="http://mahout.apache.org" name="sitesearch" type="hidden"> + <input class="search-query" name="q" id="query" type="text"> + <input id="submission" type="image" src="/images/mahout-lupe.png" alt="Search" /> + </form> + </div> + + <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;"> + <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <!-- <a class="brand" href="#">Apache Community Development Project</a> --> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="/">Home</a></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" 
data-toggle="dropdown">General<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/general/downloads.html">Downloads</a> + <li><a href="/general/who-we-are.html">Who we are</a> + <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a> + <li><a href="/general/release-notes.html">Release Notes</a> + <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li> + <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a> + <li><a href="/general/professional-support.html">Professional Support</a> + <li class="divider"></li> + <li class="nav-header">Resources</li> + <li><a href="/general/reference-reading.html">Reference Reading</a> + <li><a href="/general/faq.html">FAQ</a> + <li class="divider"></li> + <li class="nav-header">Legal</li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + <li><a href="/general/privacy-policy.html">Privacy Policy</a> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/developers/developer-resources.html">Developer resources</a></li> + <li><a href="/developers/version-control.html">Version control</a></li> + <li><a href="/developers/buildingmahout.html">Build from source</a></li> + <li><a href="/developers/issue-tracker.html">Issue tracker</a></li> + <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li> + <li class="divider"></li> + <li class="nav-header">Contributions</li> + <li><a href="/developers/how-to-contribute.html">How to contribute</a></li> + <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li> + <li><a href="/developers/gsoc.html">GSoC</a></li> + <li class="divider"></li> + <li class="nav-header">For committers</li> + <li><a 
href="/developers/how-to-update-the-website.html">How to update the website</a></li> + <li><a href="/developers/patch-check-list.html">Patch check list</a></li> + <li><a href="/developers/github.html">Handling Github PRs</a></li> + <li><a href="/developers/how-to-release.html">How to release</a></li> + <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Basics<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/basics/algorithms.html">List of algorithms</a> + <li><a href="/users/basics/quickstart.html">Quickstart</a> + <li class="divider"></li> + <li class="nav-header">Working with text</li> + <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a> + <li><a href="/users/basics/collocations.html">Collocations</a> + <li class="divider"></li> + <li class="nav-header">Dimensionality reduction</li> + <li><a href="/users/dim-reduction/dimensional-reduction.html">Singular Value Decomposition</a></li> + <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li> + <li class="divider"></li> + <li class="nav-header">Topic Models</li> + <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Spark<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/sparkbindings/home.html">Scala & Spark Bindings Overview</a></li> + <li><a href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark Shell</a></li> + <li class="divider"></li> + <li><a href="/users/sparkbindings/faq.html">FAQ</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Classification<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a 
href="/users/mapreduce/classification/bayesian.html">Naive Bayes</a></li> + <li><a href="/users/mapreduce/classification/hidden-markov-models.html">Hidden Markov Models</a></li> + <li><a href="/users/mapreduce/classification/logistic-regression.html">Logistic Regression</a></li> + <li><a href="/users/mapreduce/classification/partial-implementation.html">Random Forest</a></li> + + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/classification/breiman-example.html">Breiman example</a></li> + <li><a href="/users/mapreduce/classification/twenty-newsgroups.html">20 newsgroups example</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Clustering<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/clustering/k-means-clustering.html">k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-clustering.html">Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li> + <li><a href="/users/mapreduce/clustering/streaming-k-means.html">Streaming KMeans</a></li> + <li><a href="/users/mapreduce/clustering/spectral-clustering.html">Spectral Clustering</a></li> + <li class="divider"></li> + <li class="nav-header">Commandline usage</li> + <li><a href="/users/mapreduce/clustering/k-means-commandline.html">Options for k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-commandline.html">Options for Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li> + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li> + <li class="divider"></li> + <li class="nav-header">Post processing</li> + <li><a href="/users/mapreduce/clustering/cluster-dumper.html">Cluster Dumper tool</a></li> + <li><a 
href="/users/mapreduce/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/recommender/quickstart.html">Quickstart</a></li> + <li><a href="/users/mapreduce/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li> + <li><a href="/users/mapreduce/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 minutes</a></li> + <li><a href="/users/mapreduce/recommender/matrix-factorization.html">Matrix factorization-based<br/> recommenders</a></li> + <li><a href="/users/mapreduce/recommender/recommender-documentation.html">Overview</a></li> + <li class="divider"></li> + <li class="nav-header">Hadoop</li> + <li><a href="/users/mapreduce/recommender/intro-itembased-hadoop.html">Intro to item-based recommendations<br/> with Hadoop</a></li> + <li><a href="/users/mapreduce/recommender/intro-als-hadoop.html">Intro to ALS recommendations<br/> with Hadoop</a></li> + <li class="nav-header">Spark</li> + <li><a href="/users/mapreduce/recommender/intro-cooccurrence-spark.html">Intro to cooccurrence-based<br/> recommendations with Spark</a></li> + </ul> + </li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + +</div> + + <div id="sidebar"> + <div id="sidebar-wrap"> + <h2>Twitter</h2> + <ul class="sidemenu"> + <li> +<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a> +<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script> +</li> + </ul> + <h2>Apache Software Foundation</h2> + <ul class="sidemenu"> + <li><a 
href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li> + <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li> + <li><a href="http://www.apache.org/dev/">Developer Resources</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + </ul> + <h2>Related Projects</h2> + <ul class="sidemenu"> + <li><a href="http://lucene.apache.org/">Lucene</a></li> + <li><a href="http://hadoop.apache.org/">Hadoop</a></li> + </ul> + </div> +</div> + + <div id="content-wrap" class="clearfix"> + <div id="main"> + <p><a name="k-means-commandline-Introduction"></a></p> +<h1 id="kmeans-commandline-introduction">kMeans commandline introduction</h1> +<p>This quick start page describes how to run the kMeans clustering algorithm +on a Hadoop cluster. </p> +<p><a name="k-means-commandline-Steps"></a></p> +<h1 id="steps">Steps</h1> +<p>Mahout's k-Means clustering can be launched with the same command line +invocation whether you are running on a single machine in stand-alone mode +or on a larger Hadoop cluster. The difference is determined by the +$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both point to +an operating Hadoop cluster on the target machine, then the invocation will +run k-Means on that cluster. If either of the environment variables is +missing, then the stand-alone Hadoop configuration will be invoked instead.</p> +<div class="codehilite"><pre><span class="o">./</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">kmeans</span> <span class="o"><</span><span class="n">OPTIONS</span><span class="o">></span> +</pre></div> + + +<p>In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job +will be generated in $MAHOUT_HOME/core/target/ and its name will contain +the Mahout version number. 
For example, when using the Mahout 0.3 release, the +job will be mahout-core-0.3.job.</p> +<p><a name="k-means-commandline-Testingitononesinglemachinew/ocluster"></a></p> +<h2 id="testing-it-on-one-single-machine-wo-cluster">Testing it on a single machine without a cluster</h2> +<ul> +<li>Put the data: cp <PATH TO DATA> testdata</li> +<li> +<p>Run the job: </p> +<p>./bin/mahout kmeans -i testdata -o output -c clusters -dm +org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k +25</p> +</li> +</ul> +<p><a name="k-means-commandline-Runningitonthecluster"></a></p> +<h2 id="running-it-on-the-cluster">Running it on the cluster</h2> +<ul> +<li>(As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh</li> +<li>Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata</li> +<li> +<p>Run the job: </p> +<p>export HADOOP_HOME=<Hadoop Home Directory> +export HADOOP_CONF_DIR=$HADOOP_HOME/conf +./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25</p> +</li> +<li> +<p>Get the data out of HDFS and have a look. 
Use bin/hadoop fs -lsr output +to view all outputs.</p> +</li> +</ul> +<p><a name="k-means-commandline-Commandlineoptions"></a></p> +<h1 id="command-line-options">Command line options</h1> +<div class="codehilite"><pre> <span class="o">--</span><span class="n">input</span> <span class="p">(</span><span class="o">-</span><span class="nb">i</span><span class="p">)</span> <span class="n">input</span> <span class="n">Path</span> <span class="n">to</span> <span class="n">job</span> <span class="n">input</span> <span class="n">directory</span><span class="p">.</span> + <span class="n">Must</span> <span class="n">be</span> <span class="n">a</span> <span class="n">SequenceFile</span> <span class="n">of</span> + <span class="n">VectorWritable</span> + <span class="o">--</span><span class="n">clusters</span> <span class="p">(</span><span class="o">-</span><span class="n">c</span><span class="p">)</span> <span class="n">clusters</span> <span class="n">The</span> <span class="n">input</span> <span class="n">centroids</span><span class="p">,</span> <span class="n">as</span> <span class="n">Vectors</span><span class="p">.</span> + <span class="n">Must</span> <span class="n">be</span> <span class="n">a</span> <span class="n">SequenceFile</span> <span class="n">of</span> + <span class="n">Writable</span><span class="p">,</span> <span class="n">Cluster</span><span class="o">/</span><span class="n">Canopy</span><span class="p">.</span> <span class="n">If</span> <span class="n">k</span> + <span class="n">is</span> <span class="n">also</span> <span class="n">specified</span><span class="p">,</span> <span class="n">then</span> <span class="n">a</span> <span class="n">random</span> + <span class="n">set</span> <span class="n">of</span> <span class="n">vectors</span> <span class="n">will</span> <span class="n">be</span> <span class="n">selected</span> + <span class="n">and</span> <span class="n">written</span> <span class="n">out</span> <span class="n">to</span> <span 
class="n">this</span> <span class="n">path</span> + <span class="n">first</span> + <span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span> <span class="n">The</span> <span class="n">directory</span> <span class="n">pathname</span> <span class="k">for</span> + <span class="n">output</span><span class="p">.</span> + <span class="o">--</span><span class="n">distanceMeasure</span> <span class="p">(</span><span class="o">-</span><span class="n">dm</span><span class="p">)</span> <span class="n">distanceMeasure</span> <span class="n">The</span> <span class="n">classname</span> <span class="n">of</span> <span class="n">the</span> + <span class="n">DistanceMeasure</span><span class="p">.</span> <span class="n">Default</span> <span class="n">is</span> + <span class="n">SquaredEuclidean</span> + <span class="o">--</span><span class="n">convergenceDelta</span> <span class="p">(</span><span class="o">-</span><span class="n">cd</span><span class="p">)</span> <span class="n">convergenceDelta</span> <span class="n">The</span> <span class="n">convergence</span> <span class="n">delta</span> <span class="n">value</span><span class="p">.</span> + <span class="n">Default</span> <span class="n">is</span> 0<span class="p">.</span>5 + <span class="o">--</span><span class="n">maxIter</span> <span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="n">maxIter</span> <span class="n">The</span> <span class="n">maximum</span> <span class="n">number</span> <span class="n">of</span> + <span class="n">iterations</span><span class="p">.</span> + <span class="o">--</span><span class="n">maxRed</span> <span class="p">(</span><span class="o">-</span><span class="n">r</span><span class="p">)</span> <span class="n">maxRed</span> <span class="n">The</span> <span class="n">number</span> <span class="n">of</span> <span 
class="n">reduce</span> <span class="n">tasks</span><span class="p">.</span> + <span class="n">Defaults</span> <span class="n">to</span> 2 + <span class="o">--</span><span class="n">k</span> <span class="p">(</span><span class="o">-</span><span class="n">k</span><span class="p">)</span> <span class="n">k</span> <span class="n">The</span> <span class="n">k</span> <span class="n">in</span> <span class="n">k</span><span class="o">-</span><span class="n">Means</span><span class="p">.</span> <span class="n">If</span> <span class="n">specified</span><span class="p">,</span> + <span class="n">then</span> <span class="n">a</span> <span class="n">random</span> <span class="n">selection</span> <span class="n">of</span> <span class="n">k</span> + <span class="n">Vectors</span> <span class="n">will</span> <span class="n">be</span> <span class="n">chosen</span> <span class="n">as</span> <span class="n">the</span> + <span class="n">Centroid</span> <span class="n">and</span> <span class="n">written</span> <span class="n">to</span> <span class="n">the</span> + <span class="n">clusters</span> <span class="n">input</span> <span class="n">path</span><span class="p">.</span> + <span class="o">--</span><span class="n">overwrite</span> <span class="p">(</span><span class="o">-</span><span class="n">ow</span><span class="p">)</span> <span class="n">If</span> <span class="n">present</span><span class="p">,</span> <span class="n">overwrite</span> <span class="n">the</span> <span class="n">output</span> + <span class="n">directory</span> <span class="n">before</span> <span class="n">running</span> <span class="n">job</span> + <span class="o">--</span><span class="n">help</span> <span class="p">(</span><span class="o">-</span><span class="n">h</span><span class="p">)</span> <span class="n">Print</span> <span class="n">out</span> <span class="n">help</span> + <span class="o">--</span><span class="n">clustering</span> <span class="p">(</span><span class="o">-</span><span 
class="n">cl</span><span class="p">)</span> <span class="n">If</span> <span class="n">present</span><span class="p">,</span> <span class="n">run</span> <span class="n">clustering</span> <span class="n">after</span> + <span class="n">the</span> <span class="n">iterations</span> <span class="n">have</span> <span class="n">taken</span> <span class="n">place</span> +</pre></div> + </div> + </div> +</div> + <footer class="footer" align="center"> + <div class="container"> + <p> + Copyright © 2014 The Apache Software Foundation, Licensed under + the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br /> + Apache and the Apache feather logos are trademarks of The Apache Software Foundation. + </p> + </div> + </footer> + + <script src="/js/jquery-1.9.1.min.js"></script> + <script src="/js/bootstrap.min.js"></script> + <script> + (function() { + var cx = '012254517474945470291:vhsfv7eokdc'; + var gcse = document.createElement('script'); + gcse.type = 'text/javascript'; + gcse.async = true; + gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') + + '//www.google.com/cse/cse.js?cx=' + cx; + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(gcse, s); + })(); + </script> +</body> +</html> Added: websites/staging/mahout/trunk/content/users/mapreduce/clustering/latent-dirichlet-allocation.html ============================================================================== --- websites/staging/mahout/trunk/content/users/mapreduce/clustering/latent-dirichlet-allocation.html (added) +++ websites/staging/mahout/trunk/content/users/mapreduce/clustering/latent-dirichlet-allocation.html Thu Mar 19 21:21:45 2015 @@ -0,0 +1,401 @@ +<!DOCTYPE html> +<!-- + + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. 
+ The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> + <title>Apache Mahout: Scalable machine learning and data mining</title> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + <meta name="Distribution" content="Global"> + <meta name="Robots" content="index,follow"> + <meta name="keywords" content="apache, apache hadoop, apache lucene, + business data mining, cluster analysis, + collaborative filtering, data extraction, data filtering, data framework, data integration, + data matching, data mining, data mining algorithms, data mining analysis, data mining data, + data mining introduction, data mining software, + data mining techniques, data representation, data set, datamining, + feature extraction, fuzzy k means, genetic algorithm, hadoop, + hierarchical clustering, high dimensional, introduction to data mining, kmeans, + knowledge discovery, learning approach, learning approaches, learning methods, + learning techniques, lucene, machine learning, machine translation, mahout apache, + mahout taste, map reduce hadoop, mining data, mining methods, naive bayes, + natural language processing, + supervised, text mining, time series data, unsupervised, web data mining"> + <link rel="shortcut icon" type="image/x-icon" href="http://mahout.apache.org/images/favicon.ico"> + <script type="text/javascript" 
src="/js/prototype.js"></script> + <script type="text/javascript" src="/js/effects.js"></script> + <script type="text/javascript" src="/js/search.js"></script> + <script type="text/javascript" src="/js/slides.js"></script> + + <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen"> + <link href="/css/bootstrap-responsive.css" rel="stylesheet"> + <link rel="stylesheet" href="/css/global.css" type="text/css"> + + <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown --> + <script type="text/x-mathjax-config"> + MathJax.Hub.Config({ + tex2jax: { + skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] + } + }); + MathJax.Hub.Queue(function() { + var all = MathJax.Hub.getAllJax(), i; + for(i = 0; i < all.length; i += 1) { + all[i].SourceElement().parentNode.className += ' has-jax'; + } + }); + </script> + <script type="text/javascript"> + var mathjax = document.createElement('script'); + mathjax.type = 'text/javascript'; + mathjax.async = true; + + mathjax.src = ('https:' == document.location.protocol) ? 
+ 'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' : + 'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; + + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(mathjax, s); + </script> +</head> + +<body id="home" data-twttr-rendered="true"> + <div id="wrap"> + <div id="header"> + <div id="logo"><a href="/overview.html"></a></div> + <div id="search"> + <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right"> + <input value="http://mahout.apache.org" name="sitesearch" type="hidden"> + <input class="search-query" name="q" id="query" type="text"> + <input id="submission" type="image" src="/images/mahout-lupe.png" alt="Search" /> + </form> + </div> + + <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;"> + <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <!-- <a class="brand" href="#">Apache Community Development Project</a> --> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="/">Home</a></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">General<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/general/downloads.html">Downloads</a> + <li><a href="/general/who-we-are.html">Who we are</a> + <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a> + <li><a href="/general/release-notes.html">Release Notes</a> + <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li> + <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a> + <li><a 
href="/general/professional-support.html">Professional Support</a> + <li class="divider"></li> + <li class="nav-header">Resources</li> + <li><a href="/general/reference-reading.html">Reference Reading</a> + <li><a href="/general/faq.html">FAQ</a> + <li class="divider"></li> + <li class="nav-header">Legal</li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + <li><a href="/general/privacy-policy.html">Privacy Policy</a> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/developers/developer-resources.html">Developer resources</a></li> + <li><a href="/developers/version-control.html">Version control</a></li> + <li><a href="/developers/buildingmahout.html">Build from source</a></li> + <li><a href="/developers/issue-tracker.html">Issue tracker</a></li> + <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li> + <li class="divider"></li> + <li class="nav-header">Contributions</li> + <li><a href="/developers/how-to-contribute.html">How to contribute</a></li> + <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li> + <li><a href="/developers/gsoc.html">GSoC</a></li> + <li class="divider"></li> + <li class="nav-header">For committers</li> + <li><a href="/developers/how-to-update-the-website.html">How to update the website</a></li> + <li><a href="/developers/patch-check-list.html">Patch check list</a></li> + <li><a href="/developers/github.html">Handling Github PRs</a></li> + <li><a href="/developers/how-to-release.html">How to release</a></li> + <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Basics<b class="caret"></b></a> + <ul 
class="dropdown-menu"> + <li><a href="/users/basics/algorithms.html">List of algorithms</a> + <li><a href="/users/basics/quickstart.html">Quickstart</a> + <li class="divider"></li> + <li class="nav-header">Working with text</li> + <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a> + <li><a href="/users/basics/collocations.html">Collocations</a> + <li class="divider"></li> + <li class="nav-header">Dimensionality reduction</li> + <li><a href="/users/dim-reduction/dimensional-reduction.html">Singular Value Decomposition</a></li> + <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li> + <li class="divider"></li> + <li class="nav-header">Topic Models</li> + <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Spark<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/sparkbindings/home.html">Scala & Spark Bindings Overview</a></li> + <li><a href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark Shell</a></li> + <li class="divider"></li> + <li><a href="/users/sparkbindings/faq.html">FAQ</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Classification<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/classification/bayesian.html">Naive Bayes</a></li> + <li><a href="/users/mapreduce/classification/hidden-markov-models.html">Hidden Markov Models</a></li> + <li><a href="/users/mapreduce/classification/logistic-regression.html">Logistic Regression</a></li> + <li><a href="/users/mapreduce/classification/partial-implementation.html">Random Forest</a></li> + + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/classification/breiman-example.html">Breiman example</a></li> + <li><a 
href="/users/mapreduce/classification/twenty-newsgroups.html">20 newsgroups example</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Clustering<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/clustering/k-means-clustering.html">k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-clustering.html">Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li> + <li><a href="/users/mapreduce/clustering/streaming-k-means.html">Streaming KMeans</a></li> + <li><a href="/users/mapreduce/clustering/spectral-clustering.html">Spectral Clustering</a></li> + <li class="divider"></li> + <li class="nav-header">Commandline usage</li> + <li><a href="/users/mapreduce/clustering/k-means-commandline.html">Options for k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-commandline.html">Options for Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li> + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li> + <li class="divider"></li> + <li class="nav-header">Post processing</li> + <li><a href="/users/mapreduce/clustering/cluster-dumper.html">Cluster Dumper tool</a></li> + <li><a href="/users/mapreduce/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/recommender/quickstart.html">Quickstart</a></li> + <li><a href="/users/mapreduce/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li> + <li><a href="/users/mapreduce/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 
minutes</a></li> + <li><a href="/users/mapreduce/recommender/matrix-factorization.html">Matrix factorization-based<br/> recommenders</a></li> + <li><a href="/users/mapreduce/recommender/recommender-documentation.html">Overview</a></li> + <li class="divider"></li> + <li class="nav-header">Hadoop</li> + <li><a href="/users/mapreduce/recommender/intro-itembased-hadoop.html">Intro to item-based recommendations<br/> with Hadoop</a></li> + <li><a href="/users/mapreduce/recommender/intro-als-hadoop.html">Intro to ALS recommendations<br/> with Hadoop</a></li> + <li class="nav-header">Spark</li> + <li><a href="/users/mapreduce/recommender/intro-cooccurrence-spark.html">Intro to cooccurrence-based<br/> recommendations with Spark</a></li> + </ul> + </li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + +</div> + + <div id="sidebar"> + <div id="sidebar-wrap"> + <h2>Twitter</h2> + <ul class="sidemenu"> + <li> +<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a> +<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script> +</li> + </ul> + <h2>Apache Software Foundation</h2> + <ul class="sidemenu"> + <li><a href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li> + <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li> + <li><a href="http://www.apache.org/dev/">Developer Resources</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + </ul> + <h2>Related Projects</h2> + <ul class="sidemenu"> + <li><a href="http://lucene.apache.org/">Lucene</a></li> + <li><a 
href="http://hadoop.apache.org/">Hadoop</a></li> + </ul> + </div> +</div> + + <div id="content-wrap" class="clearfix"> + <div id="main"> + <p><a name="LatentDirichletAllocation-Overview"></a></p> +<h1 id="overview">Overview</h1> +<p>Latent Dirichlet Allocation (Blei et al., 2003) is a powerful learning +algorithm for automatically and jointly clustering words into "topics" and +documents into mixtures of topics. It has been successfully applied to +model change in scientific fields over time (Griffiths and Steyvers, 2004; +Hall et al., 2008). </p> +<p>A topic model is, roughly, a hierarchical Bayesian model that associates +with each document a probability distribution over "topics", which are in +turn distributions over words. For instance, a topic in a collection of +newswire might include words about "sports", such as "baseball", "home +run", "player", and a document about steroid use in baseball might include +the topics "sports", "drugs", and "politics". Note that the labels "sports", "drugs", +and "politics" are post-hoc labels assigned by a human, and that the +algorithm itself only associates words with probabilities. The task +of parameter estimation in these models is to learn both what the topics +are, and which documents employ them in what proportions.</p> +<p>Another way to view a topic model is as a generalization of a mixture model +like <a href="http://en.wikipedia.org/wiki/Dirichlet_process">Dirichlet Process Clustering</a>. +Starting from a normal mixture model, in which we have a single global +mixture of several distributions, we instead say that <em>each</em> document has +its own mixture distribution over the globally shared mixture components. 
+Operationally, in Dirichlet Process Clustering each document has its own +latent variable drawn from a global mixture that specifies which model it +belongs to, while in LDA each word in each document has its own parameter +drawn from a document-wide mixture.</p> +<p>The idea is that we use a probabilistic mixture of a number of models +to explain some observed data. Each observed data point is assumed +to have come from one of the models in the mixture, but we don't know +which. We deal with that by using a so-called latent parameter +that specifies which model each data point came from.</p> +<p><a name="LatentDirichletAllocation-CollapsedVariationalBayes"></a></p> +<h1 id="collapsed-variational-bayes">Collapsed Variational Bayes</h1> +<p>The CVB algorithm, which is implemented in Mahout for LDA, combines +the advantages of both regular Variational Bayes and Gibbs Sampling. The +algorithm relies on modeling the dependence of parameters on latent variables, +which are in turn mutually independent. The algorithm uses two +methodologies: one is to marginalize out parameters when calculating the joint +distribution, and the other is to model the posterior of theta and phi +given the inputs z and x.</p> +<p>A common approach in the CVB algorithm is to compute each expectation term +using a simple Gaussian approximation, which is accurate and requires low +computational overhead. The specifics behind the approximation involve +computing the sum of the means and variances of the individual Bernoulli +variables.</p> +<p>CVB with Gaussian approximation is implemented by tracking the mean and +variance and subtracting the mean and variance of the corresponding +Bernoulli variables. The computational cost of the algorithm scales on +the order of O(K) with each update to q(z(i,j)). 
Also, for each +document/word pair only one copy of the variational posterior over the +latent variable is required.</p> +<p><a name="LatentDirichletAllocation-InvocationandUsage"></a></p> +<h1 id="invocation-and-usage">Invocation and Usage</h1> +<p>Mahout's implementation of LDA operates on a collection of SparseVectors of +word counts. These word counts should be non-negative integers, though +things will probably work fine if you use non-negative reals. (Note +that the probabilistic model doesn't make sense if you do!) To create these +vectors, it's recommended that you follow the instructions in <a href="../basics/creating-vectors-from-text.html">Creating Vectors From Text</a>, +making sure to use TF and not TFIDF as the scorer.</p> +<p>Invocation takes the form:</p> +<div class="codehilite"><pre><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">cvb</span> <span class="o">\</span> + <span class="o">-</span><span class="nb">i</span> <span class="o"><</span><span class="n">input</span> <span class="n">path</span> <span class="k">for</span> <span class="n">document</span> <span class="n">vectors</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">dict</span> <span class="o"><</span><span class="n">path</span> <span class="n">to</span> <span class="n">term</span><span class="o">-</span><span class="n">dictionary</span> <span class="n">file</span><span class="p">(</span><span class="n">s</span><span class="p">)</span><span class="p">,</span> <span class="n">glob</span> <span class="n">expression</span> <span class="n">supported</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">o</span> <span class="o"><</span><span class="n">output</span> <span class="n">path</span> <span class="k">for</span> <span class="n">topic</span><span class="o">-</span><span class="n">term</span> <span class="n">distributions</span><span 
class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">dt</span> <span class="o"><</span><span class="n">output</span> <span class="n">path</span> <span class="k">for</span> <span class="n">doc</span><span class="o">-</span><span class="n">topic</span> <span class="n">distributions</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">k</span> <span class="o"><</span><span class="n">number</span> <span class="n">of</span> <span class="n">latent</span> <span class="n">topics</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">nt</span> <span class="o"><</span><span class="n">number</span> <span class="n">of</span> <span class="n">unique</span> <span class="n">features</span> <span class="n">defined</span> <span class="n">by</span> <span class="n">input</span> <span class="n">document</span> <span class="n">vectors</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">mt</span> <span class="o"><</span><span class="n">path</span> <span class="n">to</span> <span class="n">store</span> <span class="n">model</span> <span class="n">state</span> <span class="n">after</span> <span class="n">each</span> <span class="n">iteration</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">maxIter</span> <span class="o"><</span><span class="n">max</span> <span class="n">number</span> <span class="n">of</span> <span class="n">iterations</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">mipd</span> <span class="o"><</span><span class="n">max</span> <span class="n">number</span> <span class="n">of</span> <span class="n">iterations</span> <span class="n">per</span> <span class="n">doc</span> <span class="k">for</span> <span class="n">learning</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">a</span> <span 
class="o"><</span><span class="n">smoothing</span> <span class="k">for</span> <span class="n">doc</span> <span class="n">topic</span> <span class="n">distributions</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">e</span> <span class="o"><</span><span class="n">smoothing</span> <span class="k">for</span> <span class="n">term</span> <span class="n">topic</span> <span class="n">distributions</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">seed</span> <span class="o"><</span><span class="n">random</span> <span class="n">seed</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">tf</span> <span class="o"><</span><span class="n">fraction</span> <span class="n">of</span> <span class="n">data</span> <span class="n">to</span> <span class="n">hold</span> <span class="k">for</span> <span class="n">testing</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">block</span> <span class="o"><</span><span class="n">number</span> <span class="n">of</span> <span class="n">iterations</span> <span class="n">per</span> <span class="n">perplexity</span> <span class="n">check</span><span class="p">,</span> <span class="n">ignored</span> <span class="n">unless</span> <span class="n">test_set_percentage</span><span class="o">></span><span class="mi">0</span><span class="o">></span> +</pre></div> + + +<p>Topic smoothing should generally be about 50/K, where K is the number of +topics. The number of words in the vocabulary can be an upper bound, though +it shouldn't be too high (because of memory concerns). 
</p> +<p>Choosing the number of topics is more art than science, and it's +recommended that you try several values.</p> +<p>After running LDA you can obtain an output of the computed topics using the +LDAPrintTopics utility:</p> +<div class="codehilite"><pre><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">ldatopics</span> <span class="o">\</span> + <span class="o">-</span><span class="nb">i</span> <span class="o"><</span><span class="n">input</span> <span class="n">vectors</span> <span class="n">directory</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">d</span> <span class="o"><</span><span class="n">input</span> <span class="n">dictionary</span> <span class="n">file</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">w</span> <span class="o"><</span><span class="n">optional</span> <span class="n">number</span> <span class="n">of</span> <span class="n">words</span> <span class="n">to</span> <span class="n">print</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">o</span> <span class="o"><</span><span class="n">optional</span> <span class="n">output</span> <span class="n">working</span> <span class="n">directory</span><span class="p">.</span> <span class="n">Default</span> <span class="n">is</span> <span class="n">to</span> <span class="n">console</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">h</span> <span class="o"><</span><span class="n">print</span> <span class="n">out</span> <span class="n">help</span><span class="o">></span> <span class="o">\</span> + <span class="o">-</span><span class="n">dt</span> <span class="o"><</span><span class="n">optional</span> <span class="n">dictionary</span> <span class="n">type</span> <span class="p">(</span><span class="n">text</span><span class="o">|</span><span 
class="n">sequencefile</span><span class="p">).</span> <span class="n">Default</span> <span class="n">is</span> <span class="n">text</span><span class="o">></span> +</pre></div> + + +<p><a name="LatentDirichletAllocation-Example"></a></p> +<h1 id="example">Example</h1> +<p>An example is located in mahout/examples/bin/build-reuters.sh. The script +automatically downloads the Reuters-21578 corpus, builds a Lucene index and +converts the Lucene index to vectors. By uncommenting the last two lines +you can make it run LDA on the vectors and then print the +resulting topics to the console. </p> +<p>To adapt the example yourself, you should note that Lucene has specialized +support for Reuters, and that building your own index will require some +adaptation. The rest should hopefully not differ too much.</p> +<p><a name="LatentDirichletAllocation-ParameterEstimation"></a></p> +<h1 id="parameter-estimation">Parameter Estimation</h1> +<p>We use mean field variational inference to estimate the models. Variational +inference can be thought of as a generalization of <a href="expectation-maximization.html">EM</a> + for hierarchical Bayesian models. The E-Step consists of inferring, for each +document, the posterior probability of each topic for each word +in that document. We then take the sufficient statistics and emit them in +the form of (log) pseudo-counts for each word in each topic. The M-Step is +simply to sum these together and (log) normalize them so that we have a +distribution over the entire vocabulary of the corpus for each topic. </p> +<p>In the implementation, the E-Step runs in the map step and the M-Step +runs in the reduce step, with the final normalization happening as a +post-processing step.</p> +<p><a name="LatentDirichletAllocation-References"></a></p> +<h1 id="references">References</h1> +<p><a href="http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf">David M. Blei, Andrew Y. Ng, Michael I. 
Jordan. 2003. Latent Dirichlet Allocation. JMLR.</a></p> +<p><a href="http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf">Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. PNAS.</a></p> +<p><a href="http://aclweb.org/anthology//D/D08/D08-1038.pdf">David Hall, Dan Jurafsky, and Christopher D. Manning. 2008. Studying the History of Ideas Using Topic Models. EMNLP.</a></p> + </div> + </div> +</div> + <footer class="footer" align="center"> + <div class="container"> + <p> + Copyright © 2014 The Apache Software Foundation, Licensed under + the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br /> + Apache and the Apache feather logos are trademarks of The Apache Software Foundation. + </p> + </div> + </footer> + + <script src="/js/jquery-1.9.1.min.js"></script> + <script src="/js/bootstrap.min.js"></script> + <script> + (function() { + var cx = '012254517474945470291:vhsfv7eokdc'; + var gcse = document.createElement('script'); + gcse.type = 'text/javascript'; + gcse.async = true; + gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') + + '//www.google.com/cse/cse.js?cx=' + cx; + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(gcse, s); + })(); + </script> +</body> +</html>
