Added: websites/staging/mahout/trunk/content/users/mapreduce/clustering/clustering-of-synthetic-control-data.html ============================================================================== --- websites/staging/mahout/trunk/content/users/mapreduce/clustering/clustering-of-synthetic-control-data.html (added) +++ websites/staging/mahout/trunk/content/users/mapreduce/clustering/clustering-of-synthetic-control-data.html Thu Mar 19 21:21:45 2015 @@ -0,0 +1,318 @@ +<!DOCTYPE html> +<!-- + + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+--> + +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> + <title>Apache Mahout: Scalable machine learning and data mining</title> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + <meta name="Distribution" content="Global"> + <meta name="Robots" content="index,follow"> + <meta name="keywords" content="apache, apache hadoop, apache lucene, + business data mining, cluster analysis, + collaborative filtering, data extraction, data filtering, data framework, data integration, + data matching, data mining, data mining algorithms, data mining analysis, data mining data, + data mining introduction, data mining software, + data mining techniques, data representation, data set, datamining, + feature extraction, fuzzy k means, genetic algorithm, hadoop, + hierarchical clustering, high dimensional, introduction to data mining, kmeans, + knowledge discovery, learning approach, learning approaches, learning methods, + learning techniques, lucene, machine learning, machine translation, mahout apache, + mahout taste, map reduce hadoop, mining data, mining methods, naive bayes, + natural language processing, + supervised, text mining, time series data, unsupervised, web data mining"> + <link rel="shortcut icon" type="image/x-icon" href="http://mahout.apache.org/images/favicon.ico"> + <script type="text/javascript" src="/js/prototype.js"></script> + <script type="text/javascript" src="/js/effects.js"></script> + <script type="text/javascript" src="/js/search.js"></script> + <script type="text/javascript" src="/js/slides.js"></script> + + <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen"> + <link href="/css/bootstrap-responsive.css" rel="stylesheet"> + <link rel="stylesheet" href="/css/global.css" type="text/css"> + + <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown --> + <script type="text/x-mathjax-config"> + MathJax.Hub.Config({ 
+ tex2jax: { + skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] + } + }); + MathJax.Hub.Queue(function() { + var all = MathJax.Hub.getAllJax(), i; + for(i = 0; i < all.length; i += 1) { + all[i].SourceElement().parentNode.className += ' has-jax'; + } + }); + </script> + <script type="text/javascript"> + var mathjax = document.createElement('script'); + mathjax.type = 'text/javascript'; + mathjax.async = true; + + mathjax.src = ('https:' == document.location.protocol) ? + 'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' : + 'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; + + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(mathjax, s); + </script> +</head> + +<body id="home" data-twttr-rendered="true"> + <div id="wrap"> + <div id="header"> + <div id="logo"><a href="/overview.html"></a></div> + <div id="search"> + <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right"> + <input value="http://mahout.apache.org" name="sitesearch" type="hidden"> + <input class="search-query" name="q" id="query" type="text"> + <input id="submission" type="image" src="/images/mahout-lupe.png" alt="Search" /> + </form> + </div> + + <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;"> + <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <!-- <a class="brand" href="#">Apache Community Development Project</a> --> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="/">Home</a></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" 
data-toggle="dropdown">General<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/general/downloads.html">Downloads</a> + <li><a href="/general/who-we-are.html">Who we are</a> + <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a> + <li><a href="/general/release-notes.html">Release Notes</a> + <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li> + <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a> + <li><a href="/general/professional-support.html">Professional Support</a> + <li class="divider"></li> + <li class="nav-header">Resources</li> + <li><a href="/general/reference-reading.html">Reference Reading</a> + <li><a href="/general/faq.html">FAQ</a> + <li class="divider"></li> + <li class="nav-header">Legal</li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + <li><a href="/general/privacy-policy.html">Privacy Policy</a> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/developers/developer-resources.html">Developer resources</a></li> + <li><a href="/developers/version-control.html">Version control</a></li> + <li><a href="/developers/buildingmahout.html">Build from source</a></li> + <li><a href="/developers/issue-tracker.html">Issue tracker</a></li> + <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li> + <li class="divider"></li> + <li class="nav-header">Contributions</li> + <li><a href="/developers/how-to-contribute.html">How to contribute</a></li> + <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li> + <li><a href="/developers/gsoc.html">GSoC</a></li> + <li class="divider"></li> + <li class="nav-header">For committers</li> + <li><a 
href="/developers/how-to-update-the-website.html">How to update the website</a></li> + <li><a href="/developers/patch-check-list.html">Patch check list</a></li> + <li><a href="/developers/github.html">Handling Github PRs</a></li> + <li><a href="/developers/how-to-release.html">How to release</a></li> + <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Basics<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/basics/algorithms.html">List of algorithms</a> + <li><a href="/users/basics/quickstart.html">Quickstart</a> + <li class="divider"></li> + <li class="nav-header">Working with text</li> + <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a> + <li><a href="/users/basics/collocations.html">Collocations</a> + <li class="divider"></li> + <li class="nav-header">Dimensionality reduction</li> + <li><a href="/users/dim-reduction/dimensional-reduction.html">Singular Value Decomposition</a></li> + <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li> + <li class="divider"></li> + <li class="nav-header">Topic Models</li> + <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Spark<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/sparkbindings/home.html">Scala & Spark Bindings Overview</a></li> + <li><a href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark Shell</a></li> + <li class="divider"></li> + <li><a href="/users/sparkbindings/faq.html">FAQ</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Classification<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a 
href="/users/mapreduce/classification/bayesian.html">Naive Bayes</a></li> + <li><a href="/users/mapreduce/classification/hidden-markov-models.html">Hidden Markov Models</a></li> + <li><a href="/users/mapreduce/classification/logistic-regression.html">Logistic Regression</a></li> + <li><a href="/users/mapreduce/classification/partial-implementation.html">Random Forest</a></li> + + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/classification/breiman-example.html">Breiman example</a></li> + <li><a href="/users/mapreduce/classification/twenty-newsgroups.html">20 newsgroups example</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Clustering<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/clustering/k-means-clustering.html">k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-clustering.html">Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li> + <li><a href="/users/mapreduce/clustering/streaming-k-means.html">Streaming KMeans</a></li> + <li><a href="/users/mapreduce/clustering/spectral-clustering.html">Spectral Clustering</a></li> + <li class="divider"></li> + <li class="nav-header">Commandline usage</li> + <li><a href="/users/mapreduce/clustering/k-means-commandline.html">Options for k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-commandline.html">Options for Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li> + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li> + <li class="divider"></li> + <li class="nav-header">Post processing</li> + <li><a href="/users/mapreduce/clustering/cluster-dumper.html">Cluster Dumper tool</a></li> + <li><a 
href="/users/mapreduce/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/recommender/quickstart.html">Quickstart</a></li> + <li><a href="/users/mapreduce/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li> + <li><a href="/users/mapreduce/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 minutes</a></li> + <li><a href="/users/mapreduce/recommender/matrix-factorization.html">Matrix factorization-based<br/> recommenders</a></li> + <li><a href="/users/mapreduce/recommender/recommender-documentation.html">Overview</a></li> + <li class="divider"></li> + <li class="nav-header">Hadoop</li> + <li><a href="/users/mapreduce/recommender/intro-itembased-hadoop.html">Intro to item-based recommendations<br/> with Hadoop</a></li> + <li><a href="/users/mapreduce/recommender/intro-als-hadoop.html">Intro to ALS recommendations<br/> with Hadoop</a></li> + <li class="nav-header">Spark</li> + <li><a href="/users/mapreduce/recommender/intro-cooccurrence-spark.html">Intro to cooccurrence-based<br/> recommendations with Spark</a></li> + </ul> + </li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + +</div> + + <div id="sidebar"> + <div id="sidebar-wrap"> + <h2>Twitter</h2> + <ul class="sidemenu"> + <li> +<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a> +<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script> +</li> + </ul> + <h2>Apache Software Foundation</h2> + <ul class="sidemenu"> + <li><a 
href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li> + <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li> + <li><a href="http://www.apache.org/dev/">Developer Resources</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + </ul> + <h2>Related Projects</h2> + <ul class="sidemenu"> + <li><a href="http://lucene.apache.org/">Lucene</a></li> + <li><a href="http://hadoop.apache.org/">Hadoop</a></li> + </ul> + </div> +</div> + + <div id="content-wrap" class="clearfix"> + <div id="main"> + <h1 id="clustering-synthetic-control-data">Clustering synthetic control data</h1> +<h2 id="introduction">Introduction</h2> +<p>This example demonstrates clustering of time series data, specifically control charts. <a href="http://en.wikipedia.org/wiki/Control_chart">Control charts</a> are tools used to determine whether a manufacturing or business process is in a state of statistical control. Such control charts are generated or simulated repeatedly at equal time intervals. A <a href="http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html">simulated dataset</a> is available in the UCI Machine Learning Repository.</p> +<p>The time series of control charts need to be clustered into closely knit groups. The dataset we use is synthetic and is meant to resemble real-world information in an anonymized format. It contains six classes: Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, and Downward shift. In this example we use Mahout to cluster the data into the corresponding class buckets. 
</p> +<p><em>For the sake of simplicity, we won't use a cluster in this example, but instead show you the commands to run the clustering examples locally with Hadoop</em>.</p> +<h2 id="setup">Setup</h2> +<p>Some initial setup is needed before the example can be run.</p> +<ol> +<li> +<p>Download the dataset to be clustered from the UCI Machine Learning Repository: <a href="http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data">http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data</a>.</p> +</li> +<li> +<p>Download the <a href="/general/downloads.html">latest release of Mahout</a>.</p> +</li> +<li> +<p>Unpack the release binary and switch to the <em>mahout-distribution-0.x</em> folder.</p> +</li> +<li> +<p>Make sure that the <em>JAVA_HOME</em> environment variable points to your local Java installation.</p> +</li> +<li> +<p>Create a folder called <em>testdata</em> in the current directory and copy the dataset into this folder.</p> +</li> +</ol> +<h2 id="clustering-examples">Clustering Examples</h2> +<p>Depending on the clustering algorithm you want to run, use one of the following commands:</p> +<ul> +<li> +<p><a href="/users/clustering/canopy-clustering.html">Canopy Clustering</a></p> +<p>bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job</p> +</li> +<li> +<p><a href="/users/clustering/k-means-clustering.html">k-Means Clustering</a></p> +<p>bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job</p> +</li> +<li> +<p><a href="/users/clustering/fuzzy-k-means.html">Fuzzy k-Means Clustering</a></p> +<p>bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job</p> +</li> +</ul> +<p>The clustering output is written to the <em>output</em> directory. The output data points are in vector format. 
In order to read/analyze the output, you can use the <a href="/users/clustering/cluster-dumper.html">clusterdump</a> utility provided by Mahout.</p> + </div> + </div> +</div> + <footer class="footer" align="center"> + <div class="container"> + <p> + Copyright © 2014 The Apache Software Foundation, Licensed under + the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br /> + Apache and the Apache feather logos are trademarks of The Apache Software Foundation. + </p> + </div> + </footer> + + <script src="/js/jquery-1.9.1.min.js"></script> + <script src="/js/bootstrap.min.js"></script> + <script> + (function() { + var cx = '012254517474945470291:vhsfv7eokdc'; + var gcse = document.createElement('script'); + gcse.type = 'text/javascript'; + gcse.async = true; + gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') + + '//www.google.com/cse/cse.js?cx=' + cx; + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(gcse, s); + })(); + </script> +</body> +</html>
Added: websites/staging/mahout/trunk/content/users/mapreduce/clustering/clustering-seinfeld-episodes.html ============================================================================== --- websites/staging/mahout/trunk/content/users/mapreduce/clustering/clustering-seinfeld-episodes.html (added) +++ websites/staging/mahout/trunk/content/users/mapreduce/clustering/clustering-seinfeld-episodes.html Thu Mar 19 21:21:45 2015 @@ -0,0 +1,280 @@ +<!DOCTYPE html> +<!-- + + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+--> + +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> + <title>Apache Mahout: Scalable machine learning and data mining</title> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + <meta name="Distribution" content="Global"> + <meta name="Robots" content="index,follow"> + <meta name="keywords" content="apache, apache hadoop, apache lucene, + business data mining, cluster analysis, + collaborative filtering, data extraction, data filtering, data framework, data integration, + data matching, data mining, data mining algorithms, data mining analysis, data mining data, + data mining introduction, data mining software, + data mining techniques, data representation, data set, datamining, + feature extraction, fuzzy k means, genetic algorithm, hadoop, + hierarchical clustering, high dimensional, introduction to data mining, kmeans, + knowledge discovery, learning approach, learning approaches, learning methods, + learning techniques, lucene, machine learning, machine translation, mahout apache, + mahout taste, map reduce hadoop, mining data, mining methods, naive bayes, + natural language processing, + supervised, text mining, time series data, unsupervised, web data mining"> + <link rel="shortcut icon" type="image/x-icon" href="http://mahout.apache.org/images/favicon.ico"> + <script type="text/javascript" src="/js/prototype.js"></script> + <script type="text/javascript" src="/js/effects.js"></script> + <script type="text/javascript" src="/js/search.js"></script> + <script type="text/javascript" src="/js/slides.js"></script> + + <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen"> + <link href="/css/bootstrap-responsive.css" rel="stylesheet"> + <link rel="stylesheet" href="/css/global.css" type="text/css"> + + <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown --> + <script type="text/x-mathjax-config"> + MathJax.Hub.Config({ 
+ tex2jax: { + skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] + } + }); + MathJax.Hub.Queue(function() { + var all = MathJax.Hub.getAllJax(), i; + for(i = 0; i < all.length; i += 1) { + all[i].SourceElement().parentNode.className += ' has-jax'; + } + }); + </script> + <script type="text/javascript"> + var mathjax = document.createElement('script'); + mathjax.type = 'text/javascript'; + mathjax.async = true; + + mathjax.src = ('https:' == document.location.protocol) ? + 'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' : + 'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; + + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(mathjax, s); + </script> +</head> + +<body id="home" data-twttr-rendered="true"> + <div id="wrap"> + <div id="header"> + <div id="logo"><a href="/overview.html"></a></div> + <div id="search"> + <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right"> + <input value="http://mahout.apache.org" name="sitesearch" type="hidden"> + <input class="search-query" name="q" id="query" type="text"> + <input id="submission" type="image" src="/images/mahout-lupe.png" alt="Search" /> + </form> + </div> + + <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;"> + <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <!-- <a class="brand" href="#">Apache Community Development Project</a> --> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="/">Home</a></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" 
data-toggle="dropdown">General<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/general/downloads.html">Downloads</a> + <li><a href="/general/who-we-are.html">Who we are</a> + <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a> + <li><a href="/general/release-notes.html">Release Notes</a> + <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li> + <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a> + <li><a href="/general/professional-support.html">Professional Support</a> + <li class="divider"></li> + <li class="nav-header">Resources</li> + <li><a href="/general/reference-reading.html">Reference Reading</a> + <li><a href="/general/faq.html">FAQ</a> + <li class="divider"></li> + <li class="nav-header">Legal</li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + <li><a href="/general/privacy-policy.html">Privacy Policy</a> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/developers/developer-resources.html">Developer resources</a></li> + <li><a href="/developers/version-control.html">Version control</a></li> + <li><a href="/developers/buildingmahout.html">Build from source</a></li> + <li><a href="/developers/issue-tracker.html">Issue tracker</a></li> + <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li> + <li class="divider"></li> + <li class="nav-header">Contributions</li> + <li><a href="/developers/how-to-contribute.html">How to contribute</a></li> + <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li> + <li><a href="/developers/gsoc.html">GSoC</a></li> + <li class="divider"></li> + <li class="nav-header">For committers</li> + <li><a 
href="/developers/how-to-update-the-website.html">How to update the website</a></li> + <li><a href="/developers/patch-check-list.html">Patch check list</a></li> + <li><a href="/developers/github.html">Handling Github PRs</a></li> + <li><a href="/developers/how-to-release.html">How to release</a></li> + <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Basics<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/basics/algorithms.html">List of algorithms</a> + <li><a href="/users/basics/quickstart.html">Quickstart</a> + <li class="divider"></li> + <li class="nav-header">Working with text</li> + <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a> + <li><a href="/users/basics/collocations.html">Collocations</a> + <li class="divider"></li> + <li class="nav-header">Dimensionality reduction</li> + <li><a href="/users/dim-reduction/dimensional-reduction.html">Singular Value Decomposition</a></li> + <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li> + <li class="divider"></li> + <li class="nav-header">Topic Models</li> + <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Spark<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/sparkbindings/home.html">Scala & Spark Bindings Overview</a></li> + <li><a href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark Shell</a></li> + <li class="divider"></li> + <li><a href="/users/sparkbindings/faq.html">FAQ</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Classification<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a 
href="/users/mapreduce/classification/bayesian.html">Naive Bayes</a></li> + <li><a href="/users/mapreduce/classification/hidden-markov-models.html">Hidden Markov Models</a></li> + <li><a href="/users/mapreduce/classification/logistic-regression.html">Logistic Regression</a></li> + <li><a href="/users/mapreduce/classification/partial-implementation.html">Random Forest</a></li> + + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/classification/breiman-example.html">Breiman example</a></li> + <li><a href="/users/mapreduce/classification/twenty-newsgroups.html">20 newsgroups example</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Clustering<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/clustering/k-means-clustering.html">k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-clustering.html">Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li> + <li><a href="/users/mapreduce/clustering/streaming-k-means.html">Streaming KMeans</a></li> + <li><a href="/users/mapreduce/clustering/spectral-clustering.html">Spectral Clustering</a></li> + <li class="divider"></li> + <li class="nav-header">Commandline usage</li> + <li><a href="/users/mapreduce/clustering/k-means-commandline.html">Options for k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-commandline.html">Options for Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li> + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li> + <li class="divider"></li> + <li class="nav-header">Post processing</li> + <li><a href="/users/mapreduce/clustering/cluster-dumper.html">Cluster Dumper tool</a></li> + <li><a 
href="/users/mapreduce/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/recommender/quickstart.html">Quickstart</a></li> + <li><a href="/users/mapreduce/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li> + <li><a href="/users/mapreduce/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 minutes</a></li> + <li><a href="/users/mapreduce/recommender/matrix-factorization.html">Matrix factorization-based<br/> recommenders</a></li> + <li><a href="/users/mapreduce/recommender/recommender-documentation.html">Overview</a></li> + <li class="divider"></li> + <li class="nav-header">Hadoop</li> + <li><a href="/users/mapreduce/recommender/intro-itembased-hadoop.html">Intro to item-based recommendations<br/> with Hadoop</a></li> + <li><a href="/users/mapreduce/recommender/intro-als-hadoop.html">Intro to ALS recommendations<br/> with Hadoop</a></li> + <li class="nav-header">Spark</li> + <li><a href="/users/mapreduce/recommender/intro-cooccurrence-spark.html">Intro to cooccurrence-based<br/> recommendations with Spark</a></li> + </ul> + </li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + +</div> + + <div id="sidebar"> + <div id="sidebar-wrap"> + <h2>Twitter</h2> + <ul class="sidemenu"> + <li> +<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a> +<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script> +</li> + </ul> + <h2>Apache Software Foundation</h2> + <ul class="sidemenu"> + <li><a 
href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li> + <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li> + <li><a href="http://www.apache.org/dev/">Developer Resources</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + </ul> + <h2>Related Projects</h2> + <ul class="sidemenu"> + <li><a href="http://lucene.apache.org/">Lucene</a></li> + <li><a href="http://hadoop.apache.org/">Hadoop</a></li> + </ul> + </div> +</div> + + <div id="content-wrap" class="clearfix"> + <div id="main"> + <p>Below is a short tutorial on how to cluster Seinfeld episode transcripts with +Mahout.</p> +<p><a href="http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/">http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/</a></p> + </div> + </div> +</div> + <footer class="footer" align="center"> + <div class="container"> + <p> + Copyright © 2014 The Apache Software Foundation, Licensed under + the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br /> + Apache and the Apache feather logos are trademarks of The Apache Software Foundation. + </p> + </div> + </footer> + + <script src="/js/jquery-1.9.1.min.js"></script> + <script src="/js/bootstrap.min.js"></script> + <script> + (function() { + var cx = '012254517474945470291:vhsfv7eokdc'; + var gcse = document.createElement('script'); + gcse.type = 'text/javascript'; + gcse.async = true; + gcse.src = (document.location.protocol == 'https:' ? 
'https:' : 'http:') + + '//www.google.com/cse/cse.js?cx=' + cx; + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(gcse, s); + })(); + </script> +</body> +</html> Added: websites/staging/mahout/trunk/content/users/mapreduce/clustering/clusteringyourdata.html ============================================================================== --- websites/staging/mahout/trunk/content/users/mapreduce/clustering/clusteringyourdata.html (added) +++ websites/staging/mahout/trunk/content/users/mapreduce/clustering/clusteringyourdata.html Thu Mar 19 21:21:45 2015 @@ -0,0 +1,389 @@ +<!DOCTYPE html> +<!-- + + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+--> + +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> + <title>Apache Mahout: Scalable machine learning and data mining</title> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + <meta name="Distribution" content="Global"> + <meta name="Robots" content="index,follow"> + <meta name="keywords" content="apache, apache hadoop, apache lucene, + business data mining, cluster analysis, + collaborative filtering, data extraction, data filtering, data framework, data integration, + data matching, data mining, data mining algorithms, data mining analysis, data mining data, + data mining introduction, data mining software, + data mining techniques, data representation, data set, datamining, + feature extraction, fuzzy k means, genetic algorithm, hadoop, + hierarchical clustering, high dimensional, introduction to data mining, kmeans, + knowledge discovery, learning approach, learning approaches, learning methods, + learning techniques, lucene, machine learning, machine translation, mahout apache, + mahout taste, map reduce hadoop, mining data, mining methods, naive bayes, + natural language processing, + supervised, text mining, time series data, unsupervised, web data mining"> + <link rel="shortcut icon" type="image/x-icon" href="http://mahout.apache.org/images/favicon.ico"> + <script type="text/javascript" src="/js/prototype.js"></script> + <script type="text/javascript" src="/js/effects.js"></script> + <script type="text/javascript" src="/js/search.js"></script> + <script type="text/javascript" src="/js/slides.js"></script> + + <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen"> + <link href="/css/bootstrap-responsive.css" rel="stylesheet"> + <link rel="stylesheet" href="/css/global.css" type="text/css"> + + <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown --> + <script type="text/x-mathjax-config"> + MathJax.Hub.Config({ 
+ tex2jax: { + skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] + } + }); + MathJax.Hub.Queue(function() { + var all = MathJax.Hub.getAllJax(), i; + for(i = 0; i < all.length; i += 1) { + all[i].SourceElement().parentNode.className += ' has-jax'; + } + }); + </script> + <script type="text/javascript"> + var mathjax = document.createElement('script'); + mathjax.type = 'text/javascript'; + mathjax.async = true; + + mathjax.src = ('https:' == document.location.protocol) ? + 'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' : + 'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; + + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(mathjax, s); + </script> +</head> + +<body id="home" data-twttr-rendered="true"> + <div id="wrap"> + <div id="header"> + <div id="logo"><a href="/overview.html"></a></div> + <div id="search"> + <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right"> + <input value="http://mahout.apache.org" name="sitesearch" type="hidden"> + <input class="search-query" name="q" id="query" type="text"> + <input id="submission" type="image" src="/images/mahout-lupe.png" alt="Search" /> + </form> + </div> + + <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;"> + <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <!-- <a class="brand" href="#">Apache Community Development Project</a> --> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="/">Home</a></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" 
data-toggle="dropdown">General<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/general/downloads.html">Downloads</a> + <li><a href="/general/who-we-are.html">Who we are</a> + <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a> + <li><a href="/general/release-notes.html">Release Notes</a> + <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li> + <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a> + <li><a href="/general/professional-support.html">Professional Support</a> + <li class="divider"></li> + <li class="nav-header">Resources</li> + <li><a href="/general/reference-reading.html">Reference Reading</a> + <li><a href="/general/faq.html">FAQ</a> + <li class="divider"></li> + <li class="nav-header">Legal</li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + <li><a href="/general/privacy-policy.html">Privacy Policy</a> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/developers/developer-resources.html">Developer resources</a></li> + <li><a href="/developers/version-control.html">Version control</a></li> + <li><a href="/developers/buildingmahout.html">Build from source</a></li> + <li><a href="/developers/issue-tracker.html">Issue tracker</a></li> + <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li> + <li class="divider"></li> + <li class="nav-header">Contributions</li> + <li><a href="/developers/how-to-contribute.html">How to contribute</a></li> + <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li> + <li><a href="/developers/gsoc.html">GSoC</a></li> + <li class="divider"></li> + <li class="nav-header">For committers</li> + <li><a 
href="/developers/how-to-update-the-website.html">How to update the website</a></li> + <li><a href="/developers/patch-check-list.html">Patch check list</a></li> + <li><a href="/developers/github.html">Handling Github PRs</a></li> + <li><a href="/developers/how-to-release.html">How to release</a></li> + <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Basics<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/basics/algorithms.html">List of algorithms</a> + <li><a href="/users/basics/quickstart.html">Quickstart</a> + <li class="divider"></li> + <li class="nav-header">Working with text</li> + <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a> + <li><a href="/users/basics/collocations.html">Collocations</a> + <li class="divider"></li> + <li class="nav-header">Dimensionality reduction</li> + <li><a href="/users/dim-reduction/dimensional-reduction.html">Singular Value Decomposition</a></li> + <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li> + <li class="divider"></li> + <li class="nav-header">Topic Models</li> + <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Spark<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/sparkbindings/home.html">Scala & Spark Bindings Overview</a></li> + <li><a href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark Shell</a></li> + <li class="divider"></li> + <li><a href="/users/sparkbindings/faq.html">FAQ</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Classification<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a 
href="/users/mapreduce/classification/bayesian.html">Naive Bayes</a></li> + <li><a href="/users/mapreduce/classification/hidden-markov-models.html">Hidden Markov Models</a></li> + <li><a href="/users/mapreduce/classification/logistic-regression.html">Logistic Regression</a></li> + <li><a href="/users/mapreduce/classification/partial-implementation.html">Random Forest</a></li> + + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/classification/breiman-example.html">Breiman example</a></li> + <li><a href="/users/mapreduce/classification/twenty-newsgroups.html">20 newsgroups example</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Clustering<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/clustering/k-means-clustering.html">k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-clustering.html">Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li> + <li><a href="/users/mapreduce/clustering/streaming-k-means.html">Streaming KMeans</a></li> + <li><a href="/users/mapreduce/clustering/spectral-clustering.html">Spectral Clustering</a></li> + <li class="divider"></li> + <li class="nav-header">Commandline usage</li> + <li><a href="/users/mapreduce/clustering/k-means-commandline.html">Options for k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-commandline.html">Options for Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li> + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li> + <li class="divider"></li> + <li class="nav-header">Post processing</li> + <li><a href="/users/mapreduce/clustering/cluster-dumper.html">Cluster Dumper tool</a></li> + <li><a 
href="/users/mapreduce/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/recommender/quickstart.html">Quickstart</a></li> + <li><a href="/users/mapreduce/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li> + <li><a href="/users/mapreduce/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 minutes</a></li> + <li><a href="/users/mapreduce/recommender/matrix-factorization.html">Matrix factorization-based<br/> recommenders</a></li> + <li><a href="/users/mapreduce/recommender/recommender-documentation.html">Overview</a></li> + <li class="divider"></li> + <li class="nav-header">Hadoop</li> + <li><a href="/users/mapreduce/recommender/intro-itembased-hadoop.html">Intro to item-based recommendations<br/> with Hadoop</a></li> + <li><a href="/users/mapreduce/recommender/intro-als-hadoop.html">Intro to ALS recommendations<br/> with Hadoop</a></li> + <li class="nav-header">Spark</li> + <li><a href="/users/mapreduce/recommender/intro-cooccurrence-spark.html">Intro to cooccurrence-based<br/> recommendations with Spark</a></li> + </ul> + </li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + +</div> + + <div id="sidebar"> + <div id="sidebar-wrap"> + <h2>Twitter</h2> + <ul class="sidemenu"> + <li> +<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a> +<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script> +</li> + </ul> + <h2>Apache Software Foundation</h2> + <ul class="sidemenu"> + <li><a 
href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li> + <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li> + <li><a href="http://www.apache.org/dev/">Developer Resources</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + </ul> + <h2>Related Projects</h2> + <ul class="sidemenu"> + <li><a href="http://lucene.apache.org/">Lucene</a></li> + <li><a href="http://hadoop.apache.org/">Hadoop</a></li> + </ul> + </div> +</div> + + <div id="content-wrap" class="clearfix"> + <div id="main"> + <h1 id="clustering-your-data">Clustering your data</h1> +<p>After you've done the <a href="quickstart.html">Quickstart</a> and are familiar with the basics of Mahout, it is time to cluster your own +data. See also <a href="https://en.wikipedia.org/wiki/Cluster_analysis">Wikipedia on cluster analysis</a> for more background.</p> +<p>The following pieces <em>may</em> be useful in getting started:</p> +<p><a name="ClusteringYourData-Input"></a></p> +<h1 id="input">Input</h1> +<p>For starters, you will need your data in an appropriate Vector format; see <a href="../basics/creating-vectors.html">Creating Vectors</a>. 
+For text preparation in particular, see <a href="../basics/creating-vectors-from-text.html">Creating Vectors from Text</a>.</p> +<p><a name="ClusteringYourData-RunningtheProcess"></a></p> +<h1 id="running-the-process">Running the Process</h1> +<ul> +<li> +<p><a href="canopy-clustering.html">Canopy background</a> and <a href="canopy-commandline.html">canopy-commandline</a>.</p> +</li> +<li> +<p><a href="k-means-clustering.html">K-Means background</a>, <a href="k-means-commandline.html">k-means-commandline</a>, and +<a href="fuzzy-k-means-commandline.html">fuzzy-k-means-commandline</a>.</p> +</li> +<li> +<p><a href="dirichlet-process-clustering.html">Dirichlet background</a> and <a href="dirichlet-commandline.html">dirichlet-commandline</a>.</p> +</li> +<li> +<p><a href="mean-shift-clustering.html">Meanshift background</a> and <a href="mean-shift-commandline.html">mean-shift-commandline</a>.</p> +</li> +<li> +<p><a href="-latent-dirichlet-allocation.html">LDA (Latent Dirichlet Allocation) background</a> and <a href="lda-commandline.html">lda-commandline</a>.</p> +</li> +<li> +<p>TODO: kmeans++/ streaming kMeans documentation</p> +</li> +</ul> +<p><a name="ClusteringYourData-RetrievingtheOutput"></a></p> +<h1 id="retrieving-the-output">Retrieving the Output</h1> +<p>Mahout has a cluster dumper utility that can be used to retrieve and evaluate your clustering data.</p> +<div class="codehilite"><pre><span class="o">./</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">clusterdump</span> <span class="o"><</span><span class="n">OPTIONS</span><span class="o">></span> +</pre></div> + + +<p><a name="ClusteringYourData-Theclusterdumperoptionsare:"></a></p> +<h2 id="the-cluster-dumper-options-are">The cluster dumper options are:</h2> +<div class="codehilite"><pre> <span class="o">--</span><span class="n">help</span> <span class="p">(</span><span class="o">-</span><span class="n">h</span><span class="p">)</span> <span 
class="n">Print</span> <span class="n">out</span> <span class="n">help</span> + + <span class="o">--</span><span class="n">input</span> <span class="p">(</span><span class="o">-</span><span class="nb">i</span><span class="p">)</span> <span class="n">input</span> <span class="n">The</span> <span class="n">directory</span> <span class="n">containing</span> <span class="n">Sequence</span> + <span class="n">Files</span> <span class="k">for</span> <span class="n">the</span> <span class="n">Clusters</span> + + <span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span> <span class="n">The</span> <span class="n">output</span> <span class="n">file</span><span class="p">.</span> <span class="n">If</span> <span class="n">not</span> <span class="n">specified</span><span class="p">,</span> + <span class="n">dumps</span> <span class="n">to</span> <span class="n">the</span> <span class="n">console</span><span class="p">.</span> + + <span class="o">--</span><span class="n">outputFormat</span> <span class="p">(</span><span class="o">-</span><span class="n">of</span><span class="p">)</span> <span class="n">outputFormat</span> <span class="n">The</span> <span class="n">optional</span> <span class="n">output</span> <span class="n">format</span> <span class="n">to</span> <span class="n">write</span> + <span class="n">the</span> <span class="n">results</span> <span class="n">as</span><span class="p">.</span> <span class="n">Options</span><span class="p">:</span> <span class="n">TEXT</span><span class="p">,</span> <span class="n">CSV</span><span class="p">,</span> <span class="n">or</span> <span class="n">GRAPH_ML</span> + + <span class="o">--</span><span class="n">substring</span> <span class="p">(</span><span class="o">-</span><span class="n">b</span><span class="p">)</span> <span class="n">substring</span> <span class="n">The</span> <span class="n">number</span> 
<span class="n">of</span> <span class="n">chars</span> <span class="n">of</span> <span class="n">the</span> + <span class="n">asFormatString</span><span class="p">()</span> <span class="n">to</span> <span class="n">print</span> + + <span class="o">--</span><span class="n">pointsDir</span> <span class="p">(</span><span class="o">-</span><span class="n">p</span><span class="p">)</span> <span class="n">pointsDir</span> <span class="n">The</span> <span class="n">directory</span> <span class="n">containing</span> <span class="n">points</span> + <span class="n">sequence</span> <span class="n">files</span> <span class="n">mapping</span> <span class="n">input</span> <span class="n">vectors</span> <span class="n">to</span> <span class="n">their</span> <span class="n">cluster</span><span class="p">.</span> <span class="n">If</span> <span class="n">specified</span><span class="p">,</span> + <span class="n">then</span> <span class="n">the</span> <span class="n">program</span> <span class="n">will</span> <span class="n">output</span> <span class="n">the</span> + <span class="n">points</span> <span class="n">associated</span> <span class="n">with</span> <span class="n">a</span> <span class="n">cluster</span> + + <span class="o">--</span><span class="n">dictionary</span> <span class="p">(</span><span class="o">-</span><span class="n">d</span><span class="p">)</span> <span class="n">dictionary</span> <span class="n">The</span> <span class="n">dictionary</span> <span class="n">file</span><span class="p">.</span> + + <span class="o">--</span><span class="n">dictionaryType</span> <span class="p">(</span><span class="o">-</span><span class="n">dt</span><span class="p">)</span> <span class="n">dictionaryType</span> <span class="n">The</span> <span class="n">dictionary</span> <span class="n">file</span> <span class="n">type</span> + <span class="p">(</span><span class="n">text</span><span class="o">|</span><span class="n">sequencefile</span><span class="p">)</span> + + <span 
class="o">--</span><span class="n">distanceMeasure</span> <span class="p">(</span><span class="o">-</span><span class="n">dm</span><span class="p">)</span> <span class="n">distanceMeasure</span> <span class="n">The</span> <span class="n">classname</span> <span class="n">of</span> <span class="n">the</span> <span class="n">DistanceMeasure</span><span class="p">.</span> + <span class="n">Default</span> <span class="n">is</span> <span class="n">SquaredEuclidean</span><span class="p">.</span> + + <span class="o">--</span><span class="n">numWords</span> <span class="p">(</span><span class="o">-</span><span class="n">n</span><span class="p">)</span> <span class="n">numWords</span> <span class="n">The</span> <span class="n">number</span> <span class="n">of</span> <span class="n">top</span> <span class="n">terms</span> <span class="n">to</span> <span class="n">print</span> + + <span class="o">--</span><span class="n">tempDir</span> <span class="n">tempDir</span> <span class="n">Intermediate</span> <span class="n">output</span> <span class="n">directory</span> + + <span class="o">--</span><span class="n">startPhase</span> <span class="n">startPhase</span> <span class="n">First</span> <span class="n">phase</span> <span class="n">to</span> <span class="n">run</span> + + <span class="o">--</span><span class="n">endPhase</span> <span class="n">endPhase</span> <span class="n">Last</span> <span class="n">phase</span> <span class="n">to</span> <span class="n">run</span> + + <span class="o">--</span><span class="n">evaluate</span> <span class="p">(</span><span class="o">-</span><span class="n">e</span><span class="p">)</span> <span class="n">Run</span> <span class="n">ClusterEvaluator</span> <span class="n">and</span> <span class="n">CDbwEvaluator</span> <span class="n">over</span> <span class="n">the</span> + <span class="n">input</span><span class="p">.</span> <span class="n">The</span> <span class="n">output</span> <span class="n">will</span> <span class="n">be</span> <span 
class="n">appended</span> <span class="n">to</span> <span class="n">the</span> <span class="n">rest</span> <span class="n">of</span> + <span class="n">the</span> <span class="n">output</span> <span class="n">at</span> <span class="n">the</span> <span class="k">end</span><span class="p">.</span> +</pre></div> + + +<p>More information on using the clusterdump utility can be found <a href="cluster-dumper.html">here</a>.</p> +<p><a name="ClusteringYourData-ValidatingtheOutput"></a></p> +<h1 id="validating-the-output">Validating the Output</h1> +<p>Ted Dunning: A principled approach to cluster evaluation is to measure how well the +cluster membership captures the structure of unseen data. A natural +measure for this is to measure how much of the entropy of the data is +captured by cluster membership. For k-means and its natural L_2 metric, +the natural cluster quality metric is the squared distance from the nearest +centroid adjusted by the log_2 of the number of clusters. This can be +compared to the squared magnitude of the original data or the squared +deviation from the centroid for all of the data. The idea is that you are +changing the representation of the data by allocating some of the bits in +your original representation to represent which cluster each point is in. +If those bits aren't made up by the residue being small then your +clustering is making a bad trade-off.</p> +<p>In the past, I have used other more heuristic measures as well. One of the +key characteristics that I would like to see out of a clustering is a +degree of stability. Thus, I look at the fractions of points that are +assigned to each cluster or the distribution of distances from the cluster +centroid. These values should be relatively stable when applied to held-out +data.</p> +<p>For text, you can actually compute perplexity which measures how well +cluster membership predicts what words are used. 
This is nice because you +don't have to worry about the entropy of real valued numbers.</p> +<p>Manual inspection and the so-called laugh test is also important. The idea +is that the results should not be so ludicrous as to make you laugh. +Unfortunately, it is pretty easy to kid yourself into thinking your system +is working using this kind of inspection. The problem is that we are too +good at seeing (making up) patterns.</p> + </div> + </div> +</div> + <footer class="footer" align="center"> + <div class="container"> + <p> + Copyright © 2014 The Apache Software Foundation, Licensed under + the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br /> + Apache and the Apache feather logos are trademarks of The Apache Software Foundation. + </p> + </div> + </footer> + + <script src="/js/jquery-1.9.1.min.js"></script> + <script src="/js/bootstrap.min.js"></script> + <script> + (function() { + var cx = '012254517474945470291:vhsfv7eokdc'; + var gcse = document.createElement('script'); + gcse.type = 'text/javascript'; + gcse.async = true; + gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') + + '//www.google.com/cse/cse.js?cx=' + cx; + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(gcse, s); + })(); + </script> +</body> +</html> Added: websites/staging/mahout/trunk/content/users/mapreduce/clustering/expectation-maximization.html ============================================================================== --- websites/staging/mahout/trunk/content/users/mapreduce/clustering/expectation-maximization.html (added) +++ websites/staging/mahout/trunk/content/users/mapreduce/clustering/expectation-maximization.html Thu Mar 19 21:21:45 2015 @@ -0,0 +1,323 @@ +<!DOCTYPE html> +<!-- + + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. 
See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> + <title>Apache Mahout: Scalable machine learning and data mining</title> + <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> + <meta name="Distribution" content="Global"> + <meta name="Robots" content="index,follow"> + <meta name="keywords" content="apache, apache hadoop, apache lucene, + business data mining, cluster analysis, + collaborative filtering, data extraction, data filtering, data framework, data integration, + data matching, data mining, data mining algorithms, data mining analysis, data mining data, + data mining introduction, data mining software, + data mining techniques, data representation, data set, datamining, + feature extraction, fuzzy k means, genetic algorithm, hadoop, + hierarchical clustering, high dimensional, introduction to data mining, kmeans, + knowledge discovery, learning approach, learning approaches, learning methods, + learning techniques, lucene, machine learning, machine translation, mahout apache, + mahout taste, map reduce hadoop, mining data, mining methods, naive bayes, + natural language processing, + supervised, text mining, time series data, unsupervised, web data mining"> + <link rel="shortcut 
icon" type="image/x-icon" href="http://mahout.apache.org/images/favicon.ico"> + <script type="text/javascript" src="/js/prototype.js"></script> + <script type="text/javascript" src="/js/effects.js"></script> + <script type="text/javascript" src="/js/search.js"></script> + <script type="text/javascript" src="/js/slides.js"></script> + + <link href="/css/bootstrap.min.css" rel="stylesheet" media="screen"> + <link href="/css/bootstrap-responsive.css" rel="stylesheet"> + <link rel="stylesheet" href="/css/global.css" type="text/css"> + + <!-- mathJax stuff -- use `\(...\)` for inline style math in markdown --> + <script type="text/x-mathjax-config"> + MathJax.Hub.Config({ + tex2jax: { + skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] + } + }); + MathJax.Hub.Queue(function() { + var all = MathJax.Hub.getAllJax(), i; + for(i = 0; i < all.length; i += 1) { + all[i].SourceElement().parentNode.className += ' has-jax'; + } + }); + </script> + <script type="text/javascript"> + var mathjax = document.createElement('script'); + mathjax.type = 'text/javascript'; + mathjax.async = true; + + mathjax.src = ('https:' == document.location.protocol) ? 
+ 'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' : + 'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; + + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(mathjax, s); + </script> +</head> + +<body id="home" data-twttr-rendered="true"> + <div id="wrap"> + <div id="header"> + <div id="logo"><a href="/overview.html"></a></div> + <div id="search"> + <form id="search-form" action="http://www.google.com/search" method="get" class="navbar-search pull-right"> + <input value="http://mahout.apache.org" name="sitesearch" type="hidden"> + <input class="search-query" name="q" id="query" type="text"> + <input id="submission" type="image" src="/images/mahout-lupe.png" alt="Search" /> + </form> + </div> + + <div class="navbar navbar-inverse" style="position:absolute;top:133px;padding-right:0px;padding-left:0px;"> + <div class="navbar-inner" style="border: none; background: #999; border: none; border-radius: 0px;"> + <div class="container"> + <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <!-- <a class="brand" href="#">Apache Community Development Project</a> --> + <div class="nav-collapse collapse"> + <ul class="nav"> + <li><a href="/">Home</a></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">General<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/general/downloads.html">Downloads</a> + <li><a href="/general/who-we-are.html">Who we are</a> + <li><a href="/general/mailing-lists,-irc-and-archives.html">Mailing Lists</a> + <li><a href="/general/release-notes.html">Release Notes</a> + <li><a href="/general/books-tutorials-and-talks.html">Books, Tutorials, Talks</a></li> + <li><a href="/general/powered-by-mahout.html">Powered By Mahout</a> + <li><a 
href="/general/professional-support.html">Professional Support</a> + <li class="divider"></li> + <li class="nav-header">Resources</li> + <li><a href="/general/reference-reading.html">Reference Reading</a> + <li><a href="/general/faq.html">FAQ</a> + <li class="divider"></li> + <li class="nav-header">Legal</li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + <li><a href="/general/privacy-policy.html">Privacy Policy</a> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Developers<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/developers/developer-resources.html">Developer resources</a></li> + <li><a href="/developers/version-control.html">Version control</a></li> + <li><a href="/developers/buildingmahout.html">Build from source</a></li> + <li><a href="/developers/issue-tracker.html">Issue tracker</a></li> + <li><a href="https://builds.apache.org/job/Mahout-Quality/" target="_blank">Code quality reports</a></li> + <li class="divider"></li> + <li class="nav-header">Contributions</li> + <li><a href="/developers/how-to-contribute.html">How to contribute</a></li> + <li><a href="/developers/how-to-become-a-committer.html">How to become a committer</a></li> + <li><a href="/developers/gsoc.html">GSoC</a></li> + <li class="divider"></li> + <li class="nav-header">For committers</li> + <li><a href="/developers/how-to-update-the-website.html">How to update the website</a></li> + <li><a href="/developers/patch-check-list.html">Patch check list</a></li> + <li><a href="/developers/github.html">Handling Github PRs</a></li> + <li><a href="/developers/how-to-release.html">How to release</a></li> + <li><a href="/developers/thirdparty-dependencies.html">Third party dependencies</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Basics<b class="caret"></b></a> + <ul 
class="dropdown-menu"> + <li><a href="/users/basics/algorithms.html">List of algorithms</a> + <li><a href="/users/basics/quickstart.html">Quickstart</a> + <li class="divider"></li> + <li class="nav-header">Working with text</li> + <li><a href="/users/basics/creating-vectors-from-text.html">Creating vectors from text</a> + <li><a href="/users/basics/collocations.html">Collocations</a> + <li class="divider"></li> + <li class="nav-header">Dimensionality reduction</li> + <li><a href="/users/dim-reduction/dimensional-reduction.html">Singular Value Decomposition</a></li> + <li><a href="/users/dim-reduction/ssvd.html">Stochastic SVD</a></li> + <li class="divider"></li> + <li class="nav-header">Topic Models</li> + <li><a href="/users/clustering/latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Spark<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/sparkbindings/home.html">Scala & Spark Bindings Overview</a></li> + <li><a href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark Shell</a></li> + <li class="divider"></li> + <li><a href="/users/sparkbindings/faq.html">FAQ</a></li> + </ul> + </li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Classification<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/classification/bayesian.html">Naive Bayes</a></li> + <li><a href="/users/mapreduce/classification/hidden-markov-models.html">Hidden Markov Models</a></li> + <li><a href="/users/mapreduce/classification/logistic-regression.html">Logistic Regression</a></li> + <li><a href="/users/mapreduce/classification/partial-implementation.html">Random Forest</a></li> + + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/classification/breiman-example.html">Breiman example</a></li> + <li><a 
href="/users/mapreduce/classification/twenty-newsgroups.html">20 newsgroups example</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Clustering<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/clustering/k-means-clustering.html">k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-clustering.html">Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means.html">Fuzzy k-Means</a></li> + <li><a href="/users/mapreduce/clustering/streaming-k-means.html">Streaming KMeans</a></li> + <li><a href="/users/mapreduce/clustering/spectral-clustering.html">Spectral Clustering</a></li> + <li class="divider"></li> + <li class="nav-header">Commandline usage</li> + <li><a href="/users/mapreduce/clustering/k-means-commandline.html">Options for k-Means</a></li> + <li><a href="/users/mapreduce/clustering/canopy-commandline.html">Options for Canopy</a></li> + <li><a href="/users/mapreduce/clustering/fuzzy-k-means-commandline.html">Options for Fuzzy k-Means</a></li> + <li class="divider"></li> + <li class="nav-header">Examples</li> + <li><a href="/users/mapreduce/clustering/clustering-of-synthetic-control-data.html">Synthetic data</a></li> + <li class="divider"></li> + <li class="nav-header">Post processing</li> + <li><a href="/users/mapreduce/clustering/cluster-dumper.html">Cluster Dumper tool</a></li> + <li><a href="/users/mapreduce/clustering/visualizing-sample-clusters.html">Cluster visualisation</a></li> + </ul></li> + <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Recommendations<b class="caret"></b></a> + <ul class="dropdown-menu"> + <li><a href="/users/mapreduce/recommender/quickstart.html">Quickstart</a></li> + <li><a href="/users/mapreduce/recommender/recommender-first-timer-faq.html">First Timer FAQ</a></li> + <li><a href="/users/mapreduce/recommender/userbased-5-minutes.html">A user-based recommender <br/>in 5 
minutes</a></li> + <li><a href="/users/mapreduce/recommender/matrix-factorization.html">Matrix factorization-based<br/> recommenders</a></li> + <li><a href="/users/mapreduce/recommender/recommender-documentation.html">Overview</a></li> + <li class="divider"></li> + <li class="nav-header">Hadoop</li> + <li><a href="/users/mapreduce/recommender/intro-itembased-hadoop.html">Intro to item-based recommendations<br/> with Hadoop</a></li> + <li><a href="/users/mapreduce/recommender/intro-als-hadoop.html">Intro to ALS recommendations<br/> with Hadoop</a></li> + <li class="nav-header">Spark</li> + <li><a href="/users/mapreduce/recommender/intro-cooccurrence-spark.html">Intro to cooccurrence-based<br/> recommendations with Spark</a></li> + </ul> + </li> + </ul> + </div><!--/.nav-collapse --> + </div> + </div> + </div> + +</div> + + <div id="sidebar"> + <div id="sidebar-wrap"> + <h2>Twitter</h2> + <ul class="sidemenu"> + <li> +<a class="twitter-timeline" href="https://twitter.com/ApacheMahout" data-widget-id="422861673444028416">Tweets by @ApacheMahout</a> +<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+"://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");</script> +</li> + </ul> + <h2>Apache Software Foundation</h2> + <ul class="sidemenu"> + <li><a href="http://www.apache.org/foundation/how-it-works.html">How the ASF works</a></li> + <li><a href="http://www.apache.org/foundation/getinvolved.html">Get Involved</a></li> + <li><a href="http://www.apache.org/dev/">Developer Resources</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + </ul> + <h2>Related Projects</h2> + <ul class="sidemenu"> + <li><a href="http://lucene.apache.org/">Lucene</a></li> + <li><a 
href="http://hadoop.apache.org/">Hadoop</a></li> + </ul> + </div> +</div> + + <div id="content-wrap" class="clearfix"> + <div id="main"> + <p><a name="ExpectationMaximization-ExpectationMaximization"></a></p> +<h1 id="expectation-maximization">Expectation Maximization</h1> +<p>The principle of EM can be applied to several learning settings, but is +most commonly associated with clustering. The main principle of the +algorithm is comparable to k-Means. Yet in contrast to hard cluster +assignments, each object is given some probability of belonging to a cluster. +Accordingly, cluster centers are recomputed based on the average of all +objects weighted by their probability of belonging to the cluster at hand.</p> +<p><a name="ExpectationMaximization-Canopy-modifiedEM"></a></p> +<h2 id="canopy-modified-em">Canopy-modified EM</h2> +<p>One can also use the canopies idea to speed up prototype-based clustering +methods like K-means and Expectation-Maximization (EM). In general, neither +K-means nor EM specifies how many clusters to use. The canopies technique does +not help with this choice.</p> +<p>Prototypes (our estimates of the cluster centroids) are associated with the +canopies that contain them, and the prototypes are only influenced by data +that are inside their associated canopies. After creating the canopies, we +decide how many prototypes will be created for each canopy. This could be +done, for example, using the number of data points in a canopy and AIC or +BIC, where points that occur in more than one canopy are counted +fractionally. Then we place prototypes into each canopy. This initial +placement can be random, as long as it is within the canopy in question, as +determined by the inexpensive distance metric.</p> +<p>Then, instead of calculating the distance from each prototype to every +point (as is traditional, an O(nk) operation), the E-step calculates +the distance from each prototype to a much smaller number of points.
For +each prototype, we find the canopies that contain it (using the cheap +distance metric), and only calculate distances (using the expensive +distance metric) from that prototype to points within those canopies.</p> +<p>Note that by this procedure prototypes may move across canopy boundaries +when canopies overlap. Prototypes may move to cover the data in the +overlapping region, and then move entirely into another canopy in order to +cover data there.</p> +<p>The canopy-modified EM algorithm behaves very similarly to traditional EM, +with the slight difference that points outside the canopy have no influence +on points in the canopy, rather than a minute influence. If the canopy +property holds, and points in the same cluster fall in the same canopy, +then the canopy-modified EM will almost always converge to the same maximum +in likelihood as the traditional EM. In fact, the difference in each +iterative step (apart from the enormous computational savings of computing +fewer terms) will be negligible since points outside the canopy will have +exponentially small influence.</p> +<p><a name="ExpectationMaximization-StrategyforParallelization"></a></p> +<h2 id="strategy-for-parallelization">Strategy for Parallelization</h2> +<p><a name="ExpectationMaximization-Map/ReduceImplementation"></a></p> +<h2 id="mapreduce-implementation">Map/Reduce Implementation</h2> + </div> + </div> +</div> + <footer class="footer" align="center"> + <div class="container"> + <p> + Copyright © 2014 The Apache Software Foundation, Licensed under + the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>. + <br /> + Apache and the Apache feather logos are trademarks of The Apache Software Foundation. 
+ </p> + </div> + </footer> + + <script src="/js/jquery-1.9.1.min.js"></script> + <script src="/js/bootstrap.min.js"></script> + <script> + (function() { + var cx = '012254517474945470291:vhsfv7eokdc'; + var gcse = document.createElement('script'); + gcse.type = 'text/javascript'; + gcse.async = true; + gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') + + '//www.google.com/cse/cse.js?cx=' + cx; + var s = document.getElementsByTagName('script')[0]; + s.parentNode.insertBefore(gcse, s); + })(); + </script> +</body> +</html>
