http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/completed/sparkbindings/faq.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/sparkbindings/faq.md b/website/old_site_migration/completed/sparkbindings/faq.md deleted file mode 100644 index 9649e3b..0000000 --- a/website/old_site_migration/completed/sparkbindings/faq.md +++ /dev/null @@ -1,52 +0,0 @@ ---- -layout: default -title: FAQ -theme: - name: retro-mahout ---- - -# FAQ for using Mahout with Spark - -**Q: Mahout Spark shell doesn't start; "ClassNotFound" problems or various classpath problems.** - -**A:** As of this writing, all reported problems starting the Spark shell in Mahout have revolved -around classpath issues one way or another. - -If you are getting errors that look like method signature mismatches, you most probably have a mismatch between Mahout's Spark dependency -and the Spark version actually installed. (At the time of this writing HEAD depends on Spark 1.1.0, but check mahout/pom.xml.) - -Troubleshooting general classpath issues is pretty straightforward. Since Mahout uses Spark's installation -and its classpath, as reported by Spark itself, for Spark-related dependencies, it is important to make sure -the classpath is sane and is made available to Mahout: - -1. Check that Spark is the correct version (same as in Mahout's poms), is compiled, and that SPARK_HOME is set. -2. Check that Mahout is compiled and MAHOUT_HOME is set. -3. Run `$SPARK_HOME/bin/compute-classpath.sh` and make sure it produces a sane result with no errors. -If it outputs something other than a straightforward classpath string, most likely Spark is not compiled or set up correctly (later Spark versions require -`sbt/sbt assembly` to be run; simply running `sbt/sbt publish-local` is no longer enough). -4. Run `$MAHOUT_HOME/bin/mahout -spark classpath` and check that the path reported in step (3) is included. 
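Steps (3) and (4) above can be sketched as a small script. This is only an illustration, not part of Mahout: the helper name `missing_entries` is hypothetical, and the sketch assumes that both commands print a single colon-separated classpath string on standard output.

```shell
#!/bin/sh
# Sketch: report Spark classpath entries that Mahout's classpath lacks.
# missing_entries is a hypothetical helper, not a Mahout or Spark tool.

# Print each colon-separated entry of $1 that does not occur in $2.
# The body runs in a subshell so that set -f and IFS do not leak out.
missing_entries() (
  set -f                      # keep wildcard entries such as lib/* literal
  IFS=':'
  for entry in $1; do
    [ -n "$entry" ] || continue
    case ":$2:" in
      *":$entry:"*) ;;        # entry is present, nothing to report
      *) printf '%s\n' "$entry" ;;
    esac
  done
)

if [ -n "${SPARK_HOME:-}" ] && [ -n "${MAHOUT_HOME:-}" ]; then
  # Assumption: each command prints only the classpath string.
  if spark_cp=$("$SPARK_HOME/bin/compute-classpath.sh") &&
     mahout_cp=$("$MAHOUT_HOME/bin/mahout" -spark classpath); then
    missing=$(missing_entries "$spark_cp" "$mahout_cp")
    if [ -n "$missing" ]; then
      echo "Spark entries missing from Mahout's classpath:"
      echo "$missing"
    else
      echo "Mahout's classpath includes Spark's."
    fi
  else
    echo "One of the classpath commands failed; check the Spark and Mahout builds."
  fi
else
  echo "Set SPARK_HOME and MAHOUT_HOME to run the check."
fi
```

An empty report only means the entries match textually; a version mismatch between the installed Spark and the one in Mahout's poms still has to be checked separately, as in step (1).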
- -**Q: I am using the command line Mahout jobs that run on Spark or am writing my own application that uses -Mahout's Spark code. When I run the code on my cluster I get ClassNotFound or signature errors during serialization. -What's wrong?** - -**A:** The Spark artifacts in the Maven ecosystem may not match the exact binary you are running on your cluster. This may -cause class name or version mismatches. In this case you may wish -to build Spark yourself to guarantee that you are running exactly what you are building Mahout against. To do this, follow these steps -in order: - -1. Build Spark with Maven, but **do not** use the "package" target as described on the Spark site. Build with the "clean install" target instead. -Something like: "mvn clean install -Dhadoop1.2.1" or whatever your particular build options are. This will put the jars for Spark -in the local Maven cache. -2. Deploy **your** Spark build to your cluster and test it there. -3. Build Mahout. This will cause Maven to pull the jars for Spark from the local Maven cache and may resolve missing -or mis-identified classes. -4. If you are building your own code, do so against the local builds of Spark and Mahout. - -**Q: The implicit SparkContext 'sc' does not work in the Mahout spark-shell.** - -**A:** In the Mahout spark-shell the SparkContext is called 'sdc', where the 'd' stands for distributed. - - - -
http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/completed/sparkbindings/home.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/sparkbindings/home.md b/website/old_site_migration/completed/sparkbindings/home.md deleted file mode 100644 index 5075612..0000000 --- a/website/old_site_migration/completed/sparkbindings/home.md +++ /dev/null @@ -1,101 +0,0 @@ ---- -layout: default -title: Spark Bindings -theme: - name: retro-mahout ---- - -# Scala & Spark Bindings: -*Bringing algebraic semantics* - -## What is Scala & Spark Bindings? - -In short, Scala & Spark Bindings for Mahout is a Scala DSL and algebraic optimizer for expressions like this (actual formula from **(d)spca**) - - -`\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]` - -bound to in-core and distributed computations (currently, on Apache Spark). - - -Mahout Scala & Spark Bindings expression of the above: - - val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi) - -The main idea is that a scientist writing algebraic expressions does not need to care about distributed -operation plans and works **entirely on the logical level**, just like he or she would do with R. - -Another idea is decoupling the logical expression from the distributed back-end. As more back-ends are added, -this implies **"write once, run everywhere"**. - -The linear algebra side works with scalars, in-core vectors and matrices, and Mahout Distributed -Row Matrices (DRMs). - -The ecosystem of operators is built in R's image, i.e. it follows R naming such as %*%, -colSums, nrow, length operating over vectors or matrices. - -An important part of Spark Bindings is the expression optimizer. It looks at an expression as a whole -and figures out how it can be simplified, and which physical operators should be picked. 
For example, -there are currently about 5 different physical operators performing DRM-DRM multiplication, -picked based on matrix geometry, distributed dataset partitioning, orientation etc. -If we count in DRM by in-core combinations, that would be another 4, i.e. 9 total -- all of it for just the -simple x %*% y logical notation. - - - -Please refer to the documentation for details. - -## Status - -This environment addresses mostly R-like Linear Algebra optimizations for -Spark, Flink and H2O. - - -## Documentation - -* Scala and Spark bindings manual: [web](http://apache.github.io/mahout/doc/ScalaSparkBindings.html), [pdf](ScalaSparkBindings.pdf) -* Overview blog on 0.10.x releases: [blog](http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html) - -## Distributed methods and solvers using Bindings - -* In-core ([ssvd]) and Distributed ([dssvd]) Stochastic SVD -- guinea pigs -- see the bindings manual -* In-core ([spca]) and Distributed ([dspca]) Stochastic PCA -- guinea pigs -- see the bindings manual -* Distributed thin QR decomposition ([dqrThin]) -- guinea pig -- see the bindings manual -* [Current list of algorithms](https://mahout.apache.org/users/basics/algorithms.html) - -[ssvd]: https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala -[spca]: https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala -[dssvd]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSSVD.scala -[dspca]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSPCA.scala -[dqrThin]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DQR.scala - - -## Related history of note - -* CLI and Driver for Spark version of item similarity -- 
[MAHOUT-1541](https://issues.apache.org/jira/browse/MAHOUT-1541) -* Command line interface for generalizable Spark pipelines -- [MAHOUT-1569](https://issues.apache.org/jira/browse/MAHOUT-1569) -* Cooccurrence Analysis / Item-based Recommendation -- [MAHOUT-1464](https://issues.apache.org/jira/browse/MAHOUT-1464) -* Spark Bindings -- [MAHOUT-1346](https://issues.apache.org/jira/browse/MAHOUT-1346) -* Scala Bindings -- [MAHOUT-1297](https://issues.apache.org/jira/browse/MAHOUT-1297) -* Interactive Scala & Spark Bindings Shell & Script processor -- [MAHOUT-1489](https://issues.apache.org/jira/browse/MAHOUT-1489) -* OLS tutorial using Mahout shell -- [MAHOUT-1542](https://issues.apache.org/jira/browse/MAHOUT-1542) -* Full abstraction of DRM apis and algorithms from a distributed engine -- [MAHOUT-1529](https://issues.apache.org/jira/browse/MAHOUT-1529) -* Port Naive Bayes -- [MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493) - -## Work in progress -* Text-delimited files for input and output -- [MAHOUT-1568](https://issues.apache.org/jira/browse/MAHOUT-1568) -<!-- * Weighted (Implicit Feedback) ALS -- [MAHOUT-1365](https://issues.apache.org/jira/browse/MAHOUT-1365) --> -<!--* Data frame R-like bindings -- [MAHOUT-1490](https://issues.apache.org/jira/browse/MAHOUT-1490) --> - -* *Your issue here!* - -<!-- ## Stuff wanted: -* Data frame R-like bindings (similarly to linalg bindings) -* Stat R-like bindings (perhaps we can just adapt to commons.math stat) -* **BYODMs:** Bring Your Own Distributed Method on SparkBindings! 
-* In-core jBlas matrix adapter -* In-core GPU matrix adapters --> - - - - \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/completed/sparkbindings/play-with-shell.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/sparkbindings/play-with-shell.md b/website/old_site_migration/completed/sparkbindings/play-with-shell.md deleted file mode 100644 index 3cdb8f7..0000000 --- a/website/old_site_migration/completed/sparkbindings/play-with-shell.md +++ /dev/null @@ -1,199 +0,0 @@ ---- -layout: default -title: Perceptron and Winnow -theme: - name: retro-mahout ---- -# Playing with Mahout's Spark Shell - -This tutorial will show you how to play with Mahout's scala DSL for linear algebra and its Spark shell. **Please keep in mind that this code is still in a very early experimental stage**. - -_(Edited for 0.10.2)_ - -## Intro - -We'll use an excerpt of a publicly available [dataset about cereals](http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html). The dataset tells the protein, fat, carbohydrate and sugars (in milligrams) contained in a set of cereals, as well as a customer rating for the cereals. Our aim for this example is to fit a linear model which infers the customer rating from the ingredients. 
- - -Name | protein | fat | carbo | sugars | rating -:-----------------------|:--------|:----|:------|:-------|:--------- -Apple Cinnamon Cheerios | 2 | 2 | 10.5 | 10 | 29.509541 -Cap'n'Crunch | 1 | 2 | 12 | 12 | 18.042851 -Cocoa Puffs | 1 | 1 | 12 | 13 | 22.736446 -Froot Loops | 2 | 1 | 11 | 13 | 32.207582 -Honey Graham Ohs | 1 | 2 | 12 | 11 | 21.871292 -Wheaties Honey Gold | 2 | 1 | 16 | 8 | 36.187559 -Cheerios | 6 | 2 | 17 | 1 | 50.764999 -Clusters | 3 | 2 | 13 | 7 | 40.400208 -Great Grains Pecan | 3 | 3 | 13 | 4 | 45.811716 - - -## Installing Mahout & Spark on your local machine - -We describe how to do a quick toy setup of Spark & Mahout on your local machine, so that you can run this example and play with the shell. - - 1. Download [Apache Spark 1.6.2](http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz) and unpack the archive file - 1. Change to the directory where you unpacked Spark and type ```sbt/sbt assembly``` to build it - 1. Create a directory for Mahout somewhere on your machine, change to it and check out the master branch of Apache Mahout from GitHub: ```git clone https://github.com/apache/mahout mahout``` - 1. Change to the ```mahout``` directory and build Mahout using ```mvn -DskipTests clean install``` - -## Starting Mahout's Spark shell - - 1. Go to the directory where you unpacked Spark and type ```sbin/start-all.sh``` to locally start Spark - 1. Open a browser and point it to [http://localhost:8080/](http://localhost:8080/) to check whether Spark started successfully. Copy the URL of the Spark master at the top of the page (it starts with **spark://**) - 1. Define the following environment variables: <pre class="codehilite">export MAHOUT_HOME=[directory into which you checked out Mahout] -export SPARK_HOME=[directory where you unpacked Spark] -export MASTER=[url of the Spark master] -</pre> - 1. 
Finally, change to the directory where you unpacked Mahout and type ```bin/mahout spark-shell```, -you should see the shell starting and get the prompt ```mahout> ```. Check -[FAQ](http://mahout.apache.org/users/sparkbindings/faq.html) for further troubleshooting. - -## Implementation - -We'll use the shell to interactively play with the data and incrementally implement a simple [linear regression](https://en.wikipedia.org/wiki/Linear_regression) algorithm. Let's first load the dataset. Usually, we wouldn't need Mahout unless we processed a large dataset stored in a distributed filesystem. But for the sake of this example, we'll use our tiny toy dataset and "pretend" it was too big to fit onto a single machine. - -*Note: You can incrementally follow the example by copy-and-pasting the code into your running Mahout shell.* - -Mahout's linear algebra DSL has an abstraction called *DistributedRowMatrix (DRM)* which models a matrix that is partitioned by rows and stored in the memory of a cluster of machines. We use ```dense()``` to create a dense in-memory matrix from our toy dataset and use ```drmParallelize``` to load it into the cluster, "mimicking" a large, partitioned dataset. - -<div class="codehilite"><pre> -val drmData = drmParallelize(dense( - (2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios - (1, 2, 12, 12, 18.042851), // Cap'n'Crunch - (1, 1, 12, 13, 22.736446), // Cocoa Puffs - (2, 1, 11, 13, 32.207582), // Froot Loops - (1, 2, 12, 11, 21.871292), // Honey Graham Ohs - (2, 1, 16, 8, 36.187559), // Wheaties Honey Gold - (6, 2, 17, 1, 50.764999), // Cheerios - (3, 2, 13, 7, 40.400208), // Clusters - (3, 3, 13, 4, 45.811716)), // Great Grains Pecan - numPartitions = 2); -</pre></div> - -Have a look at this matrix. The first four columns represent the ingredients -(our features) and the last column (the rating) is the target variable for -our regression. 
[Linear regression](https://en.wikipedia.org/wiki/Linear_regression) -assumes that the **target variable** `\(\mathbf{y}\)` is generated by the -linear combination of **the feature matrix** `\(\mathbf{X}\)` with the -**parameter vector** `\(\boldsymbol{\beta}\)` plus the - **noise** `\(\boldsymbol{\varepsilon}\)`, summarized in the formula -`\(\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\)`. -Our goal is to find an estimate of the parameter vector -`\(\boldsymbol{\beta}\)` that explains the data very well. - -As a first step, we extract `\(\mathbf{X}\)` and `\(\mathbf{y}\)` from our data matrix. We get *X* by slicing: we take all rows (denoted by ```::```) and the first four columns, which have the ingredients in milligrams as content. Note that the result is again a DRM. The shell will not execute this code yet, it saves the history of operations and defers the execution until we really access a result. **Mahout's DSL automatically optimizes and parallelizes all operations on DRMs and runs them on Apache Spark.** - -<div class="codehilite"><pre> -val drmX = drmData(::, 0 until 4) -</pre></div> - -Next, we extract the target variable vector *y*, the fifth column of the data matrix. We assume this one fits into our driver machine, so we fetch it into memory using ```collect```: - -<div class="codehilite"><pre> -val y = drmData.collect(::, 4) -</pre></div> - -Now we are ready to think about a mathematical way to estimate the parameter vector *β*. A simple textbook approach is [ordinary least squares (OLS)](https://en.wikipedia.org/wiki/Ordinary_least_squares), which minimizes the sum of residual squares between the true target variable and the prediction of the target variable. In OLS, there is even a closed form expression for estimating `\(\boldsymbol{\beta}\)` as -`\(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}\)`. - -The first thing which we compute for this is `\(\mathbf{X}^{\top}\mathbf{X}\)`. 
The code for doing this in Mahout's Scala DSL maps directly to the mathematical formula. The operation ```.t()``` transposes a matrix and, analogous to R, ```%*%``` denotes matrix multiplication. - -<div class="codehilite"><pre> -val drmXtX = drmX.t %*% drmX -</pre></div> - -The same is true for computing `\(\mathbf{X}^{\top}\mathbf{y}\)`. We can simply type the math as Scala expressions into the shell. Here, *X* lives in the cluster, while *y* is in the memory of the driver, and the result is a DRM again. -<div class="codehilite"><pre> -val drmXty = drmX.t %*% y -</pre></div> - -We're nearly done. The next step we take is to fetch `\(\mathbf{X}^{\top}\mathbf{X}\)` and -`\(\mathbf{X}^{\top}\mathbf{y}\)` into the memory of our driver machine (we are targeting -feature matrices that are tall and skinny, -so we can assume that `\(\mathbf{X}^{\top}\mathbf{X}\)` is small enough -to fit in). Then, we provide them to an in-memory solver (Mahout provides -an analog to R's ```solve()``` for that) which computes ```beta```, our -OLS estimate of the parameter vector `\(\boldsymbol{\beta}\)`. - -<div class="codehilite"><pre> -val XtX = drmXtX.collect -val Xty = drmXty.collect(::, 0) - -val beta = solve(XtX, Xty) -</pre></div> - -That's it! We have implemented a distributed linear regression algorithm -on Apache Spark. I hope you agree that we didn't have to worry a lot about -parallelization and distributed systems. The goal of Mahout's linear algebra -DSL is to abstract away the ugliness of programming a distributed system -as much as possible, while still retaining decent performance and -scalability. - -We can now check how well our model fits its training data. -First, we multiply the feature matrix `\(\mathbf{X}\)` by our estimate of -`\(\boldsymbol{\beta}\)`. 
Then, we look at the difference (via L2-norm) of -the target variable `\(\mathbf{y}\)` to the fitted target variable: - -<div class="codehilite"><pre> -val yFitted = (drmX %*% beta).collect(::, 0) -(y - yFitted).norm(2) -</pre></div> - -We hope we have shown that Mahout's shell allows people to interactively and incrementally write algorithms. We have entered a lot of individual commands, one by one, until we got the desired results. We can now refactor a little by wrapping our statements into easy-to-use functions. The definition of functions follows standard Scala syntax. - -We put all the commands for ordinary least squares into a function ```ols```. - -<div class="codehilite"><pre> -def ols(drmX: DrmLike[Int], y: Vector) = - solve(drmX.t %*% drmX, drmX.t %*% y)(::, 0) - -</pre></div> - -Note that the DSL declares an implicit `collect` if coercion rules require an in-core argument. Hence, we can simply -skip explicit `collect`s. - -Next, we define a function ```goodnessOfFit``` that tells how well a model fits the target variable: - -<div class="codehilite"><pre> -def goodnessOfFit(drmX: DrmLike[Int], beta: Vector, y: Vector) = { - val fittedY = (drmX %*% beta).collect(::, 0) - (y - fittedY).norm(2) -} -</pre></div> - -So far we have left out an important aspect of a standard linear regression -model. Usually there is a constant bias term added to the model. Without -that, our model always crosses through the origin and we only learn the -right angle. An easy way to add such a bias term to our model is to add a -column of ones to the feature matrix `\(\mathbf{X}\)`. -The corresponding weight in the parameter vector will then be the bias term. - -Here is how we add a bias column: - -<div class="codehilite"><pre> -val drmXwithBiasColumn = drmX cbind 1 -</pre></div> - -Now we can give the newly created DRM ```drmXwithBiasColumn``` to our model fitting method ```ols``` and see how well the resulting model fits the training data with ```goodnessOfFit```. 
You should see a large improvement in the result. - -<div class="codehilite"><pre> -val betaWithBiasTerm = ols(drmXwithBiasColumn, y) -goodnessOfFit(drmXwithBiasColumn, betaWithBiasTerm, y) -</pre></div> - -As a further optimization, we can make use of the DSL's caching functionality. We use ```drmXwithBiasColumn``` repeatedly as input to a computation, so it might be beneficial to cache it in memory. This is achieved by calling ```checkpoint()```. In the end, we remove it from the cache with ```uncache()```: - -<div class="codehilite"><pre> -val cachedDrmX = drmXwithBiasColumn.checkpoint() - -val betaWithBiasTerm = ols(cachedDrmX, y) -val goodness = goodnessOfFit(cachedDrmX, betaWithBiasTerm, y) - -cachedDrmX.uncache() - -goodness -</pre></div> - - -Liked what you saw? Check out Mahout's overview for the [Scala and Spark bindings](https://mahout.apache.org/users/sparkbindings/home.html). \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/completed/twenty-newsgroups.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/twenty-newsgroups.md b/website/old_site_migration/completed/twenty-newsgroups.md deleted file mode 100644 index 472aaf6..0000000 --- a/website/old_site_migration/completed/twenty-newsgroups.md +++ /dev/null @@ -1,179 +0,0 @@ ---- -layout: default -title: Twenty Newsgroups -theme: - name: retro-mahout ---- - - -<a name="TwentyNewsgroups-TwentyNewsgroupsClassificationExample"></a> -## Twenty Newsgroups Classification Example - -<a name="TwentyNewsgroups-Introduction"></a> -## Introduction - -The 20 newsgroups dataset is a collection of approximately 20,000 -newsgroup documents, partitioned (nearly) evenly across 20 different -newsgroups. The 20 newsgroups collection has become a popular data set for -experiments in text applications of machine learning techniques, such as -text classification and text clustering. 
We will use the [Mahout CBayes](http://mahout.apache.org/users/mapreduce/classification/bayesian.html) -classifier to create a model that would classify a new document into one of -the 20 newsgroups. - -<a name="TwentyNewsgroups-Prerequisites"></a> -### Prerequisites - -* Mahout has been downloaded ([instructions here](https://mahout.apache.org/general/downloads.html)) -* Maven is available -* Your environment has the following variables: - * **HADOOP_HOME** Environment variable that refers to where Hadoop lives - * **MAHOUT_HOME** Environment variable that refers to where Mahout lives - -<a name="TwentyNewsgroups-Instructionsforrunningtheexample"></a> -### Instructions for running the example - -1. If running Hadoop in cluster mode, start the Hadoop daemons by executing the following commands: - - $ cd $HADOOP_HOME/bin - $ ./start-all.sh - - Otherwise: - - $ export MAHOUT_LOCAL=true - -2. In the trunk directory of Mahout, compile and install Mahout: - - $ cd $MAHOUT_HOME - $ mvn -DskipTests clean install - -3. Run the [20 newsgroups example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) by executing: - - $ ./examples/bin/classify-20newsgroups.sh - -4. You will be prompted to select a classification algorithm: - - 1. Complement Naive Bayes - 2. Naive Bayes - 3. Stochastic Gradient Descent - -Select 1 and the script will perform the following: - -1. Create a working directory for the dataset and all input/output. -2. Download and extract the *20news-bydate.tar.gz* from the [20 newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) to the working directory. -3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile. -4. Convert and preprocess the dataset into a < Text, VectorWritable > SequenceFile containing term frequencies for each document. -5. Split the preprocessed dataset into training and testing sets. -6. Train the classifier. -7. Test the classifier. 
- - -Output should look something like: - - - ======================================================= - Confusion Matrix - ------------------------------------------------------- - a b c d e f g h i j k l m n o p q r s t <--Classified as - 381 0 0 0 0 9 1 0 0 0 1 0 0 2 0 1 0 0 3 0 |398 a=rec.motorcycles - 1 284 0 0 0 0 1 0 6 3 11 0 66 3 0 6 0 4 9 0 |395 b=comp.windows.x - 2 0 339 2 0 3 5 1 0 0 0 0 1 1 12 1 7 0 2 0 |376 c=talk.politics.mideast - 4 0 1 327 0 2 2 0 0 2 1 1 0 5 1 4 12 0 2 0 |364 d=talk.politics.guns - 7 0 4 32 27 7 7 2 0 12 0 0 6 0 100 9 7 31 0 0 |251 e=talk.religion.misc - 10 0 0 0 0 359 2 2 0 0 3 0 1 6 0 1 0 0 11 0 |396 f=rec.autos - 0 0 0 0 0 1 383 9 1 0 0 0 0 0 0 0 0 3 0 0 |397 g=rec.sport.baseball - 1 0 0 0 0 0 9 382 0 0 0 0 1 1 1 0 2 0 2 0 |399 h=rec.sport.hockey - 2 0 0 0 0 4 3 0 330 4 4 0 5 12 0 0 2 0 12 7 |385 i=comp.sys.mac.hardware - 0 3 0 0 0 0 1 0 0 368 0 0 10 4 1 3 2 0 2 0 |394 j=sci.space - 0 0 0 0 0 3 1 0 27 2 291 0 11 25 0 0 1 0 13 18|392 k=comp.sys.ibm.pc.hardware - 8 0 1 109 0 6 11 4 1 18 0 98 1 3 11 10 27 1 1 0 |310 l=talk.politics.misc - 0 11 0 0 0 3 6 0 10 6 11 0 299 13 0 2 13 0 7 8 |389 m=comp.graphics - 6 0 1 0 0 4 2 0 5 2 12 0 8 321 0 4 14 0 8 6 |393 n=sci.electronics - 2 0 0 0 0 0 4 1 0 3 1 0 3 1 372 6 0 2 1 2 |398 o=soc.religion.christian - 4 0 0 1 0 2 3 3 0 4 2 0 7 12 6 342 1 0 9 0 |396 p=sci.med - 0 1 0 1 0 1 4 0 3 0 1 0 8 4 0 2 369 0 1 1 |396 q=sci.crypt - 10 0 4 10 1 5 6 2 2 6 2 0 2 1 86 15 14 152 0 1 |319 r=alt.atheism - 4 0 0 0 0 9 1 1 8 1 12 0 3 0 2 0 0 0 341 2 |390 s=misc.forsale - 8 5 0 0 0 1 6 0 8 5 50 0 40 2 1 0 9 0 3 256|394 t=comp.os.ms-windows.misc - ======================================================= - Statistics - ------------------------------------------------------- - Kappa 0.8808 - Accuracy 90.8596% - Reliability 86.3632% - Reliability (standard deviation) 0.2131 - - - - - -<a name="TwentyNewsgroups-ComplementaryNaiveBayes"></a> -## End to end commands to build a CBayes model for 20 newsgroups -The [20 
newsgroups example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) issues the following commands as outlined above. We can build a CBayes classifier from the command line by following the process in the script: - -*Be sure that **MAHOUT_HOME**/bin and **HADOOP_HOME**/bin are in your **$PATH*** - -1. Create a working directory for the dataset and all input/output. - - $ export WORK_DIR=/tmp/mahout-work-${USER} - $ mkdir -p ${WORK_DIR} - -2. Download and extract the *20news-bydate.tar.gz* from the [20newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) to the working directory. - - $ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz \ - -o ${WORK_DIR}/20news-bydate.tar.gz - $ mkdir -p ${WORK_DIR}/20news-bydate - $ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd .. - $ mkdir ${WORK_DIR}/20news-all - $ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all - * If you're running on a Hadoop cluster: - - $ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all - -3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile. - - $ mahout seqdirectory \ - -i ${WORK_DIR}/20news-all \ - -o ${WORK_DIR}/20news-seq \ - -ow - -4. Convert and preprocess the dataset into a < Text, VectorWritable > SequenceFile containing term frequencies for each document. - - $ mahout seq2sparse \ - -i ${WORK_DIR}/20news-seq \ - -o ${WORK_DIR}/20news-vectors \ - -lnorm \ - -nv \ - -wt tfidf -If we wanted to use different parsing methods or transformations on the term frequency vectors, we could supply different options here, e.g. -ng 2 for bigrams or -n 2 for L2 length normalization. See the [Creating vectors from text](http://mahout.apache.org/users/basics/creating-vectors-from-text.html) page for a list of all seq2sparse options. - -5. Split the preprocessed dataset into training and testing sets. 
- - $ mahout split \ - -i ${WORK_DIR}/20news-vectors/tfidf-vectors \ - --trainingOutput ${WORK_DIR}/20news-train-vectors \ - --testOutput ${WORK_DIR}/20news-test-vectors \ - --randomSelectionPct 40 \ - --overwrite --sequenceFiles -xm sequential - -6. Train the classifier. - - $ mahout trainnb \ - -i ${WORK_DIR}/20news-train-vectors \ - -el \ - -o ${WORK_DIR}/model \ - -li ${WORK_DIR}/labelindex \ - -ow \ - -c - -7. Test the classifier. - - $ mahout testnb \ - -i ${WORK_DIR}/20news-test-vectors \ - -m ${WORK_DIR}/model \ - -l ${WORK_DIR}/labelindex \ - -ow \ - -o ${WORK_DIR}/20news-testing \ - -c - - - \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/completed/who-we-are.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/who-we-are.md b/website/old_site_migration/completed/who-we-are.md deleted file mode 100644 index 24f493b..0000000 --- a/website/old_site_migration/completed/who-we-are.md +++ /dev/null @@ -1,62 +0,0 @@ ---- -layout: default -title: Who We Are -theme: - name: retro-mahout ---- - -<a name="WhoWeAre-Whoweare"></a> -# Who we are - -Apache Mahout is maintained by a team of volunteer developers. - -<a name="WhoWeAre-CoreCommitters"></a> -## Core Committers - -(Please keep the list below in alphabetical order by first name.) - -Name | Mail | PMC | Comment -----|---------|------|---------- -Anand Avati | avati@... | No | Twitter: @anandavati -Andrew Musselman | akm@... | Yes | Twitter: @akm -Andrew Palumbo | apalumbo@... | Yes (Chair) | | -Benson Margulies | bimargulies@... | Yes | | -Dan Filimon | dfilimon@... | No | | -Dmitriy Lyubimov | dlyubimov@... | No (Emeritus) | -Drew Farris | drew@... | Yes | | -Ellen Friedman | ellenf@... | No | Twitter: @Ellen_Friedman -Frank Scholten | frankscholten@... | No | | -Gokhan Capan | gcapan@... 
| No | <a href="http://www.linkedin.com/in/gokhancapan">LinkedIn Profile</a> -Grant Ingersoll | gsingers@... 
| Yes | Twitter: @gsingers -Isabel Drost-Fromm | isabel@... | Yes | Passion for free software (development, but to some extent also the political and economic implications), interested in agile development and project management, lives in Germany. Follow me on Twitter @MaineC -Jacob Alexander Mannix | jmannix@... | Yes | | -Jeff Eastman | jeastman@... | No (Emeritus) | -Paritosh Ranjan | pranjan@... | Yes | Twitter: @paritoshranjan -Pat Ferrel | pat@... | Yes | Twitter: @occam -Robin Anil | robinanil@... | Yes | | -Sean Owen | srowen@... | No (Emeritus) | -Sebastian Schelter | ssc@... | Yes | | -Shannon Quinn | squinn@... | No | | -Stevo Slavić | sslavic@... | No | Twitter: @sslavic -Suneel Marthi | smarthi@... | Yes | Twitter: @suneelmarthi -Ted Dunning | tdunning@... | Yes | -Tom Pierce | tcp@... | No | | - -<a name="WhoWeAre-EmeritusCommitters"></a> -## Emeritus Committers - -* Niranjan Balasubramanian (nbalasub@...) -* Otis Gospodnetic (otis@...) -* David Hall (dlwh@...) -* Erik Hatcher (ehatcher@...) -* Ozgur Yilmazel (oyilmazel@...) -* Dawid Weiss (dweiss@...) -* Karl Wettin (kalle@...) -* AbdelHakim Deneche (adeneche@...) - -Note that the email addresses above end with @apache.org. - -<a name="WhoWeAre-Contributors"></a> -## Contributors - -Apache Mahout contributors and their contributions to individual issues can be found at Apache <a href="http://issues.apache.org/jira/browse/MAHOUT">JIRA</a>. 
\ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/completed/wikipedia-classifier-example.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/wikipedia-classifier-example.md b/website/old_site_migration/completed/wikipedia-classifier-example.md deleted file mode 100644 index 9df07da..0000000 --- a/website/old_site_migration/completed/wikipedia-classifier-example.md +++ /dev/null @@ -1,57 +0,0 @@ ---- -layout: default -title: Wikipedia XML parser and Naive Bayes Example -theme: - name: retro-mahout ---- -# Wikipedia XML parser and Naive Bayes Classifier Example - -## Introduction -Mahout has an [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1] which will download a recent XML dump of the (entire if desired) [English Wikipedia database](http://dumps.wikimedia.org/enwiki/latest/). After running the classification script, you can use the [document classification script](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala) from the Mahout [spark-shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html) to vectorize and classify text from outside of the training and testing corpus using a model built on the Wikipedia dataset. - -You can run this script to build and test a Naive Bayes classifier for option (1) 10 arbitrary countries or option (2) 2 countries (United States and United Kingdom). - -## Overview - -To run the example, simply execute the `$MAHOUT_HOME/examples/bin/classify-wikipedia.sh` script. - -By default the script is set to run on a medium sized Wikipedia XML dump. 
To run on the full set (the entire English Wikipedia) you can change the download by commenting out line 78, and uncommenting line 80 of [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1]. 
However, this is not recommended unless you have the resources to do so. *Be sure to clean your work directory when changing datasets (option 3).* - -The step by step process for creating a Naive Bayes Classifier for the Wikipedia XML dump is very similar to that for [creating a 20 Newsgroups Classifier](http://mahout.apache.org/users/classification/twenty-newsgroups.html) [4]. The only difference is that instead of running `$mahout seqdirectory` on the unzipped 20 Newsgroups file, you'll run `$mahout seqwiki` on the unzipped Wikipedia XML dump. - - $ mahout seqwiki - -The above command launches `WikipediaToSequenceFile.java` which accepts a text file of categories [3] and starts an MR job to parse each document in the XML file. This process will seek to extract documents with a Wikipedia category tag which (exactly, if the `-exactMatchOnly` option is set) matches a line in the category file. If no match is found and the `-all` option is set, the document will be dumped into an "unknown" category. The documents will then be written out as a `<Text,Text>` sequence file of the form (K:/category/document_title , V: document). - -There are 3 different example category files available in the /examples/src/test/resources -directory: country.txt, country10.txt and country2.txt. You can edit these categories to extract a different corpus from the Wikipedia dataset. - -The CLI options for `seqwiki` are as follows: - - --input (-i) input pathname String - --output (-o) the output pathname String - --categories (-c) the file containing the Wikipedia categories - --exactMatchOnly (-e) if set, then the Wikipedia category must match - exactly instead of simply containing the category string - --all (-all) if set select all categories - --removeLabels (-rl) if set, remove [[Category:labels]] from document text after extracting label. 
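The matching rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the code in `WikipediaToSequenceFile.java`: the function name `label_for`, its parameters, and the case-insensitive comparison are assumptions made for the example.

```python
def label_for(doc_categories, wanted, exact_match_only=False, keep_all=False):
    """Return the label for a document, or None if the document is skipped.

    doc_categories: category tags found in one document (hypothetical input).
    wanted: lines read from the category file.
    exact_match_only mirrors --exactMatchOnly; keep_all mirrors --all.
    """
    for tag in doc_categories:
        tag_norm = tag.strip().lower()  # assumption: case-insensitive matching
        for category in wanted:
            cat_norm = category.strip().lower()
            if exact_match_only:
                matched = tag_norm == cat_norm   # tag must equal the category
            else:
                matched = cat_norm in tag_norm   # tag only needs to contain it
            if matched:
                return category.strip()
    # No category matched: keep the document only when --all is set.
    return "unknown" if keep_all else None
```

Under these assumptions, a tag such as "History of the United Kingdom" is labeled "United Kingdom" by default, is skipped when exact matching is requested, and an unrelated document is labeled "unknown" only when the `--all` behavior is enabled.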
- - -After `seqwiki`, the script runs `seq2sparse`, `split`, `trainnb` and `testnb` as in the [step by step 20newsgroups example](http://mahout.apache.org/users/classification/twenty-newsgroups.html). When all of the jobs have finished, a confusion matrix will be displayed. - -#Resources - -[1] [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) - -[2] [Document classification script for the Mahout Spark Shell](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala) - -[3] [Example category file](https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt) - -[4] [Step by step instructions for building a Naive Bayes classifier for 20newsgroups from the command line](http://mahout.apache.org/users/classification/twenty-newsgroups.html) - -[5] [Mahout MapReduce Naive Bayes](http://mahout.apache.org/users/classification/bayesian.html) - -[6] [Mahout Spark Naive Bayes](http://mahout.apache.org/users/algorithms/spark-naive-bayes.html) - -[7] [Mahout Scala Spark and H2O Bindings](http://mahout.apache.org/users/sparkbindings/home.html) - http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/dont_migrate/algorithms.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/dont_migrate/algorithms.md b/website/old_site_migration/dont_migrate/algorithms.md deleted file mode 100644 index c3a7e4f..0000000 --- a/website/old_site_migration/dont_migrate/algorithms.md +++ /dev/null @@ -1,58 +0,0 @@ ---- -layout: default -title: Algorithms -theme: - name: retro-mahout ---- - -NOTE: As we move away from MapReduce, all MRs are deprecated. 
If anything, maybe move this to the MapReduce home page and drop the Spark, Flink, H2O columns ---- -*Mahout 0.12.0 Features by Engine* ---- - -| | **Single Machine** | [**MapReduce**](http://hadoop.apache.org/)| [**Spark**](https://spark.apache.org/) | [**H2O**](http://0xdata.com/) | [**Flink**](https://flink.apache.org/) | ----------------------------------------------|:----------------:|:-----------:|:------:|:---:|:----:| -**Mahout Math-Scala Core Library and Scala DSL**| -| [Mahout Distributed BLAS. Distributed Row Matrix API with R and Matlab like operators. Distributed ALS, SPCA, SSVD, thin-QR. Similarity Analysis](http://mahout.apache.org/users/sparkbindings/home.html). | | | [x](https://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf) | [x](https://github.com/apache/mahout/tree/master/h2o) |[x](https://github.com/apache/mahout/tree/flink-binding/flink) -|| -**Mahout Interactive Shell**| -| [Interactive REPL shell for Spark optimized Mahout DSL](http://mahout.apache.org/users/sparkbindings/play-with-shell.html) | | | x | -|| -**Collaborative Filtering** *with CLI drivers*| - User-Based Collaborative Filtering | *deprecated* | *deprecated*|[x](https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html) - Item-Based Collaborative Filtering | x | [x](https://mahout.apache.org/users/recommender/intro-itembased-hadoop.html) | [x](https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html) | - Matrix Factorization with ALS | x | [x](https://mahout.apache.org/users/recommender/intro-als-hadoop.html) | | - Matrix Factorization with ALS on Implicit Feedback | x | [x](https://mahout.apache.org/users/recommender/intro-als-hadoop.html) | | - Weighted Matrix Factorization, SVD++ | x | | -|| -**Classification** *with CLI drivers*| | | - Logistic Regression - trained via SGD | [*deprecated*](http://mahout.apache.org/users/classification/logistic-regression.html) | - Naive Bayes / Complementary Naive Bayes | | 
[*deprecated*](https://mahout.apache.org/users/classification/bayesian.html) | [x](https://mahout.apache.org/users/algorithms/spark-naive-bayes.html) | - Hidden Markov Models | [*deprecated*](https://mahout.apache.org/users/classification/hidden-markov-models.html) | -|| -**Clustering** *with CLI drivers*|| - Canopy Clustering | [*deprecated*](https://mahout.apache.org/users/clustering/canopy-clustering.html) | [*deprecated*](https://mahout.apache.org/users/clustering/canopy-clustering.html)| - k-Means Clustering | [*deprecated*](https://mahout.apache.org/users/clustering/k-means-clustering.html) | [*deprecated*](https://mahout.apache.org/users/clustering/k-means-clustering.html) | - Fuzzy k-Means | [*deprecated*](https://mahout.apache.org/users/clustering/fuzzy-k-means.html) | [*deprecated*](https://mahout.apache.org/users/clustering/fuzzy-k-means.html)| - Streaming k-Means | [*deprecated*](https://mahout.apache.org/users/clustering/streaming-k-means.html) | [*deprecated*](https://mahout.apache.org/users/clustering/streaming-k-means.html) | - Spectral Clustering | | [*deprecated*](https://mahout.apache.org/users/clustering/spectral-clustering.html) | -|| -**Dimensionality Reduction** *note: most scala-based dimensionality reduction algorithms are available through the [Mahout Math-Scala Core Library for all engines](https://mahout.apache.org/users/sparkbindings/home.html)*|| - Singular Value Decomposition | *deprecated* | *deprecated* | [x](http://mahout.apache.org/users/sparkbindings/home.html) |[x](http://mahout.apache.org/users/environment/h2o-internals.html) | [x](http://mahout.apache.org/users/flinkbindings/flink-internals.html) - Lanczos Algorithm | *deprecated* | *deprecated* | - Stochastic SVD | [*deprecated*](https://mahout.apache.org/users/dim-reduction/ssvd.html) | [*deprecated*](https://mahout.apache.org/users/dim-reduction/ssvd.html) | [x](http://mahout.apache.org/users/algorithms/d-ssvd.html) | 
[x](http://mahout.apache.org/users/algorithms/d-ssvd.html)| [x](http://mahout.apache.org/users/algorithms/d-ssvd.html) - PCA (via Stochastic SVD) | *deprecated* | *deprecated* | [x](http://mahout.apache.org/users/sparkbindings/home.html) |[x](http://mahout.apache.org/users/environment/h2o-internals.html) | [x](http://mahout.apache.org/users/flinkbindings/flink-internals.html) - QR Decomposition | *deprecated* | *deprecated* | [x](http://mahout.apache.org/users/algorithms/d-qr.html) |[x](http://mahout.apache.org/users/algorithms/d-qr.html) | [x](http://mahout.apache.org/users/algorithms/d-qr.html) -|| -**Topic Models**|| - Latent Dirichlet Allocation | *deprecated* | *deprecated* | -|| -**Miscellaneous**|| - RowSimilarityJob | | *deprecated* | [x](https://github.com/apache/mahout/blob/master/spark/src/test/scala/org/apache/mahout/drivers/RowSimilarityDriverSuite.scala) | - Collocations | | [*deprecated*](https://mahout.apache.org/users/basics/collocations.html) | - Sparse TF-IDF Vectors from Text | | [*deprecated*](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) | - XML Parsing| | [*deprecated*](https://issues.apache.org/jira/browse/MAHOUT-1479?jql=text%20~%20%22wikipedia%20mahout%22) | - Email Archive Parsing | | [*deprecated*](https://github.com/apache/mahout/tree/master/integration/src/main/java/org/apache/mahout/text) | - Evolutionary Processes | [x](https://github.com/apache/mahout/tree/master/mr/src/main/java/org/apache/mahout/ep) | - - http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/dont_migrate/collections.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/dont_migrate/collections.md b/website/old_site_migration/dont_migrate/collections.md deleted file mode 100644 index 327b079..0000000 --- a/website/old_site_migration/dont_migrate/collections.md +++ /dev/null @@ -1,98 +0,0 @@ ---- -layout: default -title: Collections -theme: - name: 
retro-mahout ---- - -NOTE: I don't know when this was made, but there are lots of free data set sites now that track this... - -TODO: Organize these somehow, add one-line blurbs - -Organize by usage? (classification, recommendation etc.) - -<a name="Collections-CollectionsofCollections"></a> -## Collections of Collections - -- [ML Data](http://mldata.org/about/) - ... repository supported by Pascal 2. -- [DBPedia](http://wiki.dbpedia.org/Downloads30) -- [UCI Machine Learning Repo](http://archive.ics.uci.edu/ml/) -- [http://mloss.org/community/blog/2008/sep/19/data-sources/](http://mloss.org/community/blog/2008/sep/19/data-sources/) -- [Linked Library Data](http://ckan.net/group/lld) - via CKAN -- [InfoChimps](http://infochimps.com/) - Free and purchasable datasets -- [http://www.linkedin.com/groupItem?view=&srchtype=discussedNews&gid=3638279&item=35736572&type=member&trk=EML_anet_ac_pst_ttle](http://www.linkedin.com/groupItem?view=&srchtype=discussedNews&gid=3638279&item=35736572&type=member&trk=EML_anet_ac_pst_ttle) - LinkedIn discussion of lots of data sets - -<a name="Collections-CategorizationData"></a> -## Categorization Data - -- [20Newsgroups](http://people.csail.mit.edu/jrennie/20Newsgroups/) -- [RCV1 data set](http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm) -- [10 years of CLEF Data](http://direct.dei.unipd.it/) -- [http://ece.ut.ac.ir/DBRG/Hamshahri/](http://ece.ut.ac.ir/DBRG/Hamshahri/) - (Approximately 160k categorized docs) -There is a newer beta version here: [http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/](http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/) - (Approximately 320k categorized docs) -- Lending Club loan data [https://www.lendingclub.com/info/download-data.action](https://www.lendingclub.com/info/download-data.action) - -<a name="Collections-RecommendationData"></a> -## Recommendation Data - -- [Book usage and recommendation data from the University of Huddersfield](http://library.hud.ac.uk/data/usagedata/) -- 
[Last.fm](http://denoiserthebetter.posterous.com/music-recommendation-datasets) - \- Non-commercial use only -- [Amazon Product Review Data via Jindal and Liu](http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) - -- Scroll down -- [GroupLens/MovieLens Movie Review Dataset](http://www.grouplens.org/node/73) - -<a name="Collections-MultilingualData"></a> -## Multilingual Data - -- [http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php](http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php) - \- 308,000 subtitle files covering about 18,900 movies in 59 languages -(July 2006 numbers). This is a curated collection of subtitles from an -aggregation site, [http://www.openSubTitles.org] -The original site, OpenSubtitles.org, is up to 1.6m subtitles files. -- [Statistical Machine Translation](http://www.statmt.org/) - \- devoted to all things language translation. Includes multilingual -corpuses of European and Canadian legal tomes. - -<a name="Collections-Geospatial"></a> -## Geospatial - -- [Natural Earth Data](http://www.naturalearthdata.com/) -- [Open Street Maps](http://wiki.openstreetmap.org/wiki/Main_Page) -And other crowd-sourced mapping data sites. - -<a name="Collections-Airline"></a> -## Airline - -- [Open Flights](http://openflights.org/) - \- Crowd-sourced database of airlines, flights, airports, times, etc. 
-- [Airline on-time information - 1987-2008](http://stat-computing.org/dataexpo/2009/) - \- 120m CSV records, 12G uncompressed - -<a name="Collections-GeneralResources"></a> -## General Resources - -- [theinfo](http://theinfo.org/) -- [WordNet](http://wordnet.princeton.edu/obtain) -- [Common Crawl](http://www.commoncrawl.org/) - \- freely available web crawl on EC2 - -<a name="Collections-Stuff"></a> -## Stuff - -- [http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html](http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html) -- [4 Universities Data Set](http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/) -- [Large crawl of Twitter](http://an.kaist.ac.kr/traces/WWW2010.html) -- [UniProt](http://beta.uniprot.org/) -- [http://www.icwsm.org/2009/data/](http://www.icwsm.org/2009/data/) -- [http://data.gov](http://data.gov) -- [http://www.ckan.net/](http://www.ckan.net/) -- [http://www.guardian.co.uk/news/datablog/2010/jan/07/government-data-world](http://www.guardian.co.uk/news/datablog/2010/jan/07/government-data-world) -- [http://data.gov.uk/](http://data.gov.uk/) -- [51,000 US Congressional Bills tagged](http://www.ark.cs.cmu.edu/bills/) http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/dont_migrate/glossary.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/dont_migrate/glossary.md b/website/old_site_migration/dont_migrate/glossary.md deleted file mode 100644 index 5ff1015..0000000 --- a/website/old_site_migration/dont_migrate/glossary.md +++ /dev/null @@ -1,15 +0,0 @@ ---- -layout: default -title: Glossary -theme: - name: retro-mahout ---- - -NOTE: no migrate- empty file. good idea though in general - - -This is a list of common glossary terms used on both the mailing lists and -around the site. 
Where possible I have tried to provide a link to more -in-depth explanations from the web. - -{children:excerpt=true|style=h4} http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/dont_migrate/mahout-benchmarks.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/dont_migrate/mahout-benchmarks.md b/website/old_site_migration/dont_migrate/mahout-benchmarks.md deleted file mode 100644 index 1502b90..0000000 --- a/website/old_site_migration/dont_migrate/mahout-benchmarks.md +++ /dev/null @@ -1,156 +0,0 @@ ---- -layout: default -title: Mahout Benchmarks -theme: - name: retro-mahout ---- - -NOTE: These are all pretty old. I think it would be cool to have a page -like this but may need total reboot... - -<a name="MahoutBenchmarks-Introduction"></a> -# Introduction - -Depending on the hardware configuration and the exact distribution of ratings over users and items, YMMV! - -<a name="MahoutBenchmarks-Recommenders"></a> -# Recommenders - -<a name="MahoutBenchmarks-ARuleofThumb"></a> -## A Rule of Thumb - -100M preferences are about the data set size where non-distributed -recommenders will outgrow a normal-sized machine (32-bit, <= 4GB RAM). Your -mileage will vary significantly with the nature of the data. - -<a name="MahoutBenchmarks-Distributedrecommendervs.Wikipedialinks(May272010)"></a> -## Distributed recommender vs. Wikipedia links (May 27 2010) - -From the mailing list: - -I just finished running a set of recommendations based on the Wikipedia -link graph, for book purposes (yeah, it's unconventional). I ran on my -laptop, but it ought to be crudely representative of how it runs in a real -cluster. - -The input is 1058MB as a text file, and contains 130M article-article -associations, from 5.7M articles to 3.8M distinct articles ("users" and -"items", respectively). I estimate cost based on Amazon's North -American small Linux-based instance pricing of $0.085/hour. 
I ran on a -dual-core laptop with plenty of RAM, allowing 1GB per worker, so this is -valid. - -In this run, I run recommendations for all 5.7M "users". You can certainly -run for any subset of all users of course. - -Phase 1 (Item ID to item index mapping) -29 minutes CPU time -$0.05 -60MB output - -Phase 2 (Create user vectors) -88 minutes CPU time -$0.13 -Output: 1159MB - -Phase 3 (Count co-occurrence) -77 hours CPU time -$6.54 -Output: 23.6GB - -Phase 4 (Partial multiply prep) -10.5 hours CPU time -$0.90 -Output: 24.6GB - -Phase 5 (Aggregate and recommend) -about 600 hours -about $51.00 -about 10GB -(I estimated these rather than let it run at home for days!) - - -Note that phases 1 and 3 may be run less frequently, and need not be run -every time. But the cost is dominated by the last step, which is most of -the work. I've ignored storage costs. - -This implies a cost of $0.01 (or about 8 instance-minutes) per 1,000 user -recommendations. That's not bad if, say, you want to update recs for your -site's 100,000 daily active users for a dollar. - -There are several levers one could pull internally to sacrifice accuracy -for speed, but it's currently set to pretty normal values. So this is just -one possibility. - -Now that's not terrible, but it is about 8x more computing than would be -needed by a non-distributed implementation *if* you could fit the whole -data set into a very large instance's memory, which is still possible at -this scale but needs a pretty big instance. That's a very apples-to-oranges -comparison of course; different algorithms, entirely different -environments. This is about the amount of overhead I'd expect from -distributing -- interesting to note how non-trivial it is. - -<a name="MahoutBenchmarks-Non-distributedrecommendervs.KDDCupdataset(March2011)"></a> -## Non-distributed recommender vs. 
KDD Cup data set (March 2011) - -(From the mailing list) - -I've been test-driving a simple application of Mahout recommenders (the -non-distributed kind) on Amazon EC2 on the new Yahoo KDD Cup data set -(kddcup.yahoo.com). - -In the spirit of open-source, like I mentioned, I'm committing the extra -code to mahout-examples that can be used to run a Recommender on the input -and output the right format. And, I'd like to publish the rough timings -too. Find all the source in org.apache.mahout.cf.taste.example.kddcup - -<a name="MahoutBenchmarks-Track1"></a> -### Track 1 - -* m2.2xlarge instance, 34.2GB RAM / 4 cores -* Steady state memory consumption: ~19GB -* Computation time: 30 hours (wall-clock time) -* CPU time per user: ~0.43 sec -* Cost on EC2: $34.20 (!) - -(Helpful hint on cost I realized after the fact: you can almost surely get -spot instances for cheaper. The maximum price this sort of instance has -gone for as a spot instance is about $0.60/hour, vs "retail price" of -$1.14/hour.) - -Resulted in an RMSE of 29.5618 (the rating scale is 0-100), which is only -good enough for 29th place at the moment. Not terrible for "out of the box" -performance -- it's just using an item-based recommender with uncentered -cosine similarity. But not really good in absolute terms. A winning -solution is going to try to factor in time, and apply more sophisticated -techniques. The best RMSE so far is about 23. - -<a name="MahoutBenchmarks-Track2"></a> -### Track 2 - -* c1.xlarge instance: 7GB RAM / 8 cores -* Steady state memory consumption: ~3.8GB -* Computation time: 4.1 hours (wall-clock time) -* CPU time per user: ~1.1 sec -* Cost on EC2: $3.20 - -For this I bothered to write a simplistic item-item similarity metric to -take into account the additional info that is available: track, artist, -album, genre. The result was comparatively better: 17.92% error rate, good -enough for 4th place at the moment. 
- -Of course, the next task is to put this through the actual distributed -processing -- that's really the appropriate solution. - -This shows you can still tackle fairly impressive scale with a -non-distributed solution. These results suggest that the largest instances -available from EC2 would accommodate almost 1 billion ratings in memory. -However, at that scale, running a user's full recommendations would easily be -measured in seconds, not milliseconds. - -<a name="MahoutBenchmarks-Clustering"></a> -# Clustering - -See [MAHOUT-588](https://issues.apache.org/jira/browse/MAHOUT-588) - - http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/dont_migrate/mahoutintegration.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/dont_migrate/mahoutintegration.md b/website/old_site_migration/dont_migrate/mahoutintegration.md deleted file mode 100644 index e2d01ea..0000000 --- a/website/old_site_migration/dont_migrate/mahoutintegration.md +++ /dev/null @@ -1,6 +0,0 @@ ---- -layout: default -title: MahoutIntegration -theme: - name: retro-mahout ---- http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/dont_migrate/recommender-overview.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/dont_migrate/recommender-overview.md b/website/old_site_migration/dont_migrate/recommender-overview.md deleted file mode 100644 index a48d47f..0000000 --- a/website/old_site_migration/dont_migrate/recommender-overview.md +++ /dev/null @@ -1,34 +0,0 @@ ---- -layout: default -title: Recommender Quickstart -theme: - name: retro-mahout ---- - -Not migrating because it seems to be the same content as intro-cooccurrence-spark.md - -# Recommender Overview - -Recommenders have changed over the years. Mahout contains a long list of them, which you can still use. 
But to get the best out of our more modern approach we'll need to think of the Recommender as a "model creation" component, supplied by Mahout's new spark-itemsimilarity job, and a "serving" component, supplied by a modern scalable search engine like Solr. - - - -To integrate with your application you will collect user interactions, storing them in a DB and also in a form usable by Mahout. The simplest way to do this is to log user interactions to csv files (user-id, item-id). The DB should be set up to contain the last n user interactions, which will form part of the query for recommendations. - -Mahout's spark-itemsimilarity will create a table of (item-id, list-of-similar-items) in csv form. Think of this as an item collection with one field containing the item-ids of similar items. Index this with your search engine. - -When your application needs recommendations for a specific person, get the latest user history of interactions from the DB and query the indicator collection with this history. You will get back an ordered list of item-ids. These are your recommendations. You may wish to filter out any that the user has already seen but that will depend on your use case. - -All ids for users and items are preserved as string tokens and so work as an external key in DBs or as doc ids for search engines; they also work as tokens for search queries. - -##References - -1. A free ebook, which talks about the general idea: [Practical Machine Learning](https://www.mapr.com/practical-machine-learning) -2. A slide deck, which talks about mixing actions or other indicators: [Creating a Multimodal Recommender with Mahout and a Search Engine](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/) -3. 
Two blog posts: [What's New in Recommenders: part #1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/) -and [What's New in Recommenders: part #2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/) -4. A post describing the log-likelihood ratio: [Surprise and Coincidence](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html). LLR is used to reduce noise in the data while keeping the calculations at O(n) complexity. - -##Mahout Model Creation - -See the page describing [*spark-itemsimilarity*](http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html) for more details. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/bayesian-commandline.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/bayesian-commandline.md b/website/old_site_migration/needs_work_convenience/bayesian-commandline.md deleted file mode 100644 index d81d3ef..0000000 --- a/website/old_site_migration/needs_work_convenience/bayesian-commandline.md +++ /dev/null @@ -1,64 +0,0 @@ ---- -layout: default -title: bayesian-commandline -theme: - name: retro-mahout ---- - -# Naive Bayes commandline documentation - -<a name="bayesian-commandline-Introduction"></a> -## Introduction - -This quick start page describes how to run the naive bayesian and -complementary naive bayesian classification algorithms on a Hadoop cluster. 
- -<a name="bayesian-commandline-Steps"></a> -## Steps - -<a name="bayesian-commandline-Testingitononesinglemachinew/ocluster"></a> -### Testing it on a single machine without a cluster - -In the examples directory type: - - mvn -q exec:java - -Dexec.mainClass="org.apache.mahout.classifier.bayes.mapreduce.bayes.<JOB>" - -Dexec.args="<OPTIONS>" - - mvn -q exec:java - -Dexec.mainClass="org.apache.mahout.classifier.bayes.mapreduce.cbayes.<JOB>" - -Dexec.args="<OPTIONS>" - - -<a name="bayesian-commandline-Runningitonthecluster"></a> -### Running it on the cluster - -* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job -will be generated in $MAHOUT_HOME/core/target/ and its name will contain -the Mahout version number. For example, when using the Mahout 0.1 release, the -job will be mahout-core-0.1.jar - -* (Optional) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh - -* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata - -* Run the Job: $HADOOP_HOME/bin/hadoop jar - - $MAHOUT_HOME/core/target/mahout-core-<MAHOUT VERSION>.job - org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesDriver <OPTIONS> - -* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output -to view all outputs. - -<a name="bayesian-commandline-Commandlineoptions"></a> -## Command line options - - BayesDriver, BayesThetaNormalizerDriver, CBayesNormalizedWeightDriver, CBayesDriver, CBayesThetaDriver, CBayesThetaNormalizerDriver, BayesWeightSummerDriver, BayesFeatureDriver, BayesTfIdfDriver Usage: - [--input <input> --output <output> --help] - - Options - - --input (-i) input The Path for input Vectors. Must be a SequenceFile of Writable, Vector. - --output (-o) output The directory pathname for output points. - --help (-h) Print out help. 
- http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/faq.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/faq.md b/website/old_site_migration/needs_work_convenience/faq.md deleted file mode 100644 index 8e1e592..0000000 --- a/website/old_site_migration/needs_work_convenience/faq.md +++ /dev/null @@ -1,105 +0,0 @@ ---- -layout: default -title: FAQ -theme: - name: retro-mahout ---- - -# The Official Mahout FAQ - -*General* - -1. [What is Apache Mahout?](#whatis) -1. [What does the name mean?](#mean) -1. [How is the name pronounced?](#pronounce) -1. [Where can I find the origins of the Mahout project?](#historical) -1. [Where can I download the Mahout logo?](#downloadlogo) -1. [Where can I download Mahout slide presentations?](#presentations) - -*Algorithms* - -1. [What algorithms are implemented in Mahout?](#algos) -1. [What algorithms are missing from Mahout?](#todo) -1. [Do I need Hadoop to run Mahout?](#hadoop) - -*Hadoop specific questions* - -1. [Mahout just won't run in parallel on my dataset. Why?](#split) - - -# *Answers* - - -## General - - -<a name="whatis"></a> -#### What is Apache Mahout? - -Apache Mahout is a suite of machine learning libraries designed to be -scalable and robust. - -<a name="mean"></a> -#### What does the name mean? - -The name [Mahout](http://en.wikipedia.org/wiki/Mahout) - was originally chosen for its association with the [Apache Hadoop](http://hadoop.apache.org) - project. A Mahout is a person who drives an elephant (hint: Hadoop's logo -is an elephant). We just wanted a name that complemented Hadoop but we see -our project as a good driver of Hadoop in the sense that we will be using -and testing it. We are not, however, implying that we are controlling -Hadoop's development. 
- -Prior to coming to the ASF, those of us working on the project plan voted between [Howdah](http://en.wikipedia.org/wiki/Howdah) (the carriage on top of an elephant) and Mahout. - -<a name="historical"></a> -#### Where can I find the origins of the Mahout project? - -See [http://ml-site.grantingersoll.com](http://web.archive.org/web/20080101233917/http://ml-site.grantingersoll.com/index.php?title=Main_Page) - for old wiki and mailing list archives (all read-only) - -Mahout was started by <a href="http://web.archive.org/web/20071228055210/http://ml-site.grantingersoll.com/index.php?title=Main_Page" class="external-link" rel="nofollow">Isabel Drost, Grant Ingersoll and Karl Wettin</a>. It <a href="http://web.archive.org/web/20080201093120/http://lucene.apache.org/#22+January+2008+-+Lucene+PMC+Approves+Mahout+Machine+Learning+Project" class="external-link" rel="nofollow">started</a> as part of the <a href="http://lucene.apache.org" class="external-link" rel="nofollow">Lucene</a> project (see the <a href="http://web.archive.org/web/20080102151102/http://ml-site.grantingersoll.com/index.php?title=Incubator_proposal" class="external-link" rel="nofollow">original proposal</a>) and went on to become a top level project in April of 2010.</p><p style="text-align: left;">The original goal was to implement all 10 algorithms from Andrew Ng's paper "<a href="http://ai.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf" class="external-link" rel="nofollow">Map-Reduce for Machine Learning on Multicore</a>"</p> - -<a name="pronounce"></a> -#### How is the name pronounced? - -There are some disagreements about how to pronounce the name. Webster's has it as muh-hout (as in ["out"](http://dictionary.reference.com/browse/mahout)), but the Sanskrit/Hindi origins pronounce it as "muh-hoot". The second pronunciation suggests a nice pun on the Hebrew word מהות meaning "essence or truth". - -<a name="downloadlogo"></a> -#### Where can I download the Mahout logo? 
- -See [MAHOUT-335](https://issues.apache.org/jira/browse/MAHOUT-335) - - -<a name="presentations"></a> -#### Where can I download Mahout slide presentations? - -The [Books, Tutorials and Talks](https://mahout.apache.org/general/books-tutorials-and-talks.html) - page contains an overview of a wide variety of presentations with links to slides where available. - -## Algorithms - -<a name="algos"></a> -#### What algorithms are implemented in Mahout? - -We are interested in a wide variety of machine learning algorithms, many of -which are already implemented in Mahout. You can find a list [here](https://mahout.apache.org/users/basics/algorithms.html). - -<a name="todo"></a> -#### What algorithms are missing from Mahout? - -There are many machine learning algorithms that we would like to have in -Mahout. If you have an algorithm or an improvement to an algorithm that you would -like to implement, start a discussion on our [mailing list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html). - -<a name="hadoop"></a> -#### Do I need Hadoop to use Mahout? - -There are a number of algorithm implementations that require no Hadoop dependencies whatsoever; consult the [algorithms list](https://mahout.apache.org/users/basics/algorithms.html). In the future, we might provide more algorithm implementations on platforms more suitable for machine learning such as [Apache Spark](http://spark.apache.org). - -## Hadoop specific questions -<a name="split"></a> -#### Mahout just won't run in parallel on my dataset. Why? - -If you are running training on a Hadoop cluster, keep in mind that the number of mappers started is governed by the size of the input data and the configured split/block size of your cluster. As a rule of thumb, -anything below 100MB in size won't be split by default. 
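As a rough illustration of that rule of thumb, the number of map tasks can be estimated from the input size and the split size. The sketch below is a simplification under stated assumptions: the function name `estimated_mappers` is hypothetical, the split size must be supplied by the caller because it depends on the cluster configuration, and real Hadoop input formats apply further rules (per-file splits, minimum and maximum split settings, non-splittable compression).

```python
import math

MB = 1024 * 1024

def estimated_mappers(input_bytes, split_bytes):
    """Estimate map tasks for one splittable input file (simplified model)."""
    if input_bytes <= 0:
        return 0
    # One mapper per split; any non-empty input gets at least one mapper.
    return max(1, math.ceil(input_bytes / split_bytes))
```

With a hypothetical 128 MB split size, `estimated_mappers(80 * MB, 128 * MB)` gives a single mapper, so an 80 MB training set would not run in parallel, while a 1 GB input would be divided across several mappers.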
\ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/clustering/20newsgroups.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/20newsgroups.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/20newsgroups.md deleted file mode 100644 index da5174f..0000000 --- a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/20newsgroups.md +++ /dev/null @@ -1,11 +0,0 @@ ---- -layout: default -title: 20Newsgroups -theme: - name: retro-mahout ---- - -<a name="20Newsgroups-NaiveBayesusing20NewsgroupsData"></a> -# Naive Bayes using 20 Newsgroups Data - -See [https://issues.apache.org/jira/browse/MAHOUT-9](https://issues.apache.org/jira/browse/MAHOUT-9) http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-clustering.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-clustering.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-clustering.md deleted file mode 100644 index eb4c845..0000000 --- a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-clustering.md +++ /dev/null @@ -1,188 +0,0 @@ ---- -layout: default -title: Canopy Clustering -theme: - name: retro-mahout ---- - -<a name="CanopyClustering-CanopyClustering"></a> -# Canopy Clustering - -[Canopy Clustering](http://www.kamalnigam.com/papers/canopy-kdd00.pdf) - is a very simple, fast and surprisingly accurate method for grouping -objects into clusters. All objects are represented as a point in a -multidimensional feature space. 
The algorithm uses a fast approximate -distance metric and two distance thresholds T1 > T2 for processing. The -basic algorithm is to begin with a set of points and remove one at random. -Create a Canopy containing this point and iterate through the remainder of -the point set. At each point, if its distance from the first point is < T1, -then add the point to the cluster. If, in addition, the distance is < T2, -then remove the point from the set. This way points that are very close to -the original will avoid all further processing. The algorithm loops until -the initial set is empty, accumulating a set of Canopies, each containing -one or more points. A given point may occur in more than one Canopy. - -Canopy Clustering is often used as an initial step in more rigorous -clustering techniques, such as [K-Means Clustering](k-means-clustering.html) -. By starting with an initial clustering the number of more expensive -distance measurements can be significantly reduced by ignoring points -outside of the initial canopies. - -**WARNING**: Canopy is deprecated in the latest release and will be removed once streaming k-means becomes stable enough. - -<a name="CanopyClustering-Strategyforparallelization"></a> -## Strategy for parallelization - -Looking at the sample Hadoop implementation in [http://code.google.com/p/canopy-clustering/](http://code.google.com/p/canopy-clustering/) - the processing is done in 3 M/R steps: -1. The data is massaged into suitable input format -1. Each mapper performs canopy clustering on the points in its input set and -outputs its canopies' centers -1. The reducer clusters the canopy centers to produce the final canopy -centers -1. 
The points are then clustered into these final canopies - -Some ideas can be found in [Cluster computing and MapReduce](https://www.youtube.com/watch?v=yjPBkvYh-ss&list=PLEFAB97242917704A) - lecture video series \[by Google(r)\]; Canopy Clustering is discussed in [lecture #4](https://www.youtube.com/watch?v=1ZDybXl212Q) -. Finally here is the [Wikipedia page](http://en.wikipedia.org/wiki/Canopy_clustering_algorithm) -. - -<a name="CanopyClustering-Designofimplementation"></a> -## Design of implementation - -The implementation accepts as input Hadoop SequenceFiles containing -multidimensional points (VectorWritable). Points may be expressed either as -dense or sparse Vectors and processing is done in two phases: Canopy -generation and, optionally, Clustering. - -<a name="CanopyClustering-Canopygenerationphase"></a> -### Canopy generation phase - -During the map step, each mapper processes a subset of the total points and -applies the chosen distance measure and thresholds to generate canopies. In -the mapper, each point which is found to be within an existing canopy will -be added to an internal list of Canopies. After observing all its input -vectors, the mapper updates all of its Canopies and normalizes their totals -to produce canopy centroids which are output, using a constant key -("centroid") to a single reducer. The reducer receives all of the initial -centroids and again applies the canopy measure and thresholds to produce a -final set of canopy centroids which is output (i.e. clustering the cluster -centroids). The reducer output format is: SequenceFile(Text, Canopy) with -the _key_ encoding the canopy identifier. - -<a name="CanopyClustering-Clusteringphase"></a> -### Clustering phase - -During the clustering phase, each mapper reads the Canopies produced by the -first phase. 
Since all mappers have the same canopy definitions, their -outputs will be combined during the shuffle so that each reducer (many are -allowed here) will see all of the points assigned to one or more canopies. -The output format will then be: SequenceFile(IntWritable, -WeightedVectorWritable) with the _key_ encoding the canopyId. The -WeightedVectorWritable has two fields: a double weight and a VectorWritable -vector. Together they encode the probability that each vector is a member -of the given canopy. - -<a name="CanopyClustering-RunningCanopyClustering"></a> -## Running Canopy Clustering - -The canopy clustering algorithm may be run using a command-line invocation -on CanopyDriver.main or by making a Java call to CanopyDriver.run(...). -Both require several arguments: - -Invocation using the command line takes the form: - - - bin/mahout canopy \ - -i <input vectors directory> \ - -o <output working directory> \ - -dm <DistanceMeasure> \ - -t1 <T1 threshold> \ - -t2 <T2 threshold> \ - -t3 <optional reducer T1 threshold> \ - -t4 <optional reducer T2 threshold> \ - -cf <optional cluster filter size (default: 0)> \ - -ow <overwrite output directory if present> \ - -cl <run input vector clustering after computing Canopies> \ - -xm <execution method: sequential or mapreduce> - - -Invocation using Java involves supplying the following arguments: - -1. input: a file path string to a directory containing the input data set, a -SequenceFile(WritableComparable, VectorWritable). The sequence file _key_ -is not used. -1. output: a file path string to an empty directory which is used for all -output from the algorithm. -1. measure: the fully-qualified class name of an instance of DistanceMeasure -which will be used for the clustering. -1. t1: the T1 distance threshold used for clustering. -1. t2: the T2 distance threshold used for clustering. -1. t3: the optional T1 distance threshold used by the reducer for -clustering. If not specified, T1 is used by the reducer. -1. 
t4: the optional T2 distance threshold used by the reducer for -clustering. If not specified, T2 is used by the reducer. -1. clusterFilter: the minimum size for canopies to be output by the -algorithm. Affects both sequential and mapreduce execution modes, and -mapper and reducer outputs. -1. runClustering: a boolean indicating, if true, that the clustering step is -to be executed after clusters have been determined. -1. runSequential: a boolean indicating, if true, that the computation is to -be run in memory using the reference Canopy implementation. Note that the -sequential implementation performs a single pass through the input vectors -whereas the MapReduce implementation performs two passes (once in the -mapper and again in the reducer). The MapReduce implementation will -typically produce fewer clusters than the sequential implementation as a -result. - -After running the algorithm, the output directory will contain: -1. clusters-0: a directory containing SequenceFiles(Text, Canopy) produced -by the algorithm. The Text _key_ contains the cluster identifier of the -Canopy. -1. clusteredPoints: (if runClustering is enabled) a directory containing -SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is -the canopyId. The WeightedVectorWritable _value_ is a bean containing a -double _weight_ and a VectorWritable _vector_ where the weight indicates -the probability that the vector is a member of the canopy. For canopy -clustering, the weights are computed as 1/(1+distance) where the distance -is between the cluster center and the vector using the chosen -DistanceMeasure. - -<a name="CanopyClustering-Examples"></a> -# Examples - -The following images illustrate Canopy clustering applied to a set of -randomly-generated 2-d data points. The points are generated using a normal -distribution centered at a mean location and with a constant standard -deviation. 
See the README file in the [/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/clustering/display/README.txt) - for details on running similar examples. - -The points are generated as follows: - -* 500 samples m=\[1.0, 1.0\] sd=3.0 -* 300 samples m=\[1.0, 0.0\] sd=0.5 -* 300 samples m=\[0.0, 2.0\] sd=0.1 - -In the first image, the points are plotted and the 3-sigma boundaries of -their generator are superimposed. - - - -In the second image, the resulting canopies are shown superimposed upon the -sample data. Each canopy is represented by two circles, with radius T1 and -radius T2. - - - -The third image uses the same values of T1 and T2 but only superimposes -canopies covering more than 10% of the population. This is a somewhat better -representation of the data but it still has lots of room for improvement. -The advantage of Canopy clustering is that it is single-pass and fast -enough to iterate runs using different T1, T2 parameters and display -thresholds. 
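The basic algorithm described in the Canopy Clustering introduction (pick a point at random, collect everything within T1, drop everything within T2, repeat until the set is empty) can be sketched in a few lines of Python. This is a minimal in-memory illustration only, not Mahout's implementation: the function names `canopy_clusters` and `membership_weight`, the `seed` parameter, and the choice of Euclidean distance are assumptions made for the example. The weight helper follows the 1/(1+distance) formula quoted for `clusteredPoints`.

```python
import math
import random


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_clusters(points, t1, t2, distance=euclidean, seed=0):
    """Hypothetical helper: return a list of (center, members) canopies.

    A point within T1 of the center joins the canopy; a point within T2
    is also removed from further processing. Points between T2 and T1
    stay in the set, so they may appear in more than one canopy.
    """
    if not t1 > t2:
        raise ValueError("T1 must be greater than T2")
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:
        # Remove one point at random and start a canopy around it.
        center = remaining.pop(rng.randrange(len(remaining)))
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)
            if d >= t2:
                # Not close enough to be dropped; keep for later canopies.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


def membership_weight(center, point, distance=euclidean):
    """Weight of a point relative to a canopy center: 1 / (1 + distance)."""
    return 1.0 / (1.0 + distance(center, point))
```

Calling `canopy_clusters(points, t1, t2)` on a list of coordinate tuples returns one entry per canopy. Because the loop is a single pass per canopy, it is cheap to rerun with different T1 and T2 values.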
- - - http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-commandline.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-commandline.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-commandline.md deleted file mode 100644 index 446faba..0000000 --- a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-commandline.md +++ /dev/null @@ -1,70 +0,0 @@ ---- -layout: default -title: canopy-commandline -theme: - name: retro-mahout ---- - -<a name="canopy-commandline-RunningCanopyClusteringfromtheCommandLine"></a> -# Running Canopy Clustering from the Command Line -Mahout's Canopy clustering can be launched from the same command line -invocation whether you are running on a single machine in stand-alone mode -or on a larger Hadoop cluster. The difference is determined by the -$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to -an operating Hadoop cluster on the target machine then the invocation will -run Canopy on that cluster. If either of the environment variables is -missing then the stand-alone Hadoop configuration will be invoked instead. - - - ./bin/mahout canopy <OPTIONS> - - -* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job -will be generated in $MAHOUT_HOME/core/target/ and its name will contain -the Mahout version number. 
For example, when using the Mahout 0.3 release, the -job will be mahout-core-0.3.job - - -<a name="canopy-commandline-Testingitononesinglemachinew/ocluster"></a> -## Testing it on one single machine w/o cluster - -* Put the data: cp <PATH TO DATA> testdata -* Run the Job: - - ./bin/mahout canopy -i testdata -o output -dm \ -org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2 - - -<a name="canopy-commandline-Runningitonthecluster"></a> -## Running it on the cluster - -* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh -* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata -* Run the Job: - - export HADOOP_HOME=<Hadoop Home Directory> - export HADOOP_CONF_DIR=$HADOOP_HOME/conf - ./bin/mahout canopy -i testdata -o output -dm \ -org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2 - -* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output -to view all outputs. - -<a name="canopy-commandline-Commandlineoptions"></a> -# Command line options - - --input (-i) input Path to job input directory. Must - be a SequenceFile of - VectorWritable - --output (-o) output The directory pathname for output. - --overwrite (-ow) If present, overwrite the output - directory before running job - --distanceMeasure (-dm) distanceMeasure The classname of the - DistanceMeasure. 
Default is - SquaredEuclidean - --t1 (-t1) t1 T1 threshold value - --t2 (-t2) t2 T2 threshold value - --clustering (-cl) If present, run clustering after - the iterations have taken place - --help (-h) Print out help - http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/clustering/cluster-dumper.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/cluster-dumper.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/cluster-dumper.md deleted file mode 100644 index 454734b..0000000 --- a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/cluster-dumper.md +++ /dev/null @@ -1,106 +0,0 @@ ---- -layout: default -title: Cluster Dumper -theme: - name: retro-mahout ---- - -<a name="ClusterDumper-Introduction"></a> -## Cluster Dumper - Introduction - -Clustering tasks in Mahout will output data in the format of a SequenceFile -(Text, Cluster), where the Text is a cluster identifier string. To analyze -this output we need to convert the sequence files to a human-readable -format, and this is achieved using the clusterdump utility. - -<a name="ClusterDumper-Stepsforanalyzingclusteroutputusingclusterdumputility"></a> -## Steps for analyzing cluster output using clusterdump utility - -After you've executed a clustering task (either examples or real-world), -you can run clusterdumper in 2 modes: - - -1. Hadoop Environment -1. Standalone Java Program - - -<a name="ClusterDumper-HadoopEnvironment{anchor:HadoopEnvironment}"></a> -### Hadoop Environment - -If you have set up your HADOOP_HOME environment variable, you can use the -command line utility `mahout` to execute the ClusterDumper on Hadoop. In -this case we won't need to get the output clusters to our local machines. 
-The utility will read the output clusters present in HDFS and output the -human-readable cluster values into our local file system. Say you've just -executed the [synthetic control example ](clustering-of-synthetic-control-data.html) - and want to analyze the output, you can execute the `mahout clusterdumper` utility from the command line. - -#### CLI options: - --help Print out help - --input (-i) input The directory containing Sequence - Files for the Clusters - --output (-o) output The output file. If not specified, - dumps to the console. - --outputFormat (-of) outputFormat The optional output format to write - the results as. Options: TEXT, CSV, or GRAPH_ML - --substring (-b) substring The number of chars of the - asFormatString() to print - --pointsDir (-p) pointsDir The directory containing points - sequence files mapping input vectors - to their cluster. If specified, - then the program will output the - points associated with a cluster - --dictionary (-d) dictionary The dictionary file. - --dictionaryType (-dt) dictionaryType The dictionary file type - (text|sequencefile) - --distanceMeasure (-dm) distanceMeasure The classname of the DistanceMeasure. - Default is SquaredEuclidean. - --numWords (-n) numWords The number of top terms to print - --tempDir tempDir Intermediate output directory - --startPhase startPhase First phase to run - --endPhase endPhase Last phase to run - --evaluate (-e) Run ClusterEvaluator and CDbwEvaluator over the - input. The output will be appended to the rest of - the output at the end. - -### Standalone Java Program - -Run the clusterdump utility as follows as a standalone Java Program through Eclipse. <!-- - if you are using eclipse, setup mahout-utils as a project as specified in [Working with Maven in Eclipse](../../developers/buildingmahout.html). 
--> - To execute ClusterDumper.java, - -* Under mahout-utils, Right-Click on ClusterDumper.java -* Choose Run-As, Run Configurations -* On the left menu, click on Java Application -* On the top-bar click on "New Launch Configuration" -* A new launch should be automatically created with project as - - "mahout-utils" and Main Class as "org.apache.mahout.utils.clustering.ClusterDumper" - -In the arguments tab, specify the below arguments - - - --seqFileDir <MAHOUT_HOME>/examples/output/clusters-10 - --pointsDir <MAHOUT_HOME>/examples/output/clusteredPoints - --output <MAHOUT_HOME>/examples/output/clusteranalyze.txt - replace <MAHOUT_HOME> with the actual path of your $MAHOUT_HOME - -* Hit run to execute the ClusterDumper using Eclipse. Setting breakpoints etc should just work fine. - -Reading the output file - -This will output the clusters into a file called clusteranalyze.txt inside $MAHOUT_HOME/examples/output -Sample data will look like - -CL-0 { n=116 c=[29.922, 30.407, 30.373, 30.094, 29.886, 29.937, 29.751, 30.054, 30.039, 30.126, 29.764, 29.835, 30.503, 29.876, 29.990, 29.605, 29.379, 30.120, 29.882, 30.161, 29.825, 30.074, 30.001, 30.421, 29.867, 29.736, 29.760, 30.192, 30.134, 30.082, 29.962, 29.512, 29.736, 29.594, 29.493, 29.761, 29.183, 29.517, 29.273, 29.161, 29.215, 29.731, 29.154, 29.113, 29.348, 28.981, 29.543, 29.192, 29.479, 29.406, 29.715, 29.344, 29.628, 29.074, 29.347, 29.812, 29.058, 29.177, 29.063, 29.607] - r=[3.463, 3.351, 3.452, 3.438, 3.371, 3.569, 3.253, 3.531, 
3.439, 3.472, -3.402, 3.459, 3.320, 3.260, 3.430, 3.452, 3.320, 3.499, 3.302, 3.511, -3.520, 3.447, 3.516, 3.485, 3.345, 3.178, 3.492, 3.434, 3.619, 3.483, -3.651, 3.833, 3.812, 3.433, 4.133, 3.855, 4.123, 3.999, 4.467, 4.731, -4.539, 4.956, 4.644, 4.382, 4.277, 4.918, 4.784, 4.582, 4.915, 4.607, -4.672, 4.577, 5.035, 5.241, 4.731, 4.688, 4.685, 4.657, 4.912, 4.300] } - -and so on... - -where CL-0 is Cluster 0, n=116 is the number of points observed by this cluster, c = \[29.922 ...\] - is the center of the cluster as a vector, and r = \[3.463 ...\] is -the radius of the cluster as a vector. \ No newline at end of file
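For post-processing, a dumped cluster entry of the shape shown in the sample (`CL-0 { n=116 c=[...] r=[...] }`) can be split into its fields with a small script. The following is a rough sketch, assuming each cluster entry has first been joined onto a single line and follows exactly that layout; `parse_cluster_line` and the dictionary keys are hypothetical names chosen for this example, not part of Mahout.

```python
import re

# Assumed layout: "<id> { n=<count> c=[<floats>] r=[<floats>] }"
CLUSTER_LINE = re.compile(
    r"^(?P<id>\S+)\s*\{\s*n=(?P<n>\d+)\s+"
    r"c=\[(?P<c>[^\]]*)\]\s+r=\[(?P<r>[^\]]*)\]\s*\}$"
)


def _to_floats(text):
    """Convert a comma-separated list of numbers into floats."""
    return [float(item) for item in text.split(",") if item.strip()]


def parse_cluster_line(line):
    """Hypothetical helper: parse one dumped cluster entry into a dict."""
    match = CLUSTER_LINE.match(line.strip())
    if match is None:
        raise ValueError("line does not look like a cluster entry")
    return {
        "id": match.group("id"),            # e.g. "CL-0"
        "n": int(match.group("n")),         # number of points observed
        "center": _to_floats(match.group("c")),
        "radius": _to_floats(match.group("r")),
    }


if __name__ == "__main__":
    sample = "CL-0 { n=116 c=[29.922, 30.407] r=[3.463, 3.351] }"
    print(parse_cluster_line(sample))
```

If the real output wraps the `r=[...]` vector over several lines, as in the sample, the lines of one entry need to be concatenated before they are passed to the parser.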
