Repository: mahout Updated Branches: refs/heads/website 3a724debc -> 9c0314528
http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/sparkbindings/faq.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_priority/sparkbindings/faq.md b/website/old_site_migration/needs_work_priority/sparkbindings/faq.md new file mode 100644 index 0000000..9649e3b --- /dev/null +++ b/website/old_site_migration/needs_work_priority/sparkbindings/faq.md @@ -0,0 +1,52 @@ +--- +layout: default +title: FAQ +theme: + name: retro-mahout +--- + +# FAQ for using Mahout with Spark + +**Q: Mahout Spark shell doesn't start; "ClassNotFound" problems or various classpath problems.** + +**A:** As of this writing, all reported problems with starting the Spark shell in Mahout have come down to +classpath issues in one way or another. + +If you are getting errors that look like method signature mismatches, most probably there is a mismatch between Mahout's Spark dependency +and the Spark version actually installed. (At the time of this writing HEAD depends on Spark 1.1.0, but check mahout/pom.xml.) + +Troubleshooting general classpath issues is fairly straightforward. Since Mahout uses Spark's installation +and its classpath, as reported by Spark itself, for Spark-related dependencies, it is important to make sure +the classpath is sane and is made available to Mahout: + +1. Check that Spark is the correct version (same as in Mahout's poms), is compiled, and that SPARK_HOME is set. +2. Check that Mahout is compiled and MAHOUT_HOME is set. +3. Run `$SPARK_HOME/bin/compute-classpath.sh` and make sure it produces a sane result with no errors. +If it outputs something other than a straightforward classpath string, most likely Spark is not compiled or set up correctly (later Spark versions require +`sbt/sbt assembly` to be run; simply running `sbt/sbt publish-local` is no longer enough). +4. Run `$MAHOUT_HOME/bin/mahout -spark classpath` and check that the path reported in step (3) is included.
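Steps 3 and 4 above amount to checking that every entry of Spark's classpath also shows up in Mahout's. A minimal sketch of that comparison follows. `missing_entries` is a hypothetical helper (it is not part of Mahout or Spark), and the sketch assumes that both commands print a single colon-separated classpath string.

```shell
# Hypothetical helper: print each entry of classpath $1 that is absent from classpath $2.
# The body runs in a subshell, so the IFS and globbing changes do not leak to the caller.
missing_entries() (
  set -f                 # keep wildcard entries such as /opt/spark/jars/* literal
  haystack=":$2:"
  IFS=:
  for entry in $1; do
    [ -n "$entry" ] || continue
    case "$haystack" in
      *":$entry:"*) ;;                      # entry is present, nothing to report
      *) printf '%s\n' "$entry" ;;          # entry is missing
    esac
  done
)

# Usage against a real installation (assumes SPARK_HOME and MAHOUT_HOME are set):
#   spark_cp=$("$SPARK_HOME/bin/compute-classpath.sh")
#   mahout_cp=$("$MAHOUT_HOME/bin/mahout" -spark classpath)
#   missing_entries "$spark_cp" "$mahout_cp"
```

If the helper prints nothing, every Spark classpath entry is visible to Mahout; any printed entry is a candidate cause of the classpath errors described in the answer.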
+ +**Q: I am using the command line Mahout jobs that run on Spark or am writing my own application that uses +Mahout's Spark code. When I run the code on my cluster I get ClassNotFound or signature errors during serialization. +What's wrong?** + +**A:** The Spark artifacts in the Maven ecosystem may not match the exact binary you are running on your cluster. This may +cause class name or version mismatches. In this case you may wish +to build Spark yourself to guarantee that you are running exactly what you are building Mahout against. To do this, follow these steps +in order: + +1. Build Spark with Maven, but **do not** use the "package" target as described on the Spark site. Build with the "clean install" target instead. +Something like: "mvn clean install -Dhadoop1.2.1" or whatever your particular build options are. This will put the jars for Spark +in the local Maven cache. +2. Deploy **your** Spark build to your cluster and test it there. +3. Build Mahout. This will cause Maven to pull the jars for Spark from the local Maven cache and may resolve missing +or mis-identified classes. +4. If you are building your own code, do so against the local builds of Spark and Mahout. + +**Q: The implicit SparkContext 'sc' does not work in the Mahout spark-shell.** + +**A:** In the Mahout spark-shell the SparkContext is called 'sdc', where the 'd' stands for distributed.
+ + + + http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/sparkbindings/home.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_priority/sparkbindings/home.md b/website/old_site_migration/needs_work_priority/sparkbindings/home.md new file mode 100644 index 0000000..5075612 --- /dev/null +++ b/website/old_site_migration/needs_work_priority/sparkbindings/home.md @@ -0,0 +1,101 @@ +--- +layout: default +title: Spark Bindings +theme: + name: retro-mahout +--- + +# Scala & Spark Bindings: +*Bringing algebraic semantics* + +## What is Scala & Spark Bindings? + +In short, Scala & Spark Bindings for Mahout is a Scala DSL and algebraic optimizer for expressions like this (actual formula from **(d)spca**) + + +`\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]` + +bound to in-core and distributed computations (currently, on Apache Spark). + + +Mahout Scala & Spark Bindings expression of the above: + + val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi) + +The main idea is that a scientist writing algebraic expressions need not care about distributed +operation plans and works **entirely on the logical level** just like he or she would do with R. + +Another idea is decoupling the logical expression from the distributed back-end. As more back-ends are added, +this implies **"write once, run everywhere"**. + +The linear algebra side works with scalars, in-core vectors and matrices, and Mahout Distributed +Row Matrices (DRMs). + +The ecosystem of operators is built in R's image, i.e. it follows R naming such as %*%, +colSums, nrow, length operating over vectors or matrices. + +An important part of Spark Bindings is the expression optimizer. It looks at an expression as a whole +and figures out how it can be simplified, and which physical operators should be picked.
For example, +there are currently about 5 different physical operators performing DRM-DRM multiplication, +picked based on matrix geometry, distributed dataset partitioning, orientation etc. +If we also count DRM by in-core combinations, that would be another 4, i.e. 9 total -- all of it for just +the simple x %*% y logical notation. + + + +Please refer to the documentation for details. + +## Status + +This environment addresses mostly R-like Linear Algebra optimizations for +Spark, Flink and H2O. + + +## Documentation + +* Scala and Spark bindings manual: [web](http://apache.github.io/mahout/doc/ScalaSparkBindings.html), [pdf](ScalaSparkBindings.pdf) +* Overview blog on 0.10.x releases: [blog](http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html) + +## Distributed methods and solvers using Bindings + +* In-core ([ssvd]) and Distributed ([dssvd]) Stochastic SVD -- guinea pigs -- see the bindings manual +* In-core ([spca]) and Distributed ([dspca]) Stochastic PCA -- guinea pigs -- see the bindings manual +* Distributed thin QR decomposition ([dqrThin]) -- guinea pig -- see the bindings manual +* [Current list of algorithms](https://mahout.apache.org/users/basics/algorithms.html) + +[ssvd]: https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala +[spca]: https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala +[dssvd]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSSVD.scala +[dspca]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSPCA.scala +[dqrThin]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DQR.scala + + +## Related history of note + +* CLI and Driver for Spark version of item similarity -- 
[MAHOUT-1541](https://issues.apache.org/jira/browse/MAHOUT-1541) +* Command line interface for generalizable Spark pipelines -- [MAHOUT-1569](https://issues.apache.org/jira/browse/MAHOUT-1569) +* Cooccurrence Analysis / Item-based Recommendation -- [MAHOUT-1464](https://issues.apache.org/jira/browse/MAHOUT-1464) +* Spark Bindings -- [MAHOUT-1346](https://issues.apache.org/jira/browse/MAHOUT-1346) +* Scala Bindings -- [MAHOUT-1297](https://issues.apache.org/jira/browse/MAHOUT-1297) +* Interactive Scala & Spark Bindings Shell & Script processor -- [MAHOUT-1489](https://issues.apache.org/jira/browse/MAHOUT-1489) +* OLS tutorial using Mahout shell -- [MAHOUT-1542](https://issues.apache.org/jira/browse/MAHOUT-1542) +* Full abstraction of DRM apis and algorithms from a distributed engine -- [MAHOUT-1529](https://issues.apache.org/jira/browse/MAHOUT-1529) +* Port Naive Bayes -- [MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493) + +## Work in progress +* Text-delimited files for input and output -- [MAHOUT-1568](https://issues.apache.org/jira/browse/MAHOUT-1568) +<!-- * Weighted (Implicit Feedback) ALS -- [MAHOUT-1365](https://issues.apache.org/jira/browse/MAHOUT-1365) --> +<!--* Data frame R-like bindings -- [MAHOUT-1490](https://issues.apache.org/jira/browse/MAHOUT-1490) --> + +* *Your issue here!* + +<!-- ## Stuff wanted: +* Data frame R-like bindings (similarly to linalg bindings) +* Stat R-like bindings (perhaps we can just adapt to commons.math stat) +* **BYODMs:** Bring Your Own Distributed Method on SparkBindings! 
+* In-core jBlas matrix adapter +* In-core GPU matrix adapters --> + + + + \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md b/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md new file mode 100644 index 0000000..3cdb8f7 --- /dev/null +++ b/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md @@ -0,0 +1,199 @@ +--- +layout: default +title: Perceptron and Winnow +theme: + name: retro-mahout +--- +# Playing with Mahout's Spark Shell + +This tutorial will show you how to play with Mahout's scala DSL for linear algebra and its Spark shell. **Please keep in mind that this code is still in a very early experimental stage**. + +_(Edited for 0.10.2)_ + +## Intro + +We'll use an excerpt of a publicly available [dataset about cereals](http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html). The dataset tells the protein, fat, carbohydrate and sugars (in milligrams) contained in a set of cereals, as well as a customer rating for the cereals. Our aim for this example is to fit a linear model which infers the customer rating from the ingredients. 
+ + +Name | protein | fat | carbo | sugars | rating +:-----------------------|:--------|:----|:------|:-------|:--------- +Apple Cinnamon Cheerios | 2 | 2 | 10.5 | 10 | 29.509541 +Cap'n'Crunch | 1 | 2 | 12 | 12 | 18.042851 +Cocoa Puffs | 1 | 1 | 12 | 13 | 22.736446 +Froot Loops | 2 | 1 | 11 | 13 | 32.207582 +Honey Graham Ohs | 1 | 2 | 12 | 11 | 21.871292 +Wheaties Honey Gold | 2 | 1 | 16 | 8 | 36.187559 +Cheerios | 6 | 2 | 17 | 1 | 50.764999 +Clusters | 3 | 2 | 13 | 7 | 40.400208 +Great Grains Pecan | 3 | 3 | 13 | 4 | 45.811716 + + +## Installing Mahout & Spark on your local machine + +We describe how to do a quick toy setup of Spark & Mahout on your local machine, so that you can run this example and play with the shell. + + 1. Download [Apache Spark 1.6.2](http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz) and unpack the archive file + 1. Change to the directory where you unpacked Spark and type ```sbt/sbt assembly``` to build it + 1. Create a directory for Mahout somewhere on your machine, change into it, and check out the master branch of Apache Mahout from GitHub ```git clone https://github.com/apache/mahout mahout``` + 1. Change to the ```mahout``` directory and build Mahout using ```mvn -DskipTests clean install``` + +## Starting Mahout's Spark shell + + 1. Go to the directory where you unpacked Spark and type ```sbin/start-all.sh``` to locally start Spark + 1. Open a browser and point it to [http://localhost:8080/](http://localhost:8080/) to check whether Spark started successfully. Copy the URL of the Spark master at the top of the page (it starts with **spark://**) + 1. Define the following environment variables: <pre class="codehilite">export MAHOUT_HOME=[directory into which you checked out Mahout] +export SPARK_HOME=[directory where you unpacked Spark] +export MASTER=[url of the Spark master] +</pre> + 1. 
Finally, change to the directory where you unpacked Mahout and type ```bin/mahout spark-shell```, +you should see the shell starting and get the prompt ```mahout> ```. Check +[FAQ](http://mahout.apache.org/users/sparkbindings/faq.html) for further troubleshooting. + +## Implementation + +We'll use the shell to interactively play with the data and incrementally implement a simple [linear regression](https://en.wikipedia.org/wiki/Linear_regression) algorithm. Let's first load the dataset. Usually, we wouldn't need Mahout unless we processed a large dataset stored in a distributed filesystem. But for the sake of this example, we'll use our tiny toy dataset and "pretend" it was too big to fit onto a single machine. + +*Note: You can incrementally follow the example by copy-and-pasting the code into your running Mahout shell.* + +Mahout's linear algebra DSL has an abstraction called *DistributedRowMatrix (DRM)* which models a matrix that is partitioned by rows and stored in the memory of a cluster of machines. We use ```dense()``` to create a dense in-memory matrix from our toy dataset and use ```drmParallelize``` to load it into the cluster, "mimicking" a large, partitioned dataset. + +<div class="codehilite"><pre> +val drmData = drmParallelize(dense( + (2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios + (1, 2, 12, 12, 18.042851), // Cap'n'Crunch + (1, 1, 12, 13, 22.736446), // Cocoa Puffs + (2, 1, 11, 13, 32.207582), // Froot Loops + (1, 2, 12, 11, 21.871292), // Honey Graham Ohs + (2, 1, 16, 8, 36.187559), // Wheaties Honey Gold + (6, 2, 17, 1, 50.764999), // Cheerios + (3, 2, 13, 7, 40.400208), // Clusters + (3, 3, 13, 4, 45.811716)), // Great Grains Pecan + numPartitions = 2); +</pre></div> + +Have a look at this matrix. The first four columns represent the ingredients +(our features) and the last column (the rating) is the target variable for +our regression. 
[Linear regression](https://en.wikipedia.org/wiki/Linear_regression) +assumes that the **target variable** `\(\mathbf{y}\)` is generated by the +linear combination of **the feature matrix** `\(\mathbf{X}\)` with the +**parameter vector** `\(\boldsymbol{\beta}\)` plus the + **noise** `\(\boldsymbol{\varepsilon}\)`, summarized in the formula +`\(\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\)`. +Our goal is to find an estimate of the parameter vector +`\(\boldsymbol{\beta}\)` that explains the data very well. + +As a first step, we extract `\(\mathbf{X}\)` and `\(\mathbf{y}\)` from our data matrix. We get *X* by slicing: we take all rows (denoted by ```::```) and the first four columns, which have the ingredients in milligrams as content. Note that the result is again a DRM. The shell will not execute this code yet, it saves the history of operations and defers the execution until we really access a result. **Mahout's DSL automatically optimizes and parallelizes all operations on DRMs and runs them on Apache Spark.** + +<div class="codehilite"><pre> +val drmX = drmData(::, 0 until 4) +</pre></div> + +Next, we extract the target variable vector *y*, the fifth column of the data matrix. We assume this one fits into our driver machine, so we fetch it into memory using ```collect```: + +<div class="codehilite"><pre> +val y = drmData.collect(::, 4) +</pre></div> + +Now we are ready to think about a mathematical way to estimate the parameter vector *β*. A simple textbook approach is [ordinary least squares (OLS)](https://en.wikipedia.org/wiki/Ordinary_least_squares), which minimizes the sum of residual squares between the true target variable and the prediction of the target variable. In OLS, there is even a closed form expression for estimating `\(\boldsymbol{\beta}\)` as +`\(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}\)`. + +The first thing which we compute for this is `\(\mathbf{X}^{\top}\mathbf{X}\)`. 
The code for doing this in Mahout's Scala DSL maps directly to the mathematical formula. The operation ```.t``` transposes a matrix and, analogous to R, ```%*%``` denotes matrix multiplication. + +<div class="codehilite"><pre> +val drmXtX = drmX.t %*% drmX +</pre></div> + +The same is true for computing `\(\mathbf{X}^{\top}\mathbf{y}\)`. We can simply type the math as Scala expressions into the shell. Here, *X* lives in the cluster, while *y* is in the memory of the driver, and the result is a DRM again. +<div class="codehilite"><pre> +val drmXty = drmX.t %*% y +</pre></div> + +We're nearly done. The next step we take is to fetch `\(\mathbf{X}^{\top}\mathbf{X}\)` and +`\(\mathbf{X}^{\top}\mathbf{y}\)` into the memory of our driver machine (we are targeting +feature matrices that are tall and skinny, +so we can assume that `\(\mathbf{X}^{\top}\mathbf{X}\)` is small enough +to fit in). Then, we provide them to an in-memory solver (Mahout provides +an analog to R's ```solve()``` for that) which computes ```beta```, our +OLS estimate of the parameter vector `\(\boldsymbol{\beta}\)`. + +<div class="codehilite"><pre> +val XtX = drmXtX.collect +val Xty = drmXty.collect(::, 0) + +val beta = solve(XtX, Xty) +</pre></div> + +That's it! We have implemented a distributed linear regression algorithm +on Apache Spark. I hope you agree that we didn't have to worry a lot about +parallelization and distributed systems. The goal of Mahout's linear algebra +DSL is to abstract away the ugliness of programming a distributed system +as much as possible, while still retaining decent performance and +scalability. + +We can now check how well our model fits its training data. +First, we multiply the feature matrix `\(\mathbf{X}\)` by our estimate of +`\(\boldsymbol{\beta}\)`. 
Then, we measure the distance (via the L2 norm) between +the target variable `\(\mathbf{y}\)` and the fitted target variable: + +<div class="codehilite"><pre> +val yFitted = (drmX %*% beta).collect(::, 0) +(y - yFitted).norm(2) +</pre></div> + +We hope we have shown that Mahout's shell allows people to write algorithms interactively and incrementally. We have entered a lot of individual commands, one by one, until we got the desired results. We can now refactor a little by wrapping our statements into easy-to-use functions. The definition of functions follows standard Scala syntax. + +We put all the commands for ordinary least squares into a function ```ols```. + +<div class="codehilite"><pre> +def ols(drmX: DrmLike[Int], y: Vector) = + solve(drmX.t %*% drmX, drmX.t %*% y)(::, 0) + +</pre></div> + +Note that the DSL applies an implicit `collect` when coercion rules require an in-core argument. Hence, we can simply +skip explicit `collect`s. + +Next, we define a function ```goodnessOfFit``` that tells how well a model fits the target variable: + +<div class="codehilite"><pre> +def goodnessOfFit(drmX: DrmLike[Int], beta: Vector, y: Vector) = { + val fittedY = (drmX %*% beta).collect(::, 0) + (y - fittedY).norm(2) +} +</pre></div> + +So far we have left out an important aspect of a standard linear regression +model. Usually there is a constant bias term added to the model. Without +that, our model always passes through the origin and we only learn the +right slope. An easy way to add such a bias term to our model is to add a +column of ones to the feature matrix `\(\mathbf{X}\)`. +The corresponding weight in the parameter vector will then be the bias term. + +Here is how we add a bias column: + +<div class="codehilite"><pre> +val drmXwithBiasColumn = drmX cbind 1 +</pre></div> + +Now we can give the newly created DRM ```drmXwithBiasColumn``` to our model fitting method ```ols``` and see how well the resulting model fits the training data with ```goodnessOfFit```. 
You should see a large improvement in the result. + +<div class="codehilite"><pre> +val betaWithBiasTerm = ols(drmXwithBiasColumn, y) +goodnessOfFit(drmXwithBiasColumn, betaWithBiasTerm, y) +</pre></div> + +As a further optimization, we can make use of the DSL's caching functionality. We use ```drmXwithBiasColumn``` repeatedly as input to a computation, so it might be beneficial to cache it in memory. This is achieved by calling ```checkpoint()```. In the end, we remove it from the cache with ```uncache()```: + +<div class="codehilite"><pre> +val cachedDrmX = drmXwithBiasColumn.checkpoint() + +val betaWithBiasTerm = ols(cachedDrmX, y) +val goodness = goodnessOfFit(cachedDrmX, betaWithBiasTerm, y) + +cachedDrmX.uncache() + +goodness +</pre></div> + + +Liked what you saw? Check out Mahout's overview for the [Scala and Spark bindings](https://mahout.apache.org/users/sparkbindings/home.html). \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md b/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md new file mode 100644 index 0000000..9df07da --- /dev/null +++ b/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md @@ -0,0 +1,57 @@ +--- +layout: default +title: Wikipedia XML parser and Naive Bayes Example +theme: + name: retro-mahout +--- +# Wikipedia XML parser and Naive Bayes Classifier Example + +## Introduction +Mahout has an [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1] which will download a recent XML dump of the (entire if desired) [English Wikipedia database](http://dumps.wikimedia.org/enwiki/latest/). 
After running the classification script, you can use the [document classification script](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala) from the Mahout [spark-shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html) to vectorize and classify text from outside of the training and testing corpus using a model built on the Wikipedia dataset. + +You can run this script to build and test a Naive Bayes classifier for option (1) 10 arbitrary countries or option (2) 2 countries (United States and United Kingdom). + +## Overview + +To run the example simply execute the `$MAHOUT_HOME/examples/bin/classify-wikipedia.sh` script. + +By default the script is set to run on a medium sized Wikipedia XML dump. To run on the full set (the entire English Wikipedia) you can change the download by commenting out line 78, and uncommenting line 80 of [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1]. However this is not recommended unless you have the resources to do so. *Be sure to clean your work directory when changing datasets (option 3).* + +The step by step process for creating a Naive Bayes classifier for the Wikipedia XML dump is very similar to that for [creating a 20 Newsgroups Classifier](http://mahout.apache.org/users/classification/twenty-newsgroups.html) [4]. The only difference is that instead of running `$mahout seqdirectory` on the unzipped 20 Newsgroups file, you'll run `$mahout seqwiki` on the unzipped Wikipedia XML dump. + + $ mahout seqwiki + +The above command launches `WikipediaToSequenceFile.java` which accepts a text file of categories [3] and starts an MR job to parse each document in the XML file. This process will seek to extract documents with a Wikipedia category tag which (exactly, if the `--exactMatchOnly` option is set) matches a line in the category file. 
If no match is found and the `-all` option is set, the document will be dumped into an "unknown" category. The documents will then be written out as a `<Text,Text>` sequence file of the form (K: /category/document_title, V: document). + +There are 3 different example category files available in the /examples/src/test/resources +directory: country.txt, country10.txt and country2.txt. You can edit these categories to extract a different corpus from the Wikipedia dataset. + +The CLI options for `seqwiki` are as follows: + + --input (-i) input pathname String + --output (-o) the output pathname String + --categories (-c) the file containing the Wikipedia categories + --exactMatchOnly (-e) if set, then the Wikipedia category must match + exactly instead of simply containing the category string + --all (-all) if set select all categories + --removeLabels (-rl) if set, remove [[Category:labels]] from document text after extracting label. + + +After `seqwiki`, the script runs `seq2sparse`, `split`, `trainnb` and `testnb` as in the [step by step 20newsgroups example](http://mahout.apache.org/users/classification/twenty-newsgroups.html). When all of the jobs have finished, a confusion matrix will be displayed.
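As an illustration of how the `seqwiki` options listed above combine, here is a hedged sketch that assembles an invocation from the documented short flags only (`-i`, `-o`, `-c`, `-e`). The paths, the file name of the dump and the `run` dry-run wrapper are hypothetical and are not taken from `classify-wikipedia.sh`; with `DRY_RUN=1` the command is printed rather than executed.

```shell
# Hypothetical locations; substitute your own work directory and file names.
WORK_DIR="${WORK_DIR:-/tmp/mahout-work-wiki}"
WIKI_XML="$WORK_DIR/wikixml/enwiki-dump.xml"
CATEGORIES="$WORK_DIR/country10.txt"
OUTPUT="$WORK_DIR/wikipediainput"

# Illustrative wrapper: print the command when DRY_RUN is non-empty, otherwise run it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

DRY_RUN=1
# -e keeps only documents whose category exactly matches a line in the category file.
run mahout seqwiki -i "$WIKI_XML" -o "$OUTPUT" -c "$CATEGORIES" -e
```

Unsetting `DRY_RUN` runs the command for real, which assumes `mahout` is on the `PATH`. Dropping `-e` falls back to substring matching on the category string, as described in the option list.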
+ +# Resources + +[1] [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) + +[2] [Document classification script for the Mahout Spark Shell](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala) + +[3] [Example category file](https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt) + +[4] [Step by step instructions for building a Naive Bayes classifier for 20newsgroups from the command line](http://mahout.apache.org/users/classification/twenty-newsgroups.html) + +[5] [Mahout MapReduce Naive Bayes](http://mahout.apache.org/users/classification/bayesian.html) + +[6] [Mahout Spark Naive Bayes](http://mahout.apache.org/users/algorithms/spark-naive-bayes.html) + +[7] [Mahout Scala Spark and H2O Bindings](http://mahout.apache.org/users/sparkbindings/home.html) + http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/general/books-tutorials-and-talks.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/general/books-tutorials-and-talks.md b/website/old_site_migration/old_site/general/books-tutorials-and-talks.md new file mode 100644 index 0000000..bbbdeef --- /dev/null +++ b/website/old_site_migration/old_site/general/books-tutorials-and-talks.md @@ -0,0 +1,121 @@ +--- +layout: default +title: Books Tutorials and Talks +theme: + name: retro-mahout +--- +# Intro + +This page is a place for info about talks (past and upcoming), tutorials, articles, books, slides, PDFs, discussions, etc. about Mahout. No endorsements are implied or +given. + +# Books + +## Mahout specific + + * <a href="http://www.weatheringthroughtechdays.com/2016/02/mahout-samsara-book-is-out.html">Apache Mahout: Beyond MapReduce</a> by Dmitriy Lyubimov and Andrew Palumbo published Feb 2016. Covers new features in Mahout "Samsara" releases (0.10, 0.11+).
+ * <a href="http://www.packtpub.com/apache-mahout-cookbook/book">Apache Mahout cookbook</a>- Book by Piero Giacomelli published Dec 2013 by Packtpub. + * <a href="http://www.manning.com/owen/">Mahout in Action</a> - Book by Sean Owen, Robin Anil, Ted Dunning and Ellen Friedman published Oct 2011 by Manning Publications. + * <a href="http://www.manning.com/ingersoll/">Taming Text</a> - By Grant Ingersoll and Tom Morton, published by Manning Publications. Will have some Mahout coverage, but by no means as complete as Mahout in Action. + +## Engineering oriented machine learning books + + * <a href="http://www.amazon.com/Collective-Intelligence-Action-Satnam-Alag/dp/1933988312/ref=pd_bbs_sr_3?ie=UTF8&s=books&qid=1214545249&sr=1-3">Collective Intelligence in Action</a> + * <a href="http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325/ref=pd_bbs_sr_1/104-1017533-9408723?ie=UTF8&s=books&qid=1214593516&sr=1-1">Programming Collective Intelligence</a> + * <a href="http://www.amazon.com/Algorithms-Intelligent-Web-Haralambos-Marmanis/dp/1933988665/ref=sr_1_1?s=books&ie=UTF8&qid=1298005918&sr=1-1">Algorithms of the Intelligent Web</a> + +## Scientific background + + * <a href="http://www.cs.waikato.ac.nz/~ml/weka/book.html">Data Mining: Practical Machine Learning Tools and Techniques</a> + * <a href="http://www-nlp.stanford.edu/IR-book/">Introduction to Information Retrieval</a> + * <a href="http://www.amazon.com/Machine-Learning-Mcgraw-Hill-International-Edit/dp/0071154671/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1214593709&sr=8-1">Machine Learning</a> + * <a href="http://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738/ref=pd_bbs_sr_2?ie=UTF8&s=books&qid=1214593709&sr=8-2">Pattern Recognition and Machine Learning (Information Science and Statistics) </a> + +# News, Articles and Tutorials + + * [Mahout 0.10.x: first Mahout release as a programming 
environment](http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html) + * [Comparing Document Classification Functions of Lucene and Mahout](http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html) + * <a href="http://www.ibm.com/developerworks/java/library/j-mahout-scaling/">Apache Mahout: Scalable Machine Learning for Everyone</a> + * <a href="http://emmaespina.wordpress.com/2011/04/26/ham-spam-and-elephants-or-how-to-build-a-spam-filter-server-with-mahout/">How to build a spam filter server with Mahout</a> - Applying classification on a live server - April 2011 + * <a href="http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/">Deploying a massively scalable recommender system with Apache Mahout</a> - Blogpost of Sebastian Schelter in April 2011 + * <a href="http://www.redmonk.com/cote/2010/11/04/makeall013/">Apache Mahout & the commoditization of machine learning </a> - Podcast interview with Grant Ingersoll at ApacheCon 2010 + * <a href="http://isabel-drost.de/hadoop/slides/devoxx.pdf">Apache Mahout 0.4 mit neuen Algorithmen</a> - published after the 0.4 release by heise Open/ Developer, November 2010 + * <a href="http://www.infoq.com/news/2009/04/mahout">Mahout on InfoQ</a> - Interview with Grant Ingersoll on InfoQ + * <a href="http://www.cloudera.com/blog/2009/04/21/hadoop-uk-user-group-meeting/">Mahout in the Cloudera weblog</a> - published after the Hadoop user group UK. + * <a href="http://blog.athico.com/2008/08/machine-learning-and-apache-mahout.html">Mahout in the Drools weblog</a> - Michael Neale published an article on Mahout in the drools weblog + * <a href="https://www.ibm.com/developerworks/java/library/j-mahout/index.html">Introducing Apache Mahout</a> - Grant Ingersoll - Intro to Apache Mahout focused on clustering, classification and collaborative filtering. 
Japanese translation available at: [http://www.ibm.com/developerworks/jp/java/library/j-mahout/](http://www.ibm.com/developerworks/jp/java/library/j-mahout/) + * <a href="http://philippeadjiman.com/blog/2009/11/11/flexible-collaborative-filtering-in-java-with-mahout-taste/">Flexible Collaborative Filtering In Java With Mahout Taste</a> - Philippe Adjiman - Quick starting guide on how to use the collaborative filtering package of Mahout (called Taste) to quickly and flexibly create, test and compare tailored recommendation engines. + * <a href="http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/">Integrating Mahout with Lucene and Solr</a> Three part series on ways to integrate Mahout with Lucene and Solr + * <a href="https://www.youtube.com/watch?v=yD40rVKUwPI">Mahout Item Recommender Tutorial using Java and Eclipse</a> - YouTube video tutorial by Steve Cook + + +# Coursework/Lectures + + * <a href="http://videolectures.net/mlss05us_chicago/">http://videolectures.net/mlss05us_chicago/</a> + * <a href="http://videolectures.net/mlas06_pittsburgh/">http://videolectures.net/mlas06_pittsburgh/</a> + * <a href="http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1">Stanford Lectures on Machine Learning by Andrew Ng</a> + * <a href="https://docs.google.com/open?id=0ByhGL2_SCeitMDQ3OTczNjItM2ZjYi00ZDg5LWE0MzItZGQxODQ5NzkzYjNj">CMU@Qatar Introduction to Mahout lecture</a> + + +# Talks + +In reverse chronological order, so that most recent talks are at the top + + * [Distributed Machine Learning with Apache Mahout] Suneel Marthi at Apache Big Data North America, Vancouver, Canada, May 11, 2016 and MapR Washington DC Big Data Everywhere, Tysons, VA, June 2 2016 + * [Declarative Machine Learning with the Samsara DSL](http://www.slideshare.net/FlinkForward/sebastian-schelter-distributed-machine-learing-with-the-samsara-dsl) Sebastian Schelter at Flink Forward Conference, Berlin 
Germany, October 2015. + * [Bringing Algebraic Semantics to Mahout](http://www.slideshare.net/sscdotopen/bringing-algebraic-semantics-to-mahout) Sebastian Schelter at HPI Infolunch, Potsdam Germany, May 2014 + * Mahout Spark and Scala bindings: Bringing Algebraic Semantics ([slides](http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings)/[video](http://youtu.be/h9dpmvNW1Dw)) - Dmitriy Lyubimov at Mahout Meetup, April 17, 2014. + * Mahout Future Directions - Ted Dunning, Suneel Marthi, Sebastian Schelter at Hadoop Summit Europe 2014, Amsterdam, April 3, 2014 + * Building Recommender Systems for Mere-Mortals - Sebastian Schelter at Researchgate Developer Day, Berlin, November 2013 + * Recommendations with Apache Mahout - Sebastian Schelter at IBM Almaden Research Center, San Jose, September 2013 + * <a href="http://de.slideshare.net/sscdotopen/next-directions-in-mahouts-recommenders">Next Directions in Mahout's Recommenders</a> - Sebastian Schelter at Bay Area Mahout Meetup, Redwood City, August 2013 + * <a href="http://de.slideshare.net/sscdotopen/new-directions-in-mahouts-recommenders">New Directions in Mahout's Recommenders</a> - Sebastian Schelter at Recommender Systems Get Together Berlin, April 2013 + * <a href="http://www.slideshare.net/VaradMeru/introduction-to-mahout-and-machine-learning">Introduction to Mahout and Machine Learning</a> - Slides by Varad Meru, Software Development Engineer at Orzota. July 27th, 2013. 
+ * <a href="http://de.slideshare.net/sscdotopen/introduction-to-collaborative-filtering-with-apache-mahout">An Introduction to Collaborative Filtering with Apache Mahout</a> - Sebastian Schelter at Recommender Systems Challenge Workshop in conjunction with ACM RecSys 2012, Dublin, September 2012 + * <a href="https://github.com/ManuelB/facebook-recommender-demo/raw/master/docs/Talk-BedCon-Berlin-2012.pdf">How to build a recommender system based on Mahout and JavaEE</a> - Slides by Manuel Blechschmidt at Berlin Expert Days March, 2012. + * <a href="http://lanyrd.com/2011/apachecon-north-america/skdtb/">Apache Mahout for intelligent data analysis</a> - Slides from Isabel Drost at Apache Con NA November, 2011. + * <a href="http://lanyrd.com/2011/apachecon-north-america/skdrk/">Dr. Mahout: Analyzing clinical data using scalable and distributed computing</a> - Slides from Shannon Quinn at Apache Con NA November, 2011. + * Frank Scholten at Berlin Buzzwords on June 7, 2011. + * Introduction to Collaborative Filtering using Mahout (updated) - Talk by Sean Owen at the London Hadoop User Group on April 14, 2011. + * <a href="http://www.meetup.com/LA-HUG/pages/Video_from_March_16th_LA-HUG_Ted_Dunning_Mahout">Cool Tricks with Classifiers</a> - Talk by Ted Dunning at the Los Angeles HUG talking about Mahout classifiers on March 16, 2011. + * First Mahout Hackathon, Berlin, March 2011 + * <a href="http://blog.jteam.nl/2011/01/13/announcement-lucene-nl-mahout-meetup-with-isabel-drost-feb-7/">Mahout meetup</a> - there were two talks at the Apache Mahout meetup at JTeam in Amsterdam, February 2011. <a href="http://isabel-drost.de/hadoop/slides/jteam.pdf">intro slides</a> + * <a href="http://www.fosdem.org/2011/schedule/event/mahoutclustering.html">Mahout clustering </a> - Talk on Mahout clustering at data dev room FOSDEM, February 2011. + * Scaling Data Analysis with Apache Mahout - talk on Mahout at O'Reilly Strata, February 2011. 
+ * <a href="http://www.slideshare.net/jaganadhg/mahout-tutorial-fossmeet-nitc">Practical Machine Learning</a> - Slides from Biju B and Jaganadh G, FOSSMEET-NITC, Calicut, India, February 2011. + * <a href="http://www.javaedge.com/jedge/pdf/Mahout.pdf">Mahout at AlphaCSPs The Edge 2010 (pdf)</a> - <a href="http://www.slideshare.net/arikogan/mahouts-presentation-at-alphacsps-the-edge-2010">slideshare</a> - Slides from <a href="http://il.linkedin.com/in/arielkogan">Ariel Kogan</a> AlphaCSP's The Edge, December 2010. + * <a href="http://isabel-drost.de/hadoop/slides/devoxx.pdf">Intelligent data analysis with Apache Mahout</a> - Slides from Isabel Drost, Devoxx Antwerp, November 2010. + * <a href="http://isabel-drost.de/hadoop/slides/codebits.pdf">Apache Mahout introduction</a> - Slides from Isabel Drost, codebits Lisbon, November 2010. + * <a href="http://isabel-drost.de/hadoop/slides/apachecon_2010.pdf">Apache Mahout - Making Data Analysis Easy</a> - Slides from Isabel Drost, Apache Con US Atlanta, November 2010. + * <a href="http://www.slideshare.net/jaganadhg/bck9">Practical Machine Learning</a> - Slides from Jaganadh G, BarCamp Kerala 9, November 2010. + * <a href="http://www.slideshare.net/tdunning/sdforum-11042010">Mahout and its new classification framework</a> - Slides from Ted Dunning, SDForum, November 2010. + * <a href="http://www.slideshare.net/sscdotopen/mahoutcf">Distributed Item-based Collaborative Filtering with Apache Mahout</a> - Slides from Sebastian Schelter, Hadoop Get Together Berlin, October 2010. + * <a href="http://isabel-drost.de/hadoop/slides/HMM.pdf">Hidden Markov Models for Mahout</a> - Slides from Max Heimel, Hadoop Get Together Berlin, October 2010. + * <a href="http://www.slideshare.net/robinanil/oscon-apache-mahout-mammoth-scale-machine-learning">Apache Mahout Mammoth Scale Machine Learning </a> - Slides from Robin Anil, OSCON 2010. 
+ * <a href="http://slidesha.re/9LxOIu">Intro to Apache Mahout</a> - Slides from Grant Ingersoll, RTP Semantic Web Group. + * <a href="http://www.slideshare.net/ydn/3-biometric-hadoopsummit2010">Case study: Biometric Databases and Hadoop </a> - Slides from Jason Trost, Hadoop Summit 2010. + * <a href="http://www.slideshare.net/hadoopusergroup/mail-antispam?from=ss_embed">Spam Fighting at Yahoo</a> + * <a href="http://www.slideshare.net/hadoopusergroup/bixo-hug-talk?from=ss_embed">Web Mining with Ken Krugler</a> + * <a href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/ingersoll_bbuzz2010.pdf">Keynote on intelligent search</a> - Slides from Grant Ingersoll, Berlin Buzzwords, June 2010. + * <a href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/owen_bbuzz2010.pdf">Simple co-occurrence-based recommendation on Hadoop</a> - Slides from Sean Owen, Berlin Buzzwords, June, 2010. + * <a href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/scholten_bbuzz2010.odp">Introduction to Collaborative Filtering using Mahout</a> - Slides from Frank Scholten, Berlin Buzzwords, June, 2010. + * <a href="http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/">Introduction to Scalable Machine Learning</a> - Slides and demos from Grant Ingersoll, March, 2010. + * Mahout @ India Hadoop Summit - Slides from a 1 hour talk on Mahout at the India Hadoop Summit by Robin Anil, February 2010. + * <a href="http://www.isabel-drost.de/hadoop/slides/opensourceexpo09.pdf">Mahout in 10 minutes</a> - Slides from a 10 min intro to Mahout at the Map Reduce tutorial by David Zülke at Open Source Expo in Karlsruhe, Isabel Drost, November 2009. + * <a href="http://www.isabel-drost.de/hadoop/slides/apacheconus2009.pdf">Mahout at Apache Con US </a> - Slides from a talk on "Going from raw data to information" (with Mahout) at Apache Con US in Oakland, Isabel Drost, November 2009. 
+ * <a href="http://www.isabel-drost.de/hadoop/slides/froscon2009.pdf">Mahout at FrOSCon</a> - Slides from a talk on Mahout at FrOSCon in Sankt Augustin, Isabel Drost, August 2009. + * <a href="http://www.isabel-drost.de/hadoop/slides/dai.pdf">Mahout at DAI group TU Berlin</a> - Slides from a talk on Mahout at the DAI Laboratories TU Berlin, Isabel Drost, July 2009. + * <a href="http://www.isabel-drost.de/hadoop/slides/ulf.pdf">Mahout at Machine Learning Group TU Berlin</a> - Slides from a talk on Hadoop with some detour to Mahout at the Machine +   Learning Group of Prof. Dr. Klaus-Robert Müller at TU Berlin, Isabel Drost, June 2009. + * <a href="http://www.isabel-drost.de/hadoop/slides/google.pdf">Mahout at Google Zürich</a> - Slides from a Google tech-talk on the past, present and future of Mahout, Isabel Drost, May 2009. + * <a href="http://static.last.fm/johan/huguk-20090414/isabel_drost-introducing_apache_mahout.pdf">Hadoop user group UK</a> - Slides from a talk on April 14, 2009 at the Hadoop User Group UK in London, Isabel Drost, April 2009. + * <a href="http://cwiki.apache.org/confluence/download/attachments/88410/SDForum.pdf">BI Over Petabytes: Meet Apache Mahout</a> - Slides from a talk by Jeff Eastman on April 21, 2009 at the Bay Area SD Forum Business Intelligence SIG meeting at SAP in Palo Alto, CA. + * Lucene Meetup and Apache Barcamp in Amsterdam, March 2009. + * BarCampRDU - (Raleigh) on Aug. 2, 2008 + * Introducing Mahout: Apache Machine Learning - Committer Grant Ingersoll gave a gentle introduction to Mahout and Machine Learning at ApacheCon in November (3rd through 7th) in New Orleans, USA. + * Mahout: Scaling Machine Learning - Introduction to Mahout and machine learning at FrOSCon in Sankt Augustin/Germany, Isabel Drost, August 2008. 
(<a href="http://cwiki.apache.org/confluence/download/attachments/88410/froscon.pdf">slides</a>) + * Mahout: Scalable Machine Learning - An introduction to Mahout and machine learning at the first German Hadoop gathering in newthinking store/ Berlin, Isabel Drost, July 2008. + * Apache Mahout: Industrial Strength Machine Learning - Committer Jeff Eastman gave an introduction to Mahout at Yahoo\!, May 2008 + * <a href="http://people.apache.org/~berndf/openexpode08-lucene-talk.pdf">Apache Lucene - Mach's wie Google</a> - Bernd Fondermann presented an overview of the Apache Lucene project, +   including Mahout at Open Source Expo 2008 in Karlsruhe, May 2008. + * Apache Mahout: Bringing Machine Learning to Industrial Strength - Committer Isabel Drost gave a Fast Feather introduction to the new project Mahout at Apache Con EU April, 2008 \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/general/mahout-wiki.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/general/mahout-wiki.md b/website/old_site_migration/old_site/general/mahout-wiki.md new file mode 100644 index 0000000..2df16d4 --- /dev/null +++ b/website/old_site_migration/old_site/general/mahout-wiki.md @@ -0,0 +1,202 @@ +--- +layout: default +title: Mahout Wiki +theme: + name: retro-mahout +--- + +On the fence about including this in new site. lol at "new Apache TLP" + +Apache Mahout is a new Apache TLP project to create scalable, machine +learning algorithms under the Apache license. + +{toc:style=disc|minlevel=2} + +<a name="MahoutWiki-General"></a> +## General +[Overview](overview.html) + -- Mahout? What's that supposed to be? + +[Quickstart](quickstart.html) + -- learn how to quickly set up Apache Mahout for your project. + +[FAQ](faq.html) + -- Frequent questions encountered on the mailing lists. 
+ +[Developer Resources](developer-resources.html) + -- overview of the Mahout development infrastructure. + +[How To Contribute](how-to-contribute.html) + -- get involved with the Mahout community. + +[How To Become A Committer](how-to-become-a-committer.html) + -- become a member of the Mahout development community. + +[Hadoop](http://hadoop.apache.org) + -- several of our implementations depend on Hadoop. + +[Machine Learning Open Source Software](http://mloss.org/software/) + -- other projects implementing Open Source Machine Learning libraries. + +[Mahout -- The name, history and its pronunciation](mahoutname.html) + +<a name="MahoutWiki-Community"></a> +## Community + +[Who we are](who-we-are.html) + -- who are the developers behind Apache Mahout? + +[Books, Tutorials, Talks, Articles, News, Background Reading, etc. on Mahout](books-tutorials-and-talks.html) + +[Issue Tracker](issue-tracker.html) + -- see what features people are working on, submit patches and file bugs. + +[Source Code (SVN)](https://svn.apache.org/repos/asf/mahout/) + -- [Fisheye](http://fisheye6.atlassian.com/browse/mahout) + -- download the Mahout source code from svn. + +[Mailing lists and IRC](mailing-lists,-irc-and-archives.html) + -- links to our mailing lists, IRC channel and archived design and +algorithm discussions; maybe your question was answered there already? + +[Version Control](version-control.html) + -- where we track our code. + +[Powered By Mahout](powered-by-mahout.html) + -- who is using Mahout in production? + +[Professional Support](professional-support.html) + -- who is offering professional support for Mahout? + +[Mahout and Google Summer of Code](gsoc.html) + -- All you need to know about Mahout and GSoC. + + +[Glossary of commonly used terms and abbreviations](glossary.html) + +<a name="MahoutWiki-Installation/Setup"></a> +## Installation/Setup + +[System Requirements](system-requirements.html) + -- what do you need to run Mahout? 
+ +[Quickstart](quickstart.html) + -- get started with Mahout, run the examples and get pointers to further +resources. + +[Downloads](downloads.html) + -- a list of Mahout releases. + +[Download and installation](buildingmahout.html) + -- build Mahout from the sources. + +[Mahout on Amazon's EC2 Service](mahout-on-amazon-ec2.html) + -- run Mahout on Amazon's EC2. + +[Mahout on Amazon's EMR](mahout-on-elastic-mapreduce.html) + -- Run Mahout on Amazon's Elastic Map Reduce + +[Integrating Mahout into an Application](mahoutintegration.html) + -- integrate Mahout's capabilities in your application. + +<a name="MahoutWiki-Examples"></a> +## Examples + +1. [ASF Email Examples](asfemail.html) + -- Examples of recommenders, clustering and classification all using a +public domain collection of 7 million emails. + +<a name="MahoutWiki-ImplementationBackground"></a> +## Implementation Background + +<a name="MahoutWiki-RequirementsandDesign"></a> +### Requirements and Design + +[Matrix and Vector Needs](matrix-and-vector-needs.html) + -- requirements for Mahout vectors. + +[Collection(De-)Serialization](collection(de-)serialization.html) + +<a name="MahoutWiki-CollectionsandAlgorithms"></a> +### Collections and Algorithms + +Learn more about [mahout-collections](mahout-collections.html) +, containers for efficient storage of primitive-type data and open hash +tables. + +Learn more about the [Algorithms](algorithms.html) + discussed and employed by Mahout. + +Learn more about the [Mahout recommender implementation](recommender-documentation.html) +. + +<a name="MahoutWiki-Utilities"></a> +### Utilities + +This section describes tools that might be useful for working with Mahout. + +[Converting Content](converting-content.html) + -- Mahout has some utilities for converting content such as logs to +formats more amenable for consumption by Mahout. +[Creating Vectors](creating-vectors.html) + -- Mahout's algorithms operate on vectors. 
Learn more on how to generate +these from raw data. +[Viewing Result](viewing-result.html) + -- How to visualize the result of your trained algorithms. + +<a name="MahoutWiki-Data"></a> +### Data + +[Collections](collections.html) + -- To try out and test Mahout's algorithms you need training data. We are +always looking for new training data collections. + +<a name="MahoutWiki-Benchmarks"></a> +### Benchmarks + +[Mahout Benchmarks](mahout-benchmarks.html) + +<a name="MahoutWiki-Committer'sResources"></a> +## Committer's Resources + +* [Testing](testing.html) + -- Information on test plans and ideas for testing + +<a name="MahoutWiki-ProjectResources"></a> +### Project Resources + +* [Dealing with Third Party Dependencies not in Maven](thirdparty-dependencies.html) +* [How To Update The Website](how-to-update-the-website.html) +* [Patch Check List](patch-check-list.html) +* [How To Release](http://cwiki.apache.org/confluence/display/MAHOUT/How+to+release) +* [Release Planning](release-planning.html) +* [Sonar Code Quality Analysis](https://analysis.apache.org/dashboard/index/63921) + +<a name="MahoutWiki-AdditionalResources"></a> +### Additional Resources + +* [Apache Machine Status](http://monitoring.apache.org/status/) + \- Check to see if SVN, other resources are available. +* [Committer's FAQ](http://www.apache.org/dev/committers.html) +* [Apache Dev](http://www.apache.org/dev/) + + +<a name="MahoutWiki-HowToEditThisWiki"></a> +## How To Edit This Wiki + +How to edit this Wiki + +This Wiki is a collaborative site, anyone can contribute and share: + +* Create an account by clicking the "Login" link at the top of any page, +and picking a username and password. +* Edit any page by pressing Edit at the top of the page + +There are some conventions used on the Mahout wiki: + + * {noformat}+*TODO:*+{noformat} (+*TODO:*+ ) is used to denote sections +that definitely need to be cleaned up. 
+ * {noformat}+*Mahout_(version)*+{noformat} (+*Mahout_0.2*+) is used to +draw attention to which version of Mahout a feature was (or will be) added +to Mahout. + http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/general/professional-support.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/general/professional-support.md b/website/old_site_migration/old_site/general/professional-support.md new file mode 100644 index 0000000..45d798c --- /dev/null +++ b/website/old_site_migration/old_site/general/professional-support.md @@ -0,0 +1,41 @@ +--- +layout: default +title: Professional Support +theme: + name: retro-mahout +--- + +NOTE: on the fence about including this in new site. + +<a name="ProfessionalSupport-ProfessionalsupportforMahout"></a> +# Professional support for Mahout + +Add yourself or your company if you are offering support for Mahout +users. Please keep lists in alphabetical order. An entry here +is not an endorsement by the Apache Software Foundation nor any of its +committers. 
+ + +<a name="ProfessionalSupport-Peopleandcompaniesforhire"></a> +## People and companies for hire + +| Name | Contact details | Notes | +|------|-----------------|-------| +| Accenture | [email protected] | [Consulting services in big data analytics](http://accenture.com) | +| Boston Predictive Analytics | [email protected] | [http://tutorteddy.com/site/free_statistics_help.php](http://tutorteddy.com/site/free_statistics_help.php) | +| Frank Scholten | [email protected] | | +| GridLine | [http://www.gridline.nl/contact](http://www.gridline.nl/contact) | Specialised in search and thesauri | +| Jagdish Nomula | [email protected] | ML, Search, Algorithms, Java [http://www.kosmex.com](http://www.kosmex.com) | +| LucidWorks | [http://www.lucidworks.com](http://www.lucidworks.com) | Big data platform including Mahout as a service for clustering, classification and more | +| Sematext International | [http://sematext.com/](http://sematext.com/) | | +| Ted Dunning | [email protected] | Full commercial support | +| Winterwell | [email protected] | Business/maths concept development & algorithms [http://winterwell.com](http://winterwell.com) | + +<a name="ProfessionalSupport-Talksandpresentations"></a> +## Talks and presentations + +| Name | Contact details | Notes | +|------|-----------------|-------| +| Andrew Musselman | [email protected] | ["Building a Recommender with Apache Mahout on Amazon Elastic-MapReduce"](https://blogs.aws.amazon.com/bigdata/post/Tx1TDK3HHBD4EZL/Building-a-Recommender-with-Apache-Mahout-on-Amazon-Elastic-MapReduce-EMR) | +| Frank Scholten | [email protected] | Mahout/Taste [http://blog.jteam.nl/author/frank/](http://blog.jteam.nl/author/frank/) | +| Isabel Drost-Fromm | [email protected] | If travel and accommodation costs are covered scheduling a talk is a lot easier. 
| http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/general/reference-reading.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/general/reference-reading.md b/website/old_site_migration/old_site/general/reference-reading.md new file mode 100644 index 0000000..ba969ac --- /dev/null +++ b/website/old_site_migration/old_site/general/reference-reading.md @@ -0,0 +1,71 @@ +--- +layout: default +title: Reference Reading +theme: + name: retro-mahout +--- + +# Reference Reading + +Here we provide references to books and courses about data analysis in general, which might also be helpful in the context of Mahout. + +<a name="ReferenceReading-GeneralBackgroundMaterials"></a> +## General Background Materials + +Don't be overwhelmed by all the maths; you can do a lot in Mahout with some +basic knowledge. The books will help you understand your +data better, and ask better questions both of Mahout's APIs, and also of +the Mahout community. And unlike learning some particular software tool, +these are skills that will remain useful decades later. + + * [Gilbert Strang](http://www-math.mit.edu/~gs) +'s [Introduction to Linear Algebra](http://math.mit.edu/linearalgebra/). His [lectures](http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/) are also [available online](http://web.mit.edu/18.06/www/) + and are strongly recommended. + * [Mathematical Tools for Applied Multivariate Analysis](http://www.amazon.com/Mathematical-Tools-Applied-Multivariate-Analysis/dp/0121609553/ref=sr_1_1?ie=UTF8&qid=1299602805&sr=8-1) by J. Douglas +Carroll. 
+ * [Stanford Machine Learning online courseware](http://www.stanford.edu/class/cs229/) + * [MIT Machine Learning online courseware](http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/) has [lecture notes](http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/) online. + * As a pre-requisite to probability and statistics, you'll need [basic calculus](http://en.wikipedia.org/wiki/Calculus). A maths for scientists text might be useful here such as 'Mathematics for Engineers and Scientists', Alan Jeffrey, Chapman & Hall/CRC. ([openlibrary](http://openlibrary.org/books/OL3305993M/Mathematics_for_engineers_and_scientists)) + * One of the best writers in the probability/statistics world is Sheldon Ross. Try [A First Course in Probability (8th Edition)](http://www.pearsonhighered.com/educator/product/First-Course-in-Probability-A/9780136033134.page) and then move on to his [Introduction to Probability Models](http://www.amazon.com/Introduction-Probability-Models-Sixth-Sheldon/dp/0125984707) + +Some good introductory alternatives here are: + + * [Khan Academy](http://www.khanacademy.org/) -- videos on stats, probability, linear algebra + * [Probability and Statistics (7th Edition)](http://www.amazon.com/Probability-Statistics-Engineering-Sciences-InfoTrac/dp/0534399339), Jay L. Devore, Chapman. + * [Probability and Statistical Inference (7th Edition)](http://www.amazon.com/Probability-Statistical-Inference-Robert-Hogg/dp/0132546086), Hogg and Tanis, Pearson. + +Once you have a grasp of the basics then there are a slew of great texts that you might consult: + + * [Statistical Inference](http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126), Casella and Berger, Duxbury/Thomson Learning. + * [Introduction to Bayesian Statistics](http://www.amazon.com/Introduction-Bayesian-Statistics-William-Bolstad/dp/0471270202), William H. Bolstad, Wiley. 
+ * [Understanding Computational Bayesian Statistics](http://www.amazon.com/Understanding-Computational-Bayesian-Statistics-Wiley/dp/0470046090), Bolstad + * [Bayesian Data Analysis, Gelman et al.](http://www.stat.columbia.edu/~gelman/book/) + + +## For statistics related to machine learning, these are particularly helpful: + + * [Pattern Recognition and Machine Learning by Chris Bishop](http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm) + * [Elements of Statistical Learning](http://www-stat.stanford.edu/~tibs/ElemStatLearn/) by Trevor Hastie, Robert Tibshirani, Jerome Friedman + * [http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm](http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm) + + +## For matrix computations/decomposition/factorization etc.: + + * Peter V. O'Neil [Introduction to Linear Algebra](http://www.amazon.com/Introduction-Linear-Algebra-Theory-Applications/dp/053400606X), great book for beginners (with some knowledge in calculus). It is not comprehensive, but it will be a good place to start, and the author starts by explaining the concepts with regards to vector spaces, which I found to be a more natural way of explaining. + * David S. Watkins [Fundamentals of Matrix Computations](http://www.amazon.com/Fundamentals-Matrix-Computations-Applied-Mathematics/dp/0470528338/) + * [Matrix Computations](http://www.amazon.com/Computations-Hopkins-Studies-Mathematical-Sciences/dp/0801854148/ref=sr_1_2?s=books&ie=UTF8&qid=1394307676&sr=1-2&keywords=golub+van+loan) is the classic text for numerical linear algebra. Can't go wrong with it - great for researchers. + * Nick Trefethen's [Numerical Linear Algebra](http://people.maths.ox.ac.uk/trefethen/books.html). It's a bit more approachable for practitioners. Many chapters on SVD, there are even chapters on Lanczos. + + +## Books specifically on R: + +* Learning about R is a difficult thing. 
The best introduction is in MASS [http://www.stats.ox.ac.uk/pub/MASS4/](http://www.stats.ox.ac.uk/pub/MASS4/) +* [R Tutor](http://www.r-tutor.com/r-introduction) +* [Manual](http://cran.r-project.org/doc/manuals/R-intro.pdf) +* [R Course](http://faculty.washington.edu/tlumley/Rcourse/) + +In addition, you should see how to plot data well: + +* [Trellis plotting](http://www.statmethods.net/advgraphs/trellis.html) +* [ggplot2](http://had.co.nz/ggplot2/) + http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md b/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md new file mode 100644 index 0000000..39f4bfd --- /dev/null +++ b/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md @@ -0,0 +1,88 @@ +--- +layout: default +title: Matrix and Vector Needs +theme: + name: retro-mahout +--- + +<a name="MatrixandVectorNeeds-Intro"></a> +# Intro + +Most ML algorithms require the ability to represent multidimensional data +concisely and to be able to easily perform common operations on that data. +MAHOUT-6 introduced Vector and Matrix datatypes of arbitrary cardinality, +along with a set of common operations on their instances. Vectors and +matrices are provided with sparse and dense implementations that are memory +resident and are suitable for manipulating intermediate results within +mapper, combiner and reducer implementations. They are not intended for +applications requiring vectors or matrices that exceed the size of a single +JVM, though such applications might be able to utilize them within a larger +organizing framework. 
+ +<a name="MatrixandVectorNeeds-Background"></a> +## Background + +See [http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser](http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser) + +<a name="MatrixandVectorNeeds-Vectors"></a> +## Vectors + +Mahout supports a Vector interface that defines the following operations over all implementation classes: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross. The class DenseVector implements vectors as a double[] + that is storage and access efficient. The class SparseVector implements +vectors as a HashMap<Integer, Double> that is surprisingly fast and +efficient. For sparse vectors, the size() method returns the current number +of elements whereas the cardinality() method returns the number of +dimensions it holds. An additional VectorView class allows views of an +underlying vector to be specified by the viewPart() method. See the +JavaDocs for more complete definitions. + +<a name="MatrixandVectorNeeds-Matrices"></a> +## Matrices + +Mahout also supports a Matrix interface that defines a similar set of operations over all implementation classes: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum. The class DenseMatrix implements matrices as a double[][] +that is storage and access efficient. The class SparseRowMatrix +implements matrices as a Vector[] holding the rows of the matrix in a +SparseVector, and the symmetric class SparseColumnMatrix implements +matrices as a Vector[] holding the columns in a SparseVector. Each of these +classes can quickly produce a given row or column, respectively. A fourth +class SparseMatrix, uses a HashMap<Integer, Vector> which is also a +SparseVector. 
For sparse matrices, the size() method returns an int\[2\] +containing the actual row and column sizes whereas the cardinality() method +returns an int\[2\] with the number of dimensions of each. An additional +MatrixView class allows views of an underlying matrix to be specified by +the viewPart() method. See the JavaDocs for more complete definitions. + +The Matrix interface does not currently provide invert or determinant +methods, though these are desirable. It is arguable that the +implementations of SparseRowMatrix and SparseColumnMatrix ought to use the +HashMap<Integer, Vector> implementations and that SparseMatrix should +instead use a HashMap<Integer, HashMap<Integer, Double>>. Other forms of +sparse matrices can also be envisioned that support different storage and +access characteristics. Because the arguments of assignColumn and assignRow +operations accept all forms of Vector, it is possible to construct +instances of sparse matrices containing dense rows or columns. See the +JavaDocs for more complete definitions. + +For applications like PageRank/TextRank, iterative approaches to calculate +eigenvectors would also be useful. Batching of row/column operations would +also be useful, such as perhaps assignRow or assignColumn accepting +UnaryFunction and BinaryFunction arguments. + + +<a name="MatrixandVectorNeeds-Ideas"></a> +## Ideas + +As Vector and Matrix implementations are currently memory-resident, very +large instances greater than available memory are not supported. An +extended set of implementations that use HBase (BigTable) in Hadoop to +represent their instances would facilitate applications requiring such +large collections. 
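
The difference between size() and cardinality() described in the Vectors section can be illustrated with a minimal sketch. This is not Mahout's implementation: `SparseVectorSketch` and all of its methods are hypothetical names, written in Python for brevity, and the choice not to store explicit zeros is an assumption made for the illustration. It only mimics the hash-map-backed design described above.

```python
class SparseVectorSketch:
    """Hypothetical hash-map-backed sparse vector (illustration only, not Mahout's API)."""

    def __init__(self, cardinality):
        self._cardinality = cardinality  # number of dimensions the vector spans
        self._cells = {}                 # index -> value, only for stored entries

    def _check(self, index):
        if not 0 <= index < self._cardinality:
            raise IndexError(index)

    def set(self, index, value):
        self._check(index)
        if value == 0.0:
            self._cells.pop(index, None)  # assumption: zeros are never stored
        else:
            self._cells[index] = value

    def get(self, index):
        self._check(index)
        return self._cells.get(index, 0.0)  # missing entries read as zero

    def size(self):
        return len(self._cells)  # number of elements currently stored

    def cardinality(self):
        return self._cardinality  # number of dimensions

    def dot(self, other):
        if other.cardinality() != self._cardinality:
            raise ValueError("cardinality mismatch")
        small, large = self._cells, other._cells
        if len(small) > len(large):
            small, large = large, small
        # Only indices stored in both vectors can contribute to the product.
        return sum(v * large.get(i, 0.0) for i, v in small.items())


v = SparseVectorSketch(1_000_000)
v.set(3, 2.0)
v.set(999_999, 4.0)

w = SparseVectorSketch(1_000_000)
w.set(3, 0.5)
```

Here `v` spans a million dimensions but stores only two entries, so memory use and the cost of `dot` follow the number of stored elements rather than the cardinality.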
+See [MAHOUT-6](https://issues.apache.org/jira/browse/MAHOUT-6) +See [Hama](http://wiki.apache.org/hadoop/Hama) + + +<a name="MatrixandVectorNeeds-References"></a> +## References + +Have a look at the old parallel computing libraries like [ScaLAPACK](http://www.netlib.org/scalapack/) +, others http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/users/basics/principal-components-analysis.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/users/basics/principal-components-analysis.md b/website/old_site_migration/old_site/users/basics/principal-components-analysis.md new file mode 100644 index 0000000..5a9383f --- /dev/null +++ b/website/old_site_migration/old_site/users/basics/principal-components-analysis.md @@ -0,0 +1,29 @@ +--- +layout: default +title: Principal Components Analysis +theme: + name: retro-mahout +--- + +<a name="PrincipalComponentsAnalysis-PrincipalComponentsAnalysis"></a> +# Principal Components Analysis + +PCA is used to reduce a high-dimensional data set to lower dimensions. PCA +can be used to identify patterns in data and to express the data in a lower- +dimensional space. That way, similarities and differences can be +highlighted. It is mostly used in face recognition and image compression. +There are several flaws one has to be aware of when working with PCA: + +* Linearity assumption - data is assumed to be linear combinations of some +basis. There exist non-linear methods such as kernel PCA that alleviate +that problem. +* Principal components are assumed to be orthogonal. ICA tries to cope with +this limitation. +* Mean and covariance are assumed to be statistically important. +* Large variances are assumed to have important dynamics. 
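
As a rough illustration of the dimension reduction described above, the following sketch projects 2-D points onto their first principal component. It is not Mahout code: `pca_2d` is a hypothetical helper, written in plain Python, and it relies on the closed-form eigendecomposition of a symmetric 2x2 covariance matrix, which only works for two-dimensional input with at least two distinct points.

```python
import math


def pca_2d(points):
    """Return (direction, scores, explained) for the first principal component.

    Hypothetical helper for illustration; assumes len(points) >= 2 and that
    the points are not all identical (otherwise the total variance is zero).
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    largest = half_trace + root

    # Matching eigenvector; (largest - syy, sxy) solves the eigen equations
    # whenever sxy is non-zero, otherwise the axes are already principal.
    if sxy != 0:
        vx, vy = largest - syy, sxy
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)

    # One coordinate per point: the projection onto the principal direction.
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    explained = largest / (sxx + syy)  # share of total variance retained
    return direction, scores, explained


direction, scores, explained = pca_2d([(0, 0), (1, 2), (2, 4), (3, 6)])
```

The sample points lie exactly on the line y = 2x, so the principal direction has slope 2 and the single retained coordinate keeps all of the variance up to rounding. Data that is not close to a line would lose more, which is the linearity assumption listed above.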
+ +<a name="PrincipalComponentsAnalysis-Parallelizationstrategy"></a> +## Parallelization strategy + +<a name="PrincipalComponentsAnalysis-Designofpackages"></a> +## Design of packages http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/users/basics/svd---singular-value-decomposition.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/users/basics/svd---singular-value-decomposition.md b/website/old_site_migration/old_site/users/basics/svd---singular-value-decomposition.md new file mode 100644 index 0000000..4a28934 --- /dev/null +++ b/website/old_site_migration/old_site/users/basics/svd---singular-value-decomposition.md @@ -0,0 +1,52 @@ +--- +layout: default +title: SVD - Singular Value Decomposition +theme: + name: retro-mahout +--- + +{excerpt}Singular Value Decomposition is a form of product decomposition of +a matrix in which a rectangular matrix A is decomposed into a product U s +V' where U and V are orthonormal and s is a diagonal matrix.{excerpt} The +values of A can be real or complex, but the real case dominates +applications in machine learning. The most prominent properties of the SVD +are: + + * The decomposition of any real matrix has only real values + * The singular values are unique; U and V are unique up to column permutations and sign changes when the singular values are distinct + * If you take only the largest n values of s and set the rest to zero, +you have the best rank-n approximation of A in the least squares sense. This allows SVD +to be used very effectively in least squares regression and makes partial +SVD useful. + * The SVD can be computed accurately for singular or nearly singular +matrices. For a matrix of rank n, only the first n singular values will be +non-zero. This allows SVD to be used for solution of singular linear +systems. The columns of V corresponding to zero singular values +span the null space of A; the corresponding columns of U span the null space of A'. 
+ * The partial SVD of very large matrices can be computed very quickly +using stochastic decompositions. See http://arxiv.org/abs/0909.4061v1 for +details. Gradient descent can also be used to compute partial SVDs and is +very useful where some values of the matrix being decomposed are not known. + +In collaborative filtering and text retrieval, it is common to compute the +partial decomposition of the user x item interaction matrix or the document +x term matrix. This allows the projection of users and items (or documents +and terms) into a common vector space representation that is often referred +to as the latent semantic representation. This process is sometimes called +Latent Semantic Analysis and has been very effective in the analysis of the +Netflix dataset. + +Dimension Reduction in Mahout: + * https://cwiki.apache.org/MAHOUT/dimensional-reduction.html + + See Also: + * http://www.kwon3d.com/theory/jkinem/svd.html + * http://en.wikipedia.org/wiki/Singular_value_decomposition + * http://en.wikipedia.org/wiki/Latent_semantic_analysis + * http://en.wikipedia.org/wiki/Netflix_Prize + * +http://www.amazon.com/Understanding-Complex-Datasets-Decompositions-Knowledge/dp/1584888326 + * http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm + * +http://www.quora.com/What-s-the-best-parallelized-sparse-SVD-code-publicly-available + * [understanding Mahout Hadoop SVD thread](http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[email protected]%3E) http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/users/basics/system-requirements.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/users/basics/system-requirements.md b/website/old_site_migration/old_site/users/basics/system-requirements.md new file mode 100644 index 0000000..6bef40d --- /dev/null +++ b/website/old_site_migration/old_site/users/basics/system-requirements.md @@ 
-0,0 +1,20 @@ +--- +layout: default +title: System Requirements +theme: + name: retro-mahout +--- + + +# System Requirements + +* Java 1.6.x or greater. +* Maven 3.x to build the source code. + +CPU, disk and memory requirements depend on the many choices made in +implementing your application with Mahout (document size, number of +documents, and number of hits retrieved, to name a few). + +Several of the Mahout algorithms are implemented to work on Hadoop +clusters. Unless stated otherwise, those implementations work with +Hadoop 0.20.0 or greater. http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/users/basics/tf-idf---term-frequency-inverse-document-frequency.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/users/basics/tf-idf---term-frequency-inverse-document-frequency.md b/website/old_site_migration/old_site/users/basics/tf-idf---term-frequency-inverse-document-frequency.md new file mode 100644 index 0000000..f807609 --- /dev/null +++ b/website/old_site_migration/old_site/users/basics/tf-idf---term-frequency-inverse-document-frequency.md @@ -0,0 +1,21 @@ +--- +layout: default +title: TF-IDF - Term Frequency-Inverse Document Frequency +theme: + name: retro-mahout +--- + +{excerpt}TF-IDF is a weight measure often used in information retrieval and text +mining. This weight is a statistical measure used to evaluate how important +a word is to a document in a collection or corpus. The importance increases +proportionally to the number of times a word appears in the document but is +offset by the frequency of the word in the corpus.{excerpt} In other words, +if a term appears often in a document but also appears often in the +corpus as a whole, it will get a lower score. Examples of this +are "the", "and" and "it", but depending on your source material there may be +other words that are very common to the subject matter. 
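The weighting described above can be sketched in a few lines of pure Python. This is one common variant, assumed here for illustration only: term frequency is the raw count divided by the document length, and inverse document frequency is `log(N / df)`. It is not necessarily the formula Mahout uses, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    documents: a list of token lists, e.g. [["the", "cat"], ["the", "dog"]].
    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = [
        ["the", "cat", "sat"],
        ["the", "dog", "barked"],
        ["the", "cat", "purred"],
    ]
    for weights in tf_idf(docs):
        print(weights)
```

In this sample, "the" occurs in every document, so its inverse document frequency is `log(1) = 0` and its weight vanishes, while a term that occurs in only one document, such as "sat", scores highest within that document.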
+ + + See Also: + * http://en.wikipedia.org/wiki/Tf%E2%80%93idf + * http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html
