[1/9] mahout git commit: WEBSITE Triage of Old Site Migration

rawkintrevo Sat, 29 Apr 2017 16:25:24 -0700

Repository: mahout
Updated Branches:
  refs/heads/website 3a724debc -> 9c0314528



http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/sparkbindings/faq.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/sparkbindings/faq.md 
b/website/old_site_migration/needs_work_priority/sparkbindings/faq.md
new file mode 100644
index 0000000..9649e3b
--- /dev/null
+++ b/website/old_site_migration/needs_work_priority/sparkbindings/faq.md
@@ -0,0 +1,52 @@
+---
+layout: default
+title: FAQ
+theme:
+    name: retro-mahout
+---
+
+# FAQ for using Mahout with Spark
+
+**Q: Mahout Spark shell doesn't start; "ClassNotFound" problems or various 
classpath problems.**
+
+**A:** At the time of this writing, all reported problems starting the Spark shell in Mahout have revolved around classpath issues in one way or another. 
+
+If you are getting errors that look like method signature mismatches, you most probably have a mismatch between Mahout's Spark dependency and the Spark version actually installed. (At the time of this writing HEAD depends on Spark 1.1.0, but check mahout/pom.xml.)
+
+Troubleshooting general classpath issues is pretty straightforward. Since 
Mahout is using Spark's installation 
+and its classpath as reported by Spark itself for Spark-related dependencies, 
it is important to make sure 
+the classpath is sane and is made available to Mahout:
+
+1. Check that Spark is the correct version (the same as in Mahout's poms), is compiled, and that SPARK_HOME is set.
+2. Check that Mahout is compiled and MAHOUT_HOME is set.
+3. Run `$SPARK_HOME/bin/compute-classpath.sh` and make sure it produces a sane result with no errors. 
+If it outputs something other than a straightforward classpath string, most likely Spark is not compiled or set up correctly (later Spark versions require `sbt/sbt assembly` to be run; simply running `sbt/sbt publish-local` is no longer enough).
+4. Run `$MAHOUT_HOME/bin/mahout -spark classpath` and check that the path reported in step (3) is included.
+
+**Q: I am using the command line Mahout jobs that run on Spark or am writing 
my own application that uses 
+Mahout's Spark code. When I run the code on my cluster I get ClassNotFound or 
signature errors during serialization. 
+What's wrong?**
+ 
+**A:** The Spark artifacts in the maven ecosystem may not match the exact 
binary you are running on your cluster. This may 
+cause class name or version mismatches. In this case you may wish 
+to build Spark yourself to guarantee that you are running exactly what you are 
building Mahout against. To do this follow these steps
+in order:
+
+1. Build Spark with maven, but **do not** use the "package" target as 
described on the Spark site. Build with the "clean install" target instead. 
+Something like: "mvn clean install -Dhadoop1.2.1" or whatever your particular 
build options are. This will put the jars for Spark
+in the local maven cache.
+2. Deploy **your** Spark build to your cluster and test it there.
+3. Build Mahout. This will cause maven to pull the jars for Spark from the 
local maven cache and may resolve missing 
+or mis-identified classes.
+4. If you are building your own code, do so against the local builds of Spark and Mahout.
+
+**Q: The implicit SparkContext 'sc' does not work in the Mahout spark-shell.**
+
+**A:** In the Mahout spark-shell the SparkContext is called 'sdc', where the 
'd' stands for distributed. 
+
+
+
+
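
Step 4 of the first FAQ answer above asks whether the classpath reported by Spark is included in the one Mahout reports. A minimal sketch of that comparison in POSIX shell follows. The function name `classpath_includes` is hypothetical (it is not part of Mahout or Spark), and the sketch assumes both classpaths are plain colon-separated strings.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Mahout or Spark.
# Succeeds if every entry of classpath $1 also appears in classpath $2.
# The body runs in a subshell so the IFS and globbing changes stay local.
classpath_includes() (
  set -f        # entries such as /opt/spark/lib/* must not be glob-expanded
  IFS=:         # split the first classpath on colons
  for entry in $1; do
    [ -n "$entry" ] || continue          # skip empty entries (e.g. "a::b")
    case ":$2:" in
      *":$entry:"*) ;;                   # entry found, keep going
      *) echo "missing from second classpath: $entry" >&2; exit 1 ;;
    esac
  done
)
```

On a working installation the two arguments would presumably be the output of the commands named in steps 3 and 4, for example `classpath_includes "$("$SPARK_HOME"/bin/compute-classpath.sh)" "$("$MAHOUT_HOME"/bin/mahout -spark classpath)"`. If either command prints anything besides the classpath string, that output would need to be filtered first.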

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/sparkbindings/home.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/sparkbindings/home.md 
b/website/old_site_migration/needs_work_priority/sparkbindings/home.md
new file mode 100644
index 0000000..5075612
--- /dev/null
+++ b/website/old_site_migration/needs_work_priority/sparkbindings/home.md
@@ -0,0 +1,101 @@
+---
+layout: default
+title: Spark Bindings
+theme:
+    name: retro-mahout
+---
+
+# Scala & Spark Bindings:
+*Bringing algebraic semantics*
+
+## What is Scala & Spark Bindings?
+
+In short, Scala & Spark Bindings for Mahout is a Scala DSL and algebraic optimizer for expressions like this one (an actual formula from **(d)spca**)
+        
+
+`\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]`
+
+bound to in-core and distributed computations (currently, on Apache Spark).
+
+
+Mahout Scala & Spark Bindings expression of the above:
+
+        val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
+
+The main idea is that a scientist writing algebraic expressions need not care about distributed operation plans and works **entirely on the logical level**, just as he or she would with R.
+
+Another idea is decoupling the logical expression from the distributed back-end. As more back-ends are added, this implies **"write once, run everywhere"**.
+
+The linear algebra side works with scalars, in-core vectors and matrices, and 
Mahout Distributed
+Row Matrices (DRMs).
+
+The ecosystem of operators is built in R's image, i.e. it follows R naming such as %*%, colSums, nrow and length, operating over vectors or matrices. 
+
+An important part of Spark Bindings is the expression optimizer. It looks at an expression as a whole and figures out how it can be simplified and which physical operators should be picked. For example, there are currently about 5 different physical operators performing DRM-DRM multiplication, picked based on matrix geometry, distributed dataset partitioning, orientation etc. 
+If we count in the DRM by in-core combinations, that would be another 4, i.e. 9 in total, all of it for just the simple x %*% y logical notation.
+
+
+
+Please refer to the documentation for details.
+
+## Status
+
+This environment addresses mostly R-like linear algebra optimizations for Spark, Flink and H2O.
+
+
+## Documentation
+
+* Scala and Spark bindings manual: 
[web](http://apache.github.io/mahout/doc/ScalaSparkBindings.html), 
[pdf](ScalaSparkBindings.pdf)
+* Overview blog on 0.10.x releases: 
[blog](http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html)
+
+## Distributed methods and solvers using Bindings
+
+* In-core ([ssvd]) and Distributed ([dssvd]) Stochastic SVD -- guinea pigs -- 
see the bindings manual
+* In-core ([spca]) and Distributed ([dspca]) Stochastic PCA -- guinea pigs -- 
see the bindings manual
+* Distributed thin QR decomposition ([dqrThin]) -- guinea pig -- see the 
bindings manual 
+* [Current list of 
algorithms](https://mahout.apache.org/users/basics/algorithms.html)
+
+[ssvd]: 
https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala
+[spca]: 
https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala
+[dssvd]: 
https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSSVD.scala
+[dspca]: 
https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSPCA.scala
+[dqrThin]: 
https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DQR.scala
+
+
+## Related history of note 
+
+* CLI and Driver for Spark version of item similarity -- 
[MAHOUT-1541](https://issues.apache.org/jira/browse/MAHOUT-1541)
+* Command line interface for generalizable Spark pipelines -- 
[MAHOUT-1569](https://issues.apache.org/jira/browse/MAHOUT-1569)
+* Cooccurrence Analysis / Item-based Recommendation -- 
[MAHOUT-1464](https://issues.apache.org/jira/browse/MAHOUT-1464)
+* Spark Bindings -- 
[MAHOUT-1346](https://issues.apache.org/jira/browse/MAHOUT-1346)
+* Scala Bindings -- 
[MAHOUT-1297](https://issues.apache.org/jira/browse/MAHOUT-1297)
+* Interactive Scala & Spark Bindings Shell & Script processor -- 
[MAHOUT-1489](https://issues.apache.org/jira/browse/MAHOUT-1489)
+* OLS tutorial using Mahout shell -- 
[MAHOUT-1542](https://issues.apache.org/jira/browse/MAHOUT-1542)
+* Full abstraction of DRM apis and algorithms from a distributed engine -- 
[MAHOUT-1529](https://issues.apache.org/jira/browse/MAHOUT-1529)
+* Port Naive Bayes -- 
[MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)
+
+## Work in progress 
+* Text-delimited files for input and output -- 
[MAHOUT-1568](https://issues.apache.org/jira/browse/MAHOUT-1568)
+<!-- * Weighted (Implicit Feedback) ALS -- 
[MAHOUT-1365](https://issues.apache.org/jira/browse/MAHOUT-1365) -->
+<!--* Data frame R-like bindings -- 
[MAHOUT-1490](https://issues.apache.org/jira/browse/MAHOUT-1490) -->
+
+* *Your issue here!*
+
+<!-- ## Stuff wanted: 
+* Data frame R-like bindings (similarly to linalg bindings)
+* Stat R-like bindings (perhaps we can just adapt to commons.math stat)
+* **BYODMs:** Bring Your Own Distributed Method on SparkBindings! 
+* In-core jBlas matrix adapter
+* In-core GPU matrix adapters -->
+
+
+
+  
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md
 
b/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md
new file mode 100644
index 0000000..3cdb8f7
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md
@@ -0,0 +1,199 @@
+---
+layout: default
+title: Perceptron and Winnow
+theme:
+    name: retro-mahout
+---
+# Playing with Mahout's Spark Shell
+
+This tutorial will show you how to play with Mahout's Scala DSL for linear algebra and its Spark shell. **Please keep in mind that this code is still in a very early experimental stage**.
+
+_(Edited for 0.10.2)_
+
+## Intro
+
+We'll use an excerpt of a publicly available [dataset about 
cereals](http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html). The dataset 
tells the protein, fat, carbohydrate and sugars (in milligrams) contained in a 
set of cereals, as well as a customer rating for the cereals. Our aim for this 
example is to fit a linear model which infers the customer rating from the 
ingredients.
+
+
+Name                    | protein | fat | carbo | sugars | rating
+:-----------------------|:--------|:----|:------|:-------|:---------
+Apple Cinnamon Cheerios | 2       | 2   | 10.5  | 10     | 29.509541
+Cap'n'Crunch            | 1       | 2   | 12    | 12     | 18.042851
+Cocoa Puffs             | 1       | 1   | 12    | 13     | 22.736446
+Froot Loops             | 2       | 1   | 11    | 13     | 32.207582
+Honey Graham Ohs        | 1       | 2   | 12    | 11     | 21.871292
+Wheaties Honey Gold     | 2       | 1   | 16    | 8      | 36.187559
+Cheerios                | 6       | 2   | 17    | 1      | 50.764999
+Clusters                | 3       | 2   | 13    | 7      | 40.400208
+Great Grains Pecan      | 3       | 3   | 13    | 4      | 45.811716
+
+
+## Installing Mahout & Spark on your local machine
+
+We describe how to do a quick toy setup of Spark & Mahout on your local 
machine, so that you can run this example and play with the shell. 
+
+ 1. Download [Apache Spark 
1.6.2](http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz) and 
unpack the archive file
+ 1. Change to the directory where you unpacked Spark and type ```sbt/sbt 
assembly``` to build it
+ 1. Create a directory for Mahout somewhere on your machine, change to there 
and checkout the master branch of Apache Mahout from GitHub ```git clone 
https://github.com/apache/mahout mahout```
+ 1. Change to the ```mahout``` directory and build mahout using ```mvn 
-DskipTests clean install```
+ 
+## Starting Mahout's Spark shell
+
+ 1. Go to the directory where you unpacked Spark and type ```sbin/start-all.sh``` to locally start Spark
+ 1. Open a browser, point it to 
[http://localhost:8080/](http://localhost:8080/) to check whether Spark 
successfully started. Copy the url of the spark master at the top of the page 
(it starts with **spark://**)
+ 1. Define the following environment variables: <pre class="codehilite">export 
MAHOUT_HOME=[directory into which you checked out Mahout]
+export SPARK_HOME=[directory where you unpacked Spark]
+export MASTER=[url of the Spark master]
+</pre>
+ 1. Finally, change to the directory where you unpacked Mahout and type ```bin/mahout spark-shell```. You should see the shell starting and get the prompt ```mahout> ```. Check the [FAQ](http://mahout.apache.org/users/sparkbindings/faq.html) for further troubleshooting.
+
+## Implementation
+
+We'll use the shell to interactively play with the data and incrementally 
implement a simple [linear 
regression](https://en.wikipedia.org/wiki/Linear_regression) algorithm. Let's 
first load the dataset. Usually, we wouldn't need Mahout unless we processed a 
large dataset stored in a distributed filesystem. But for the sake of this 
example, we'll use our tiny toy dataset and "pretend" it was too big to fit 
onto a single machine.
+
+*Note: You can incrementally follow the example by copy-and-pasting the code 
into your running Mahout shell.*
+
+Mahout's linear algebra DSL has an abstraction called *DistributedRowMatrix 
(DRM)* which models a matrix that is partitioned by rows and stored in the 
memory of a cluster of machines. We use ```dense()``` to create a dense 
in-memory matrix from our toy dataset and use ```drmParallelize``` to load it 
into the cluster, "mimicking" a large, partitioned dataset.
+
+<div class="codehilite"><pre>
+val drmData = drmParallelize(dense(
+  (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
+  (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
+  (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
+  (2, 1, 11,   13, 32.207582),  // Froot Loops
+  (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
+  (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
+  (6, 2, 17,   1,  50.764999),  // Cheerios
+  (3, 2, 13,   7,  40.400208),  // Clusters
+  (3, 3, 13,   4,  45.811716)), // Great Grains Pecan
+  numPartitions = 2);
+</pre></div>
+
+Have a look at this matrix. The first four columns represent the ingredients 
+(our features) and the last column (the rating) is the target variable for 
+our regression. [Linear 
regression](https://en.wikipedia.org/wiki/Linear_regression) 
+assumes that the **target variable** `\(\mathbf{y}\)` is generated by the 
+linear combination of **the feature matrix** `\(\mathbf{X}\)` with the 
+**parameter vector** `\(\boldsymbol{\beta}\)` plus the
+ **noise** `\(\boldsymbol{\varepsilon}\)`, summarized in the formula 
+`\(\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\)`. 
+Our goal is to find an estimate of the parameter vector 
+`\(\boldsymbol{\beta}\)` that explains the data very well.
+
+As a first step, we extract `\(\mathbf{X}\)` and `\(\mathbf{y}\)` from our data matrix. We get *X* by slicing: we take all rows (denoted by ```::```) and the first four columns, which have the ingredients in milligrams as content. Note that the result is again a DRM. The shell will not execute this code yet; it saves the history of operations and defers the execution until we really access a result. **Mahout's DSL automatically optimizes and parallelizes all operations on DRMs and runs them on Apache Spark.**
+
+<div class="codehilite"><pre>
+val drmX = drmData(::, 0 until 4)
+</pre></div>
+
+Next, we extract the target variable vector *y*, the fifth column of the data 
matrix. We assume this one fits into our driver machine, so we fetch it into 
memory using ```collect```:
+
+<div class="codehilite"><pre>
+val y = drmData.collect(::, 4)
+</pre></div>
+
+Now we are ready to think about a mathematical way to estimate the parameter vector *β*. A simple textbook approach is [ordinary least squares (OLS)](https://en.wikipedia.org/wiki/Ordinary_least_squares), which minimizes the sum of squared residuals between the true target variable and the prediction of the target variable. In OLS, there is even a closed-form expression for estimating `\(\boldsymbol{\beta}\)` as `\(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}\)`.
+
+The first thing we compute for this is `\(\mathbf{X}^{\top}\mathbf{X}\)`. The code for doing this in Mahout's Scala DSL maps directly to the mathematical formula. The operation ```.t()``` transposes a matrix and, analogous to R, ```%*%``` denotes matrix multiplication.
+
+<div class="codehilite"><pre>
+val drmXtX = drmX.t %*% drmX
+</pre></div>
+
+The same is true for computing `\(\mathbf{X}^{\top}\mathbf{y}\)`. We can simply type the math in Scala expressions into the shell. Here, *X* lives in the cluster, while *y* is in the memory of the driver, and the result is a DRM again.
+<div class="codehilite"><pre>
+val drmXty = drmX.t %*% y
+</pre></div>
+
+We're nearly done. The next step we take is to fetch `\(\mathbf{X}^{\top}\mathbf{X}\)` and `\(\mathbf{X}^{\top}\mathbf{y}\)` into the memory of our driver machine (we are targeting feature matrices that are tall and skinny, so we can assume that `\(\mathbf{X}^{\top}\mathbf{X}\)` is small enough to fit in). Then, we provide them to an in-memory solver (Mahout provides an analog to R's ```solve()``` for that) which computes ```beta```, our OLS estimate of the parameter vector `\(\boldsymbol{\beta}\)`.
+
+<div class="codehilite"><pre>
+val XtX = drmXtX.collect
+val Xty = drmXty.collect(::, 0)
+
+val beta = solve(XtX, Xty)
+</pre></div>
+
+That's it! We have implemented a distributed linear regression algorithm 
+on Apache Spark. I hope you agree that we didn't have to worry a lot about 
+parallelization and distributed systems. The goal of Mahout's linear algebra 
+DSL is to abstract away the ugliness of programming a distributed system 
+as much as possible, while still retaining decent performance and 
+scalability.
+
+We can now check how well our model fits its training data. 
+First, we multiply the feature matrix `\(\mathbf{X}\)` by our estimate of 
+`\(\boldsymbol{\beta}\)`. Then, we look at the difference (via L2-norm) of 
+the target variable `\(\mathbf{y}\)` to the fitted target variable:
+
+<div class="codehilite"><pre>
+val yFitted = (drmX %*% beta).collect(::, 0)
+(y - yFitted).norm(2)
+</pre></div>
+
+We hope we have shown that Mahout's shell allows people to write algorithms interactively and incrementally. We have entered a lot of individual commands, one by one, until we got the desired results. We can now refactor a little by wrapping our statements into easy-to-use functions. The definition of functions follows standard Scala syntax. 
+
+We put all the commands for ordinary least squares into a function ```ols```. 
+
+<div class="codehilite"><pre>
+def ols(drmX: DrmLike[Int], y: Vector) = 
+  solve(drmX.t %*% drmX, drmX.t %*% y)(::, 0)
+
+</pre></div>
+
+Note that the DSL declares an implicit `collect` if coercion rules require an in-core argument. Hence, we can simply skip explicit `collect`s. 
+
+Next, we define a function ```goodnessOfFit``` that tells how well a model 
fits the target variable:
+
+<div class="codehilite"><pre>
+def goodnessOfFit(drmX: DrmLike[Int], beta: Vector, y: Vector) = {
+  val fittedY = (drmX %*% beta).collect(::, 0)
+  (y - fittedY).norm(2)
+}
+</pre></div>
+
+So far we have left out an important aspect of a standard linear regression 
+model. Usually there is a constant bias term added to the model. Without 
+that, our model always crosses through the origin and we only learn the 
+right angle. An easy way to add such a bias term to our model is to add a 
+column of ones to the feature matrix `\(\mathbf{X}\)`. 
+The corresponding weight in the parameter vector will then be the bias term.
+
+Here is how we add a bias column:
+
+<div class="codehilite"><pre>
+val drmXwithBiasColumn = drmX cbind 1
+</pre></div>
+
+Now we can give the newly created DRM ```drmXwithBiasColumn``` to our model 
fitting method ```ols``` and see how well the resulting model fits the training 
data with ```goodnessOfFit```. You should see a large improvement in the result.
+
+<div class="codehilite"><pre>
+val betaWithBiasTerm = ols(drmXwithBiasColumn, y)
+goodnessOfFit(drmXwithBiasColumn, betaWithBiasTerm, y)
+</pre></div>
+
+As a further optimization, we can make use of the DSL's caching functionality. We use ```drmXwithBiasColumn``` repeatedly as input to a computation, so it might be beneficial to cache it in memory. This is achieved by calling ```checkpoint()```. In the end, we remove it from the cache with ```uncache()```:
+
+<div class="codehilite"><pre>
+val cachedDrmX = drmXwithBiasColumn.checkpoint()
+
+val betaWithBiasTerm = ols(cachedDrmX, y)
+val goodness = goodnessOfFit(cachedDrmX, betaWithBiasTerm, y)
+
+cachedDrmX.uncache()
+
+goodness
+</pre></div>
+
+
+Liked what you saw? Check out Mahout's overview for the [Scala and Spark bindings](https://mahout.apache.org/users/sparkbindings/home.html).
\ No newline at end of file
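
The tutorial above defines three environment variables before `bin/mahout spark-shell` is started. A small pre-flight check along the following lines may help catch a missing one early. This is only a sketch: the function name `check_mahout_env` is hypothetical, and the `spark://` test simply mirrors the tutorial's remark that the master URL starts with **spark://**.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of Mahout.
# Returns 0 only if MAHOUT_HOME and SPARK_HOME point at existing directories
# and MASTER looks like the spark:// URL copied from the Spark web UI.
check_mahout_env() {
  status=0
  for dir_var in MAHOUT_HOME SPARK_HOME; do
    eval "dir=\${$dir_var:-}"            # read the variable named by $dir_var
    if [ ! -d "$dir" ]; then
      echo "$dir_var is unset or not a directory: '$dir'" >&2
      status=1
    fi
  done
  case "${MASTER:-}" in
    spark://*) ;;                        # matches the tutorial's expectation
    *) echo "MASTER should be the spark:// URL of the Spark master" >&2
       status=1 ;;
  esac
  return "$status"
}
```

Running `check_mahout_env && bin/mahout spark-shell` from the Mahout directory would then start the shell only when all three variables look plausible.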

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md
 
b/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md
new file mode 100644
index 0000000..9df07da
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md
@@ -0,0 +1,57 @@
+---
+layout: default
+title: Wikipedia XML parser and Naive Bayes Example
+theme:
+    name: retro-mahout
+---
+# Wikipedia XML parser and Naive Bayes Classifier Example
+
+## Introduction
+Mahout has an [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1] which will download a recent XML dump of the (entire if desired) [English Wikipedia database](http://dumps.wikimedia.org/enwiki/latest/). After running the classification script, you can use the [document classification script](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala) from the Mahout [spark-shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html) to vectorize and classify text from outside of the training and testing corpus using a model built on the Wikipedia dataset.  
+
+You can run this script to build and test a Naive Bayes classifier for option 
(1) 10 arbitrary countries or option (2) 2 countries (United States and United 
Kingdom).
+
+## Overview
+
+To run the example, simply execute the `$MAHOUT_HOME/examples/bin/classify-wikipedia.sh` script.
+
+By default the script is set to run on a medium-sized Wikipedia XML dump.  To run on the full set (the entire English Wikipedia) you can change the download by commenting out line 78 and uncommenting line 80 of [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1]. However, this is not recommended unless you have the resources to do so. *Be sure to clean your work directory when changing datasets (option 3).*
+
+The step-by-step process for creating a Naive Bayes classifier for the Wikipedia XML dump is very similar to that for [creating a 20 Newsgroups Classifier](http://mahout.apache.org/users/classification/twenty-newsgroups.html) [4].  The only difference is that instead of running `$mahout seqdirectory` on the unzipped 20 Newsgroups file, you'll run `$mahout seqwiki` on the unzipped Wikipedia XML dump.
+
+    $ mahout seqwiki 
+
+The above command launches `WikipediaToSequenceFile.java`, which accepts a text file of categories [3] and starts an MR job to parse each document in the XML file.  This process will seek to extract documents with a Wikipedia category tag which (exactly, if the `-exactMatchOnly` option is set) matches a line in the category file.  If no match is found and the `-all` option is set, the document will be dumped into an "unknown" category. The documents will then be written out as a `<Text,Text>` sequence file of the form (K:/category/document_title , V: document).
+
+There are 3 different example category files available in the /examples/src/test/resources directory:  country.txt, country10.txt and country2.txt.  You can edit these categories to extract a different corpus from the Wikipedia dataset.
+
+The CLI options for `seqwiki` are as follows:
+
+    --input          (-i)         input pathname String
+    --output         (-o)         the output pathname String
+    --categories     (-c)         the file containing the Wikipedia categories
+    --exactMatchOnly (-e)         if set, then the Wikipedia category must match
+                                    exactly instead of simply containing the category string
+    --all            (-all)       if set select all categories
+    --removeLabels   (-rl)        if set, remove [[Category:labels]] from document text after extracting label.
+
+
+After `seqwiki`, the script runs `seq2sparse`, `split`, `trainnb` and `testnb` 
as in the [step by step 20newsgroups 
example](http://mahout.apache.org/users/classification/twenty-newsgroups.html). 
 When all of the jobs have finished, a confusion matrix will be displayed.
+
+# Resources
+
+[1] 
[classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh)
+
+[2] [Document classification script for the Mahout Spark 
Shell](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala)
+
+[3] [Example category 
file](https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt)
+
+[4] [Step by step instructions for building a Naive Bayes classifier for 
20newsgroups from the command 
line](http://mahout.apache.org/users/classification/twenty-newsgroups.html)
+
+[5] [Mahout MapReduce Naive 
Bayes](http://mahout.apache.org/users/classification/bayesian.html)
+
+[6] [Mahout Spark Naive 
Bayes](http://mahout.apache.org/users/algorithms/spark-naive-bayes.html)
+
+[7] [Mahout Scala Spark and H2O 
Bindings](http://mahout.apache.org/users/sparkbindings/home.html)
+
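
The Wikipedia classifier page above shows `mahout seqwiki` without arguments and then lists its options. As an illustration of how those options might combine, here is a dry-run sketch that only prints a command line. The function name `seqwiki_command` and all three paths are placeholders, not values taken from `classify-wikipedia.sh`.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints a seqwiki command line, runs nothing.
# $1 = unzipped Wikipedia XML dump, $2 = output path, $3 = category file.
# Uses only the -i, -o, -c and -e options listed in the page above.
seqwiki_command() {
  printf 'mahout seqwiki -c %s -i %s -o %s -e\n' "$3" "$1" "$2"
}

# Placeholder paths for illustration only.
seqwiki_command enwiki.xml wikipedia-seq country10.txt
# prints: mahout seqwiki -c country10.txt -i enwiki.xml -o wikipedia-seq -e
```

Dropping `-e` from the format string would give the looser "contains the category string" matching that the option list describes.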

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/general/books-tutorials-and-talks.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/old_site/general/books-tutorials-and-talks.md 
b/website/old_site_migration/old_site/general/books-tutorials-and-talks.md
new file mode 100644
index 0000000..bbbdeef
--- /dev/null
+++ b/website/old_site_migration/old_site/general/books-tutorials-and-talks.md
@@ -0,0 +1,121 @@
+---
+layout: default
+title: Books Tutorials and Talks
+theme:
+    name: retro-mahout
+---
+# Intro
+
+This page is a place for info about talks (past and upcoming), tutorials, 
articles, books, slides, PDFs, discussions, etc. about Mahout. No endorsements 
are implied or
+given.
+
+# Books
+
+## Mahout specific
+
+   * <a href="http://www.weatheringthroughtechdays.com/2016/02/mahout-samsara-book-is-out.html">Apache Mahout: Beyond MapReduce</a> by Dmitriy Lyubimov and Andrew Palumbo published Feb 2016. Covers new features in Mahout "Samsara" releases (0.10, 0.11+).
+   * <a href="http://www.packtpub.com/apache-mahout-cookbook/book">Apache Mahout cookbook</a> - Book by Piero Giacomelli published Dec 2013 by Packtpub.
+   * <a href="http://www.manning.com/owen/">Mahout in Action</a> - Book by Sean Owen, Robin Anil, Ted Dunning and Ellen Friedman published Oct 2011 by Manning Publications.
+   * <a href="http://www.manning.com/ingersoll/">Taming Text</a> - By Grant Ingersoll and Tom Morton, published by Manning Publications. Will have some Mahout coverage, but by no means as complete as Mahout in Action.
+
+## Engineering oriented machine learning books
+
+   * <a href="http://www.amazon.com/Collective-Intelligence-Action-Satnam-Alag/dp/1933988312/ref=pd_bbs_sr_3?ie=UTF8&s=books&qid=1214545249&sr=1-3">Collective Intelligence in Action</a>
+   * <a href="http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325/ref=pd_bbs_sr_1/104-1017533-9408723?ie=UTF8&s=books&qid=1214593516&sr=1-1">Programming Collective Intelligence</a>
+   * <a href="http://www.amazon.com/Algorithms-Intelligent-Web-Haralambos-Marmanis/dp/1933988665/ref=sr_1_1?s=books&ie=UTF8&qid=1298005918&sr=1-1">Algorithms of the Intelligent Web</a>
+
+## Scientific background
+
+   * <a href="http://www.cs.waikato.ac.nz/~ml/weka/book.html">Data Mining: Practical Machine Learning Tools and Techniques</a>
+   * <a href="http://www-nlp.stanford.edu/IR-book/">Introduction to Information Retrieval</a>
+   * <a href="http://www.amazon.com/Machine-Learning-Mcgraw-Hill-International-Edit/dp/0071154671/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1214593709&sr=8-1">Machine Learning</a>
+   * <a href="http://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738/ref=pd_bbs_sr_2?ie=UTF8&s=books&qid=1214593709&sr=8-2">Pattern Recognition and Machine Learning (Information Science and Statistics)</a>
+
+# News, Articles and Tutorials
+
+   * [Mahout 0.10.x: first Mahout release as a programming environment](http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html)
+   * [Comparing Document Classification Functions of Lucene and Mahout](http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html)
+   * <a href="http://www.ibm.com/developerworks/java/library/j-mahout-scaling/">Apache Mahout: Scalable Machine Learning for Everyone</a>
+   * <a href="http://emmaespina.wordpress.com/2011/04/26/ham-spam-and-elephants-or-how-to-build-a-spam-filter-server-with-mahout/">How to build a spam filter server with Mahout</a> - Applying classification on a live server - April 2011
+   * <a href="http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/">Deploying a massively scalable recommender system with Apache Mahout</a> - Blogpost of Sebastian Schelter in April 2011
+   * <a href="http://www.redmonk.com/cote/2010/11/04/makeall013/">Apache Mahout & the commoditization of machine learning</a> - Podcast interview with Grant Ingersoll at ApacheCon 2010
+   * <a href="http://isabel-drost.de/hadoop/slides/devoxx.pdf">Apache Mahout 0.4 mit neuen Algorithmen</a> - published after the 0.4 release by heise Open/ Developer, November 2010
+   * <a href="http://www.infoq.com/news/2009/04/mahout">Mahout on InfoQ</a> - Interview with Grant Ingersoll on InfoQ
+   * <a href="http://www.cloudera.com/blog/2009/04/21/hadoop-uk-user-group-meeting/">Mahout in the Cloudera weblog</a> - published after the Hadoop user group UK.
+   * <a href="http://blog.athico.com/2008/08/machine-learning-and-apache-mahout.html">Mahout in the Drools weblog</a> - Michael Neale published an article on Mahout in the drools weblog
+   * <a href="https://www.ibm.com/developerworks/java/library/j-mahout/index.html">Introducing Apache Mahout</a> - Grant Ingersoll - Intro to Apache Mahout focused on clustering, classification and collaborative filtering. Japanese translation available at: [http://www.ibm.com/developerworks/jp/java/library/j-mahout/](http://www.ibm.com/developerworks/jp/java/library/j-mahout/)
+   * <a href="http://philippeadjiman.com/blog/2009/11/11/flexible-collaborative-filtering-in-java-with-mahout-taste/">Flexible Collaborative Filtering In Java With Mahout Taste</a> - Philippe Adjiman - Quick starting guide on how to use the collaborative filtering package of Mahout (called Taste) to quickly and flexibly create, test and compare tailored recommendation engines.
+   * <a href="http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/">Integrating Mahout with Lucene and Solr</a> - Three part series on ways to integrate Mahout with Lucene and Solr
+   * <a href="https://www.youtube.com/watch?v=yD40rVKUwPI">Mahout Item Recommender Tutorial using Java and Eclipse</a> - YouTube video tutorial by Steve Cook
+
+
+# Coursework/Lectures
+
+   * <a 
href="http://videolectures.net/mlss05us_chicago/";>http://videolectures.net/mlss05us_chicago/</a>
+   * <a 
href="http://videolectures.net/mlas06_pittsburgh/";>http://videolectures.net/mlas06_pittsburgh/</a>
+   * <a 
href="http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1";>Stanford
 Lectures on Machine Learning by Andrew Ng</a>
+   * <a 
href="https://docs.google.com/open?id=0ByhGL2_SCeitMDQ3OTczNjItM2ZjYi00ZDg5LWE0MzItZGQxODQ5NzkzYjNj";>CMU@Qatar
 Introduction to Mahout lecture</a>
+
+
+# Talks
+
+In reverse chronological order, so that most recent talks are at the top
+
+   * [Distributed Machine Learning with Apache Mahout] Suneel Marthi at Apache 
Big Data North America, Vancouver, Canada, May 11, 2016 and MapR Washington DC 
Big Data Everywhere, Tysons, VA, June 2 2016
+   * [Declarative Machine Learning with the Samsara 
DSL](http://www.slideshare.net/FlinkForward/sebastian-schelter-distributed-machine-learing-with-the-samsara-dsl)
 Sebastian Schelter at Flink Forward Conference, Berlin Germany, October 2015.
+   * [Bringing Algebraic Semantics to 
Mahout](http://www.slideshare.net/sscdotopen/bringing-algebraic-semantics-to-mahout)
 Sebastian Schelter at HPI Infolunch, Potsdam Germany, May 2014
+   * Mahout Spark and Scala bindings: Bringing Algebraic Semantics 
([slides](http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings)/[video](http://youtu.be/h9dpmvNW1Dw))
 - Dmitriy Lyubimov at Mahout Meetup, April 17, 2014. 
+   * Mahout Future Directions - Ted Dunning, Suneel Marthi, Sebastian Schelter 
at Hadoop Summit Europe 2014, Amsterdam, April 3, 2014
+   * Building Recommender Systems for Mere-Mortals - Sebastian Schelter at 
Researchgate Developer Day, Berlin, November 2013
+   * Recommendations with Apache Mahout - Sebastian Schelter at IBM Almaden 
Research Center, San Jose, September 2013
+   * <a 
href="http://de.slideshare.net/sscdotopen/next-directions-in-mahouts-recommenders";>Next
 Directions in Mahout's Recommenders</a> - Sebastian Schelter at Bay Area 
Mahout Meetup, Redwood City, August 2013 
+   * <a 
href="http://de.slideshare.net/sscdotopen/new-directions-in-mahouts-recommenders";>New
 Directions in Mahout's Recommenders</a> - Sebastian Schelter at Recommender 
Systems Get Together Berlin, April 2013
+   * <a 
href="http://www.slideshare.net/VaradMeru/introduction-to-mahout-and-machine-learning";>Introduction
 to Mahout and Machine Learning</a> - Slides by Varad Meru, Software 
Development Engineer at Orzota. July 27th, 2013.
+   * <a 
href="http://de.slideshare.net/sscdotopen/introduction-to-collaborative-filtering-with-apache-mahout";>An
 Introduction to Collaborative Filtering with Apache Mahout</a> - Sebastian 
Schelter at Recommender Systems Challenge Workshop in conjunction with ACM 
RecSys 2012, Dublin, September 2012
+   * <a 
href="https://github.com/ManuelB/facebook-recommender-demo/raw/master/docs/Talk-BedCon-Berlin-2012.pdf";>How
 to build a recommender system based on Mahout and JavaEE</a> - Slides by 
Manuel Blechschmidt at Berlin Expert Days March, 2012.
+   * <a href="http://lanyrd.com/2011/apachecon-north-america/skdtb/";>Apache 
Mahout for intelligent data analysis</a> - Slides from Isabel Drost at Apache 
Con NA November, 2011.
+   * <a href="http://lanyrd.com/2011/apachecon-north-america/skdrk/";>Dr. 
Mahout: Analyzing clinical data using scalable and distributed computing</a> - 
Slides from Shannon Quinn at Apache Con NA November, 2011.
+   * Frank Scholten at Berlin Buzzwords on June 7, 2011.
+   * Introduction to Collaborative Filtering using Mahout (updated) - Talk by 
Sean Owen at the London Hadoop User Group on April 14, 2011.
+   *  <a 
href="http://www.meetup.com/LA-HUG/pages/Video_from_March_16th_LA-HUG_Ted_Dunning_Mahout";>Cool
 Tricks with Classifiers</a> - Talk by Ted Dunning at the Los Angeles HUG 
talking about Mahout classifiers on March 16, 2011.
+   * First Mahout Hackathon, Berlin, March 2011
+   * <a 
href="http://blog.jteam.nl/2011/01/13/announcement-lucene-nl-mahout-meetup-with-isabel-drost-feb-7/";>Mahout
 meetup</a> - there were two talks at the Apache Mahout meetup at JTeam in 
Amsterdam, February 2011. <a 
href="http://isabel-drost.de/hadoop/slides/jteam.pdf";>intro slides</a>
+   * <a 
href="http://www.fosdem.org/2011/schedule/event/mahoutclustering.html";>Mahout 
clustering </a> - Talk on Mahout clustering at data dev room FOSDEM, February 
2011.
+   * Scaling Data Analysis with Apache Mahout - talk on Mahout at O'Reilly 
Strata, February 2011. 
+   * <a 
href="http://www.slideshare.net/jaganadhg/mahout-tutorial-fossmeet-nitc";>Practical
 Machine Learning</a> - Slides from Biju B and Jaganadh G, FOSSMEET-NITC, 
Calicut, India, February 2011.
+   * <a href="http://www.javaedge.com/jedge/pdf/Mahout.pdf";>Mahout at 
AlphaCSPs The Edge 2010 (pdf)</a> - <a 
href="http://www.slideshare.net/arikogan/mahouts-presentation-at-alphacsps-the-edge-2010";>slideshare</a>
 - Slides from <a href="http://il.linkedin.com/in/arielkogan";>Ariel Kogan</a> 
AlphaCSP's The Edge, December 2010.
+   * <a href="http://isabel-drost.de/hadoop/slides/devoxx.pdf";>Intelligent 
data analysis with Apache Mahout</a> - Slides from Isabel Drost, Devoxx 
Antwerp, November 2010.
+   * <a href="http://isabel-drost.de/hadoop/slides/codebits.pdf";>Apache Mahout 
introduction</a> - Slides from Isabel Drost, codebits Lisbon, November 2010.
+   * <a href="http://isabel-drost.de/hadoop/slides/apachecon_2010.pdf";>Apache 
Mahout - Making Data Analysis Easy</a> - Slides from Isabel Drost, Apache Con 
US Atlanta, November 2010.
+   * <a href="http://www.slideshare.net/jaganadhg/bck9";>Practical Machine 
Learning</a> - Slides from Jaganadh G, BarCamp Kerala 9, November 2010.
+   * <a href="http://www.slideshare.net/tdunning/sdforum-11042010";>Mahout and 
its new classification framework</a> - Slides from Ted Dunning, SDForum, 
November 2010.
+   * <a href="http://www.slideshare.net/sscdotopen/mahoutcf";>Distributed 
Item-based Collaborative Filtering with Apache Mahout</a> - Slides from 
Sebastian Schelter, Hadoop Get Together Berlin, October 2010.
+   * <a href="http://isabel-drost.de/hadoop/slides/HMM.pdf";>Hidden Markov 
Models for Mahout</a> - Slides from Max Heimel, Hadoop Get Together Berlin, 
October 2010.
+   * <a 
href="http://www.slideshare.net/robinanil/oscon-apache-mahout-mammoth-scale-machine-learning";>Apache
 Mahout Mammoth Scale Machine Learning </a> - Slides from Robin Anil, OSCON 
2010.
+   * <a href="http://slidesha.re/9LxOIu";>Intro to Apache Mahout</a> - Slides 
from Grant Ingersoll,  RTP Semantic Web Group.
+   * <a href="http://www.slideshare.net/ydn/3-biometric-hadoopsummit2010";>Case 
study: Biometric Databases and Hadoop </a> - Slides from Jason Trost, Hadoop 
Summit 2010.
+   * <a 
href="http://www.slideshare.net/hadoopusergroup/mail-antispam?from=ss_embed";>Spam
 Fighting at Yahoo</a>
+   * <a 
href="http://www.slideshare.net/hadoopusergroup/bixo-hug-talk?from=ss_embed";>Web
 Mining with Ken Krugler</a>
+   * <a 
href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/ingersoll_bbuzz2010.pdf";>Keynote
 on intelligent search</a> - Slides from Grant Ingersoll, Berlin Buzzwords, 
June 2010.
+   * <a 
href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/owen_bbuzz2010.pdf";>Simple
 co-occurrence-based recommendation on Hadoop</a> - Slides from Sean Owen, 
Berlin Buzzwords, June, 2010.
+   * <a 
href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/scholten_bbuzz2010.odp";>Introduction
 to Collaborative Filtering using Mahout</a> - Slides from Frank Scholten, 
Berlin Buzzwords, June, 2010.
+   * <a 
href="http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/";>Introduction
 to Scalable Machine Learning</a> - Slides and demos from Grant Ingersoll, 
March, 2010.
+   * Mahout @ India Hadoop Summit - Slides from a 1 hour talk on Mahout at the 
India Hadoop Summit by Robin Anil, February 2010.
+   * <a 
href="http://www.isabel-drost.de/hadoop/slides/opensourceexpo09.pdf";>Mahout in 
10 minutes</a> - Slides from a 10 min intro to Mahout at the Map Reduce 
tutorial by David Z&uuml;lke at Open Source Expo in Karlsruhe, Isabel Drost, 
November 2009.
+   * <a 
href="http://www.isabel-drost.de/hadoop/slides/apacheconus2009.pdf";>Mahout at 
Apache Con US </a> - Slides from a talk on "Going from raw data to information" 
(with Mahout) at Apache Con US in Oakland, Isabel Drost, November 2009.
+   * <a href="http://www.isabel-drost.de/hadoop/slides/froscon2009.pdf";>Mahout 
at FrOSCon</a> - Slides from a talk on Mahout at FrOSCon in Sankt Augustin, 
Isabel Drost, August 2009.
+   * <a href="http://www.isabel-drost.de/hadoop/slides/dai.pdf";>Mahout at DAI 
group TU Berlin</a> - Slides from a talk on Mahout at the DAI Laboratories TU 
Berlin, Isabel Drost, July 2009.
+   * <a href="http://www.isabel-drost.de/hadoop/slides/ulf.pdf";>Mahout at 
Machine Learning Group TU Berlin</a> - Slides from a talk on Hadoop with some 
detour to Mahout at the Machine Learning Group of Prof. Dr. Klaus-Robert 
M&uuml;ller at TU Berlin, Isabel Drost, June 2009.
+   * <a href="http://www.isabel-drost.de/hadoop/slides/google.pdf";>Mahout at 
Google Z&uuml;rich</a> - Slides from a Google tech-talk on the past, present 
and future of Mahout, Isabel Drost, May 2009.
+   * <a 
href="http://static.last.fm/johan/huguk-20090414/isabel_drost-introducing_apache_mahout.pdf";>Hadoop
 user group UK</a> - Slides from a talk on April 14, 2009 at the Hadoop User 
Group UK in London, Isabel Drost, April 2009.
+   * <a 
href="http://cwiki.apache.org/confluence/download/attachments/88410/SDForum.pdf";>BI
 Over Petabytes: Meet Apache Mahout</a> - Slides from a talk by Jeff Eastman on 
April 21, 2009 at the Bay Area SD Forum Business Intelligence SIG meeting at 
SAP in Palo Alto, CA.
+   * Lucene Meetup and Apache Barcamp in Amsterdam, March 2009.
+   * BarCampRDU - (Raleigh) on Aug. 2, 2008
+   * Introducing Mahout: Apache Machine Learning - Committer Grant Ingersoll 
gave a gentle introduction to Mahout and Machine Learning at ApacheCon in 
November (3rd through 7th) in New Orleans, USA. 
+   * Mahout: Scaling Machine Learning - Introduction to Mahout and machine 
learning at FrOSCon in Sankt Augustin/Germany, Isabel Drost, August 2008.  (<a 
href="http://cwiki.apache.org/confluence/download/attachments/88410/froscon.pdf";>slides</a>)
+   * Mahout: Scalable Machine Learning - An introduction to Mahout and machine 
learning at the first German Hadoop gathering in newthinking store/ Berlin, 
Isabel Drost, July 2008.
+   * Apache Mahout: Industrial Strength Machine Learning - Committer Jeff 
Eastman gave an introduction to Mahout at Yahoo\!, May 2008
+   * <a 
href="http://people.apache.org/~berndf/openexpode08-lucene-talk.pdf";>Apache 
Lucene - Mach's wie Google</a> - Bernd Fondermann presented an overview of the 
Apache Lucene project, including Mahout, at Open Source Expo 2008 in 
Karlsruhe, May 2008.
+   * Apache Mahout: Bringing Machine Learning to Industrial Strength - 
Committer Isabel Drost gave a Fast Feather introduction to the new project 
Mahout at Apache Con EU April, 2008
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/general/mahout-wiki.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/old_site/general/mahout-wiki.md 
b/website/old_site_migration/old_site/general/mahout-wiki.md
new file mode 100644
index 0000000..2df16d4
--- /dev/null
+++ b/website/old_site_migration/old_site/general/mahout-wiki.md
@@ -0,0 +1,202 @@
+---
+layout: default
+title: Mahout Wiki
+theme:
+    name: retro-mahout
+---
+
+On the fence about including this in new site. lol at "new Apache TLP"
+
+Apache Mahout is a new Apache TLP project to create scalable, machine
+learning algorithms under the Apache license. 
+
+
+<a name="MahoutWiki-General"></a>
+## General
+[Overview](overview.html)
+ -- Mahout? What's that supposed to be?
+
+[Quickstart](quickstart.html)
+ -- learn how to quickly setup Apache Mahout for your project.
+
+[FAQ](faq.html)
+ -- Frequent questions encountered on the mailing lists.
+
+[Developer Resources](developer-resources.html)
+ -- overview of the Mahout development infrastructure.
+
+[How To Contribute](how-to-contribute.html)
+ -- get involved with the Mahout community.
+
+[How To Become A Committer](how-to-become-a-committer.html)
+ -- become a member of the Mahout development community.
+
+[Hadoop](http://hadoop.apache.org)
+ -- several of our implementations depend on Hadoop.
+
+[Machine Learning Open Source Software](http://mloss.org/software/)
+ -- other projects implementing Open Source Machine Learning libraries.
+
+[Mahout -- The name, history and its pronunciation](mahoutname.html)
+
+<a name="MahoutWiki-Community"></a>
+## Community
+
+[Who we are](who-we-are.html)
+ -- who are the developers behind Apache Mahout?
+
+[Books, Tutorials, Talks, Articles, News, Background Reading, etc. on 
Mahout](books-tutorials-and-talks.html)
+
+[Issue Tracker](issue-tracker.html)
+ -- see what features people are working on, submit patches and file bugs.
+
+[Source Code (SVN)](https://svn.apache.org/repos/asf/mahout/)
+ -- [Fisheye](http://fisheye6.atlassian.com/browse/mahout)
+ -- download the Mahout source code from svn.
+
+[Mailing lists and IRC](mailing-lists,-irc-and-archives.html)
+ -- links to our mailing lists, IRC channel and archived design and
+algorithm discussions; maybe your question was answered there already?
+
+[Version Control](version-control.html)
+ -- where we track our code.
+
+[Powered By Mahout](powered-by-mahout.html)
+ -- who is using Mahout in production?
+
+[Professional Support](professional-support.html)
+ -- who is offering professional support for Mahout?
+
+[Mahout and Google Summer of Code](gsoc.html)
+  -- All you need to know about Mahout and GSoC.
+
+
+[Glossary of commonly used terms and abbreviations](glossary.html)
+
+<a name="MahoutWiki-Installation/Setup"></a>
+## Installation/Setup
+
+[System Requirements](system-requirements.html)
+ -- what do you need to run Mahout?
+
+[Quickstart](quickstart.html)
+ -- get started with Mahout, run the examples and get pointers to further
+resources.
+
+[Downloads](downloads.html)
+ -- a list of Mahout releases.
+
+[Download and installation](buildingmahout.html)
+ -- build Mahout from the sources.
+
+[Mahout on Amazon's EC2 Service](mahout-on-amazon-ec2.html)
+ -- run Mahout on Amazon's EC2.
+
+[Mahout on Amazon's EMR](mahout-on-elastic-mapreduce.html)
+ -- Run Mahout on Amazon's Elastic Map Reduce
+
+[Integrating Mahout into an Application](mahoutintegration.html)
+ -- integrate Mahout's capabilities in your application.
+
+<a name="MahoutWiki-Examples"></a>
+## Examples
+
+1. [ASF Email Examples](asfemail.html)
+ -- Examples of recommenders, clustering and classification all using a
+public domain collection of 7 million emails.
+
+<a name="MahoutWiki-ImplementationBackground"></a>
+## Implementation Background
+
+<a name="MahoutWiki-RequirementsandDesign"></a>
+### Requirements and Design
+
+[Matrix and Vector Needs](matrix-and-vector-needs.html)
+ -- requirements for Mahout vectors.
+
+[Collection(De-)Serialization](collection(de-)serialization.html)
+
+<a name="MahoutWiki-CollectionsandAlgorithms"></a>
+### Collections and Algorithms
+
+Learn more about [mahout-collections](mahout-collections.html)
+, containers for efficient storage of primitive-type data and open hash
+tables.
+
+Learn more about the [Algorithms](algorithms.html)
+ discussed and employed by Mahout.
+
+Learn more about the [Mahout recommender 
implementation](recommender-documentation.html)
+.
+
+<a name="MahoutWiki-Utilities"></a>
+### Utilities
+
+This section describes tools that might be useful for working with Mahout.
+
+[Converting Content](converting-content.html)
+ -- Mahout has some utilities for converting content such as logs to
+formats more amenable for consumption by Mahout.
+[Creating Vectors](creating-vectors.html)
+ -- Mahout's algorithms operate on vectors. Learn more on how to generate
+these from raw data.
+[Viewing Result](viewing-result.html)
+ -- How to visualize the result of your trained algorithms.
+
+<a name="MahoutWiki-Data"></a>
+### Data
+
+[Collections](collections.html)
+ -- To try out and test Mahout's algorithms you need training data. We are
+always looking for new training data collections.
+
+<a name="MahoutWiki-Benchmarks"></a>
+### Benchmarks
+
+[Mahout Benchmarks](mahout-benchmarks.html)
+
+<a name="MahoutWiki-Committer'sResources"></a>
+## Committer's Resources
+
+* [Testing](testing.html)
+ -- Information on test plans and ideas for testing
+
+<a name="MahoutWiki-ProjectResources"></a>
+### Project Resources
+
+* [Dealing with Third Party Dependencies not in 
Maven](thirdparty-dependencies.html)
+* [How To Update The Website](how-to-update-the-website.html)
+* [Patch Check List](patch-check-list.html)
+* [How To 
Release](http://cwiki.apache.org/confluence/display/MAHOUT/How+to+release)
+* [Release Planning](release-planning.html)
+* [Sonar Code Quality 
Analysis](https://analysis.apache.org/dashboard/index/63921)
+
+<a name="MahoutWiki-AdditionalResources"></a>
+### Additional Resources
+
+* [Apache Machine Status](http://monitoring.apache.org/status/)
+ \- Check to see if SVN, other resources are available.
+* [Committer's FAQ](http://www.apache.org/dev/committers.html)
+* [Apache Dev](http://www.apache.org/dev/)
+
+
+<a name="MahoutWiki-HowToEditThisWiki"></a>
+## How To Edit This Wiki
+
+How to edit this Wiki
+
+This Wiki is a collaborative site, anyone can contribute and share:
+
+* Create an account by clicking the "Login" link at the top of any page,
+and picking a username and password.
+* Edit any page by pressing Edit at the top of the page
+
+There are some conventions used on the Mahout wiki:
+
+    * {noformat}+*TODO:*+{noformat} (+*TODO:*+ ) is used to denote sections
+that definitely need to be cleaned up.
+    * {noformat}+*Mahout_(version)*+{noformat} (+*Mahout_0.2*+) is used to
+draw attention to which version of Mahout a feature was (or will be) added
+to Mahout.
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/general/professional-support.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/old_site/general/professional-support.md 
b/website/old_site_migration/old_site/general/professional-support.md
new file mode 100644
index 0000000..45d798c
--- /dev/null
+++ b/website/old_site_migration/old_site/general/professional-support.md
@@ -0,0 +1,41 @@
+---
+layout: default
+title: Professional Support
+theme:
+    name: retro-mahout
+---
+
+NOTE: on the fence about including this in new site.
+
+<a name="ProfessionalSupport-ProfessionalsupportforMahout"></a>
+# Professional support for Mahout
+
+Add yourself or your company if you are offering support for Mahout
+users. Please keep lists in alphabetical order. An entry here
+is not an endorsement by the Apache Software Foundation nor any of its
+committers.
+
+
+<a name="ProfessionalSupport-Peopleandcompaniesforhire"></a>
+## People and companies for hire
+
+| Name | Contact details | Notes |
+|------|-----------------|-------|
+| Accenture | [email protected] | [Consulting services in big 
data analytics](http://accenture.com) |
+| Boston Predictive Analytics | [email protected] | 
[http://tutorteddy.com/site/free_statistics_help.php](http://tutorteddy.com/site/free_statistics_help.php)
 |
+| Frank Scholten | [email protected] | |
+| GridLine | [http://www.gridline.nl/contact](http://www.gridline.nl/contact) 
| Specialised in search and thesauri |
+| Jagdish Nomula | [email protected] | ML, Search, Algorithms, Java 
[http://www.kosmex.com](http://www.kosmex.com) |
+| LucidWorks | [http://www.lucidworks.com](http://www.lucidworks.com) | Big 
data platform including Mahout as a service for clustering, classification and 
more |
+| Sematext International | [http://sematext.com/](http://sematext.com/) | |
+| Ted Dunning | [email protected] | Full commercial support |
+| Winterwell | [email protected] | Business/maths concept development & 
algorithms [http://winterwell.com](http://winterwell.com) |
+
+<a name="ProfessionalSupport-Talksandpresentations"></a>
+## Talks and presentations
+
+| Name | Contact details | Notes |
+|------|-----------------|-------|
+| Andrew Musselman | [email protected] | ["Building a Recommender with Apache 
Mahout on Amazon 
Elastic-MapReduce"](https://blogs.aws.amazon.com/bigdata/post/Tx1TDK3HHBD4EZL/Building-a-Recommender-with-Apache-Mahout-on-Amazon-Elastic-MapReduce-EMR)
 |
+| Frank Scholten | [email protected] | Mahout/Taste 
[http://blog.jteam.nl/author/frank/](http://blog.jteam.nl/author/frank/) |
+| Isabel Drost-Fromm | [email protected] | If travel and accommodation costs 
are covered scheduling a talk is a lot easier. |

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/general/reference-reading.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/old_site/general/reference-reading.md 
b/website/old_site_migration/old_site/general/reference-reading.md
new file mode 100644
index 0000000..ba969ac
--- /dev/null
+++ b/website/old_site_migration/old_site/general/reference-reading.md
@@ -0,0 +1,71 @@
+---
+layout: default
+title: Reference Reading
+theme:
+    name: retro-mahout
+---
+
+# Reference Reading
+
+Here we provide references to books and courses about data analysis in 
general, which might also be helpful in the context of Mahout.
+
+<a name="ReferenceReading-GeneralBackgroundMaterials"></a>
+## General Background Materials
+
+Don't be overwhelmed by all the maths; you can do a lot in Mahout with some
+basic knowledge. The books will help you understand your
+data better, and ask better questions both of Mahout's APIs, and also of
+the Mahout community. And unlike learning some particular software tool,
+these are skills that will remain useful decades later.
+
+ * [Gilbert Strang](http://www-math.mit.edu/~gs)
+'s [Introduction to Linear Algebra](http://math.mit.edu/linearalgebra/). His 
[lectures](http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/)
 are also [available online](http://web.mit.edu/18.06/www/)
+ and are strongly recommended. 
+ * [Mathematical Tools for Applied Multivariate 
Analysis](http://www.amazon.com/Mathematical-Tools-Applied-Multivariate-Analysis/dp/0121609553/ref=sr_1_1?ie=UTF8&qid=1299602805&sr=8-1)
 by J. Douglas
+Carroll.
+ * [Stanford Machine Learning online 
courseware](http://www.stanford.edu/class/cs229/)
+ * [MIT Machine Learning online 
courseware](http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/)
  has [lecture 
notes](http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/)
 online.
+ * As a pre-requisite to probability and statistics, you'll need [basic 
calculus](http://en.wikipedia.org/wiki/Calculus). A maths for scientists text 
might be useful here such as 'Mathematics for Engineers and Scientists', Alan 
Jeffrey, Chapman & Hall/CRC. 
([openlibrary](http://openlibrary.org/books/OL3305993M/Mathematics_for_engineers_and_scientists))
+ * One of the best writers in the probability/statistics world is Sheldon 
Ross. Try [A First Course in Probability (8th 
Edition)](http://www.pearsonhighered.com/educator/product/First-Course-in-Probability-A/9780136033134.page)
 and then move on to his [Introduction to Probability 
Models](http://www.amazon.com/Introduction-Probability-Models-Sixth-Sheldon/dp/0125984707)
+
+Some good introductory alternatives here are:
+
+ * [Khan Academy](http://www.khanacademy.org/) -- videos on stats, 
probability, linear algebra
+ * [Probability and Statistics (7th 
Edition)](http://www.amazon.com/Probability-Statistics-Engineering-Sciences-InfoTrac/dp/0534399339),
 Jay L. Devore, Chapman.
+ * [Probability and Statistical Inference (7th 
Edition)](http://www.amazon.com/Probability-Statistical-Inference-Robert-Hogg/dp/0132546086),
 Hogg and Tanis, Pearson.
+
+Once you have a grasp of the basics then there are a slew of great texts that 
you might consult:
+
+ * [Statistical 
Inference](http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126),
 Casella and Berger, Duxbury/Thomson Learning.
+ * [Introduction to Bayesian 
Statistics](http://www.amazon.com/Introduction-Bayesian-Statistics-William-Bolstad/dp/0471270202),
 William H. Bolstad, Wiley. 
+ * [Understanding Computational Bayesian 
Statistics](http://www.amazon.com/Understanding-Computational-Bayesian-Statistics-Wiley/dp/0470046090),
 Bolstad
+ * [Bayesian Data Analysis, Gelman et 
al.](http://www.stat.columbia.edu/~gelman/book/)
+
+
+## For statistics related to machine learning, these are particularly helpful:
+
+ * [Pattern Recognition and Machine Learning by Chris 
Bishop](http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm)
+ * [Elements of Statistical 
Learning](http://www-stat.stanford.edu/~tibs/ElemStatLearn/) by Trevor Hastie, 
Robert Tibshirani, Jerome Friedman 
+ 
+
+## For matrix computations/decomposition/factorization etc.:
+
+ * Peter V. O'Neil [Introduction to Linear 
Algebra](http://www.amazon.com/Introduction-Linear-Algebra-Theory-Applications/dp/053400606X),
 a great book for beginners (with some knowledge of calculus). It is not 
comprehensive, but it is a good place to start, and the author begins by 
explaining the concepts in terms of vector spaces, which I found to be a 
more natural approach.
+ * David S. Watkins [Fundamentals of Matrix 
Computations](http://www.amazon.com/Fundamentals-Matrix-Computations-Applied-Mathematics/dp/0470528338/)
+ * [Matrix 
Computations](http://www.amazon.com/Computations-Hopkins-Studies-Mathematical-Sciences/dp/0801854148/ref=sr_1_2?s=books&ie=UTF8&qid=1394307676&sr=1-2&keywords=golub+van+loan)
 is the classic text for numerical linear algebra. Can't go wrong with it - 
great for researchers.  
+ * Nick Trefethen's [Numerical Linear 
Algebra](http://people.maths.ox.ac.uk/trefethen/books.html).  It's a bit more 
approachable for practitioners. Many chapters on SVD, there are even chapters 
on Lanczos.
+
+
+## Books specifically on R:
+
+* Learning about R is a difficult thing. The best introduction is in MASS 
[http://www.stats.ox.ac.uk/pub/MASS4/](http://www.stats.ox.ac.uk/pub/MASS4/)
+* [R Tutor](http://www.r-tutor.com/r-introduction)
+* [Manual](http://cran.r-project.org/doc/manuals/R-intro.pdf)
+* [R Course](http://faculty.washington.edu/tlumley/Rcourse/)
+
+In addition, you should see how to plot data well:
+
+* [Trellis plotting](http://www.statmethods.net/advgraphs/trellis.html)
+* [ggplot2](http://had.co.nz/ggplot2/)
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md 
b/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md
new file mode 100644
index 0000000..39f4bfd
--- /dev/null
+++ 
b/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md
@@ -0,0 +1,88 @@
+---
+layout: default
+title: Matrix and Vector Needs
+theme:
+    name: retro-mahout
+---
+
+<a name="MatrixandVectorNeeds-Intro"></a>
+# Intro
+
+Most ML algorithms require the ability to represent multidimensional data
+concisely and to be able to easily perform common operations on that data.
+MAHOUT-6 introduced Vector and Matrix datatypes of arbitrary cardinality,
+along with a set of common operations on their instances. Vectors and
+matrices are provided with sparse and dense implementations that are memory
+resident and are suitable for manipulating intermediate results within
+mapper, combiner and reducer implementations. They are not intended for
+applications requiring vectors or matrices that exceed the size of a single
+JVM, though such applications might be able to utilize them within a larger
+organizing framework.
+
+<a name="MatrixandVectorNeeds-Background"></a>
+## Background
+
+See 
[http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser](http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser)
+
+<a name="MatrixandVectorNeeds-Vectors"></a>
+## Vectors
+
+Mahout supports a Vector interface that defines the following operations over 
all implementation classes: assign, cardinality, copy, divide, dot, get, 
haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, 
viewPart, zSum and cross. The class DenseVector implements vectors as a 
double[]
+ that is storage and access efficient. The class SparseVector implements
+vectors as a HashMap<Integer, Double> that is surprisingly fast and
+efficient. For sparse vectors, the size() method returns the current number
+of elements whereas the cardinality() method returns the number of
+dimensions it holds. An additional VectorView class allows views of an
+underlying vector to be specified by the viewPart() method. See the
+JavaDocs for more complete definitions.
+
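As a rough illustration of the size() versus cardinality() distinction described above, here is a minimal sketch of a hash-map-backed sparse vector. The class name `SparseVectorSketch` and all of its methods are hypothetical and written only for this example; this is not Mahout's SparseVector implementation, and it assumes that storing an explicit zero simply removes the entry.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a sparse vector; NOT Mahout's SparseVector. */
public class SparseVectorSketch {
    // Number of dimensions the vector spans (fixed at construction).
    private final int cardinality;
    // Only nonzero elements are stored, keyed by their index.
    private final Map<Integer, Double> values = new HashMap<>();

    public SparseVectorSketch(int cardinality) {
        this.cardinality = cardinality;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }

    /** Stores a value; assumption: an explicit zero removes the entry. */
    public void set(int index, double value) {
        checkIndex(index);
        if (value == 0.0) {
            values.remove(index);
        } else {
            values.put(index, value);
        }
    }

    /** Returns the stored value, or 0.0 for an index that holds nothing. */
    public double get(int index) {
        checkIndex(index);
        return values.getOrDefault(index, 0.0);
    }

    /** Current number of stored (nonzero) elements. */
    public int size() {
        return values.size();
    }

    /** Number of dimensions, regardless of how many are populated. */
    public int cardinality() {
        return cardinality;
    }

    /** Dot product; only indices stored in this vector contribute. */
    public double dot(SparseVectorSketch other) {
        if (other.cardinality != cardinality) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : values.entrySet()) {
            sum += e.getValue() * other.values.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    /** Sum of all elements (zeros contribute nothing). */
    public double zSum() {
        double sum = 0.0;
        for (double v : values.values()) {
            sum += v;
        }
        return sum;
    }
}
```

In this sketch, a vector created with cardinality 1000 that has two entries set reports size() of 2 and cardinality() of 1000, which is the distinction the paragraph draws.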
+<a name="MatrixandVectorNeeds-Matrices"></a>
+## Matrices
+
+Mahout also supports a Matrix interface that defines a similar set of 
operations over all implementation classes: assign, assignColumn, assignRow, 
cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, 
times, transpose, toArray, viewPart and zSum. The class DenseMatrix implements 
matrices as a double[][]
+that is storage and access efficient. The class SparseRowMatrix
+implements matrices as a Vector[] holding the rows of the matrix in a
+SparseVector, and the symmetric class SparseColumnMatrix implements
+matrices as a Vector[] holding the columns in a SparseVector. Each of these
+classes can quickly produce a given row or column, respectively. A fourth
+class, SparseMatrix, uses a HashMap<Integer, Vector> whose values are also
+SparseVectors. For sparse matrices, the size() method returns an int\[2\]
+containing the actual row and column sizes whereas the cardinality() method
+returns an int\[2\] with the number of dimensions of each. An additional
+MatrixView class allows views of an underlying matrix to be specified by
+the viewPart() method. See the JavaDocs for more complete definitions.
+
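To make the row-oriented trade-off above concrete, the following is a minimal sketch of a matrix stored as one sparse row per index. `SparseRowMatrixSketch` and its methods are hypothetical names invented for this example and do not reflect Mahout's SparseRowMatrix API. Under these assumptions, fetching a row is a direct lookup, while fetching a column has to visit every row.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical row-oriented sparse matrix; NOT Mahout's SparseRowMatrix. */
public class SparseRowMatrixSketch {
    private final int numRows;
    private final int numCols;
    // One sparse row (column index -> value) per row of the matrix.
    private final List<Map<Integer, Double>> rows = new ArrayList<>();

    public SparseRowMatrixSketch(int numRows, int numCols) {
        this.numRows = numRows;
        this.numCols = numCols;
        for (int i = 0; i < numRows; i++) {
            rows.add(new HashMap<>());
        }
    }

    private void check(int row, int col) {
        if (row < 0 || row >= numRows || col < 0 || col >= numCols) {
            throw new IndexOutOfBoundsException("(" + row + ", " + col + ")");
        }
    }

    /** Stores a value; assumption: an explicit zero removes the entry. */
    public void set(int row, int col, double value) {
        check(row, col);
        if (value == 0.0) {
            rows.get(row).remove(col);
        } else {
            rows.get(row).put(col, value);
        }
    }

    public double get(int row, int col) {
        check(row, col);
        return rows.get(row).getOrDefault(col, 0.0);
    }

    /** Cheap: the row is already stored as a unit (read-only view). */
    public Map<Integer, Double> getRow(int row) {
        check(row, 0);
        return Collections.unmodifiableMap(rows.get(row));
    }

    /** Expensive by comparison: every row must be probed for this column. */
    public Map<Integer, Double> getColumn(int col) {
        check(0, col);
        Map<Integer, Double> column = new HashMap<>();
        for (int r = 0; r < numRows; r++) {
            Double v = rows.get(r).get(col);
            if (v != null) {
                column.put(r, v);
            }
        }
        return column;
    }
}
```

A column-oriented variant would simply swap the roles of the two indices, which is the symmetry the text describes between the row and column classes.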
+The Matrix interface does not currently provide invert or determinant
+methods, though these are desirable. It is arguable that the
+implementations of SparseRowMatrix and SparseColumnMatrix ought to use the
+HashMap<Integer, Vector> implementations and that SparseMatrix should
+instead use a HashMap<Integer, HashMap<Integer, Double>>. Other forms of
+sparse matrices can also be envisioned that support different storage and
+access characteristics. Because the arguments of assignColumn and assignRow
+operations accept all forms of Vector, it is possible to construct
+instances of sparse matrices containing dense rows or columns. See the
+JavaDocs for more complete definitions.
+
+For applications like PageRank/TextRank, iterative approaches to calculate
+eigenvectors would also be useful. Batching of row/column operations would
+also be useful, such as perhaps assignRow or assignColumn accepting
+UnaryFunction and BinaryFunction arguments.
+
+
+<a name="MatrixandVectorNeeds-Ideas"></a>
+## Ideas
+
+As Vector and Matrix implementations are currently memory-resident, very
+large instances greater than available memory are not supported. An
+extended set of implementations that use HBase (BigTable) in Hadoop to
+represent their instances would facilitate applications requiring such
+large collections.  
+See [MAHOUT-6](https://issues.apache.org/jira/browse/MAHOUT-6)
+See [Hama](http://wiki.apache.org/hadoop/Hama)
+
+
+<a name="MatrixandVectorNeeds-References"></a>
+## References
+
+Have a look at older parallel computing libraries such as 
[ScaLAPACK](http://www.netlib.org/scalapack/),
+among others.
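
The size() versus cardinality() distinction described for sparse vectors above can be illustrated with a minimal Python sketch of a hash-map-backed vector. The class and its method names are hypothetical and only mirror the description in the page; this is not Mahout's implementation.

```python
class SparseVector:
    """Hypothetical dict-backed sparse vector (illustration only, not Mahout's class)."""

    def __init__(self, cardinality):
        self._cardinality = cardinality  # number of dimensions the vector spans
        self._values = {}                # index -> stored non-zero value

    def _check(self, index):
        if not 0 <= index < self._cardinality:
            raise IndexError(index)

    def set(self, index, value):
        self._check(index)
        if value == 0.0:
            # Zeros are not stored, which keeps the map sparse.
            self._values.pop(index, None)
        else:
            self._values[index] = value

    def get(self, index):
        self._check(index)
        return self._values.get(index, 0.0)

    def size(self):
        """Number of elements currently stored."""
        return len(self._values)

    def cardinality(self):
        """Number of dimensions, regardless of how many are stored."""
        return self._cardinality

    def z_sum(self):
        """Sum of all elements (unstored elements contribute zero)."""
        return sum(self._values.values())


v = SparseVector(1000)
v.set(3, 1.5)
v.set(42, 2.5)
```

Here `v.size()` counts only the two stored entries, while `v.cardinality()` reports the full 1000 dimensions.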

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/users/basics/principal-components-analysis.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/old_site/users/basics/principal-components-analysis.md
 
b/website/old_site_migration/old_site/users/basics/principal-components-analysis.md
new file mode 100644
index 0000000..5a9383f
--- /dev/null
+++ 
b/website/old_site_migration/old_site/users/basics/principal-components-analysis.md
@@ -0,0 +1,29 @@
+---
+layout: default
+title: Principal Components Analysis
+theme:
+    name: retro-mahout
+---
+
+<a name="PrincipalComponentsAnalysis-PrincipalComponentsAnalysis"></a>
+# Principal Components Analysis
+
+PCA is used to reduce a high-dimensional data set to fewer dimensions. PCA
+can be used to identify patterns in data and to express the data in a
+lower-dimensional space, so that similarities and differences can be
+highlighted. It is commonly used in face recognition and image compression.
+There are several limitations to be aware of when working with PCA:
+
+* Linearity assumption - data is assumed to be linear combinations of some
+basis. There exist non-linear methods such as kernel PCA that alleviate
+that problem.
+* Principal components are assumed to be orthogonal. ICA tries to cope with
+this limitation.
+* Mean and covariance are assumed to be statistically important.
+* Large variances are assumed to have important dynamics.
+
+<a name="PrincipalComponentsAnalysis-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+<a name="PrincipalComponentsAnalysis-Designofpackages"></a>
+## Design of packages

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/users/basics/svd---singular-value-decomposition.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/old_site/users/basics/svd---singular-value-decomposition.md
 
b/website/old_site_migration/old_site/users/basics/svd---singular-value-decomposition.md
new file mode 100644
index 0000000..4a28934
--- /dev/null
+++ 
b/website/old_site_migration/old_site/users/basics/svd---singular-value-decomposition.md
@@ -0,0 +1,52 @@
+---
+layout: default
+title: SVD - Singular Value Decomposition
+theme:
+    name: retro-mahout
+---
+
+{excerpt}Singular Value Decomposition is a form of product decomposition of
+a matrix in which a rectangular matrix A is decomposed into a product U s
+V' where U and V are orthonormal and s is a diagonal matrix.{excerpt}  The
+values of A can be real or complex, but the real case dominates
+applications in machine learning.  The most prominent properties of the SVD
+are:
+
+  * The decomposition of any real matrix has only real values
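+  * The SVD is unique up to matching column permutations of U, s and V and sign flips of paired columns of U and V, provided the singular values are distinct and non-zero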
+  * If you take only the largest n values of s and set the rest to zero,
+you have a least squares approximation of A with rank n.  This allows SVD
+to be used very effectively in least squares regression and makes partial
+SVD useful.
+  * The SVD can be computed accurately for singular or nearly singular
+matrices.  For a matrix of rank n, only the first n singular values will be
+non-zero.  This allows SVD to be used for solution of singular linear
+systems.  The columns of V corresponding to zero singular values span the
+null space of A; the corresponding columns of U span the null space of A'.
+  * The partial SVD of very large matrices can be computed very quickly
+using stochastic decompositions.  See http://arxiv.org/abs/0909.4061v1 for
+details.  Gradient descent can also be used to compute partial SVDs and is
+very useful where some values of the matrix being decomposed are not known.
+
+In collaborative filtering and text retrieval, it is common to compute the
+partial decomposition of the user x item interaction matrix or the document
+x term matrix. This allows the projection of users and items (or documents
+and terms) into a common vector space representation that is often referred
+to as the latent semantic representation.  This process is sometimes called
+Latent Semantic Analysis and has been very effective in the analysis of the
+Netflix dataset.
+
+Dimension Reduction in Mahout:
+ * https://cwiki.apache.org/MAHOUT/dimensional-reduction.html
+
+ See Also:
+ * http://www.kwon3d.com/theory/jkinem/svd.html
+ * http://en.wikipedia.org/wiki/Singular_value_decomposition
+ * http://en.wikipedia.org/wiki/Latent_semantic_analysis
+ * http://en.wikipedia.org/wiki/Netflix_Prize
+ *
+http://www.amazon.com/Understanding-Complex-Datasets-Decompositions-Knowledge/dp/1584888326
+ * http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
+ *
+http://www.quora.com/What-s-the-best-parallelized-sparse-SVD-code-publicly-available
+ * [understanding Mahout Hadoop SVD 
thread](http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[email protected]%3E)
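
The low-rank approximation property listed in the SVD page above can be sketched in a few lines of plain Python. This is a minimal illustration using power iteration on A'A to estimate the largest singular triplet; it is not the stochastic method from the linked paper and not Mahout code. The function names and the example matrix are assumptions made for the illustration.

```python
import math


def top_singular_triplet(a, iterations=200):
    """Estimate the largest singular value of `a` and its singular vectors."""
    rows, cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(cols)] * cols  # arbitrary unit-length start vector
    for _ in range(iterations):
        # One power-iteration step on A'A: w = A' (A v), then normalize.
        av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(a[i][j] * av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector lies in the null space of the matrix")
        v = [x / norm for x in w]
    av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in av))  # singular value = |A v|
    u = [x / sigma for x in av]                # left singular vector
    return u, sigma, v


def rank_one(u, sigma, v):
    """Build the rank-1 matrix sigma * u * v'."""
    return [[sigma * ui * vj for vj in v] for ui in u]


# Example matrix (an assumption for illustration); its singular values are 3 and 1.
a = [[3.0, 0.0], [0.0, 1.0]]
u, sigma, v = top_singular_triplet(a)
approx = rank_one(u, sigma, v)

# Frobenius norm of the residual; for a rank-1 truncation of this matrix
# it equals the discarded singular value.
error = math.sqrt(
    sum((a[i][j] - approx[i][j]) ** 2 for i in range(2) for j in range(2))
)
```

Power iteration converges slowly when the two largest singular values are close, and it fails if the start vector is orthogonal to the leading singular vector, so treat this as a teaching sketch rather than a practical solver.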

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/users/basics/system-requirements.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/old_site/users/basics/system-requirements.md 
b/website/old_site_migration/old_site/users/basics/system-requirements.md
new file mode 100644
index 0000000..6bef40d
--- /dev/null
+++ b/website/old_site_migration/old_site/users/basics/system-requirements.md
@@ -0,0 +1,20 @@
+---
+layout: default
+title: System Requirements
+theme:
+    name: retro-mahout
+---
+
+
+# System Requirements
+
+* Java 1.6.x or greater.
+* Maven 3.x to build the source code.
+
+CPU, Disk and Memory requirements are based on the many choices made in
+implementing your application with Mahout (document size, number of
+documents, and number of hits retrieved, to name a few).
+
+Several of the Mahout algorithms are implemented to work on Hadoop
+clusters. Unless stated otherwise, those implementations work with
+Hadoop 0.20.0 or greater.

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/old_site/users/basics/tf-idf---term-frequency-inverse-document-frequency.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/old_site/users/basics/tf-idf---term-frequency-inverse-document-frequency.md
 
b/website/old_site_migration/old_site/users/basics/tf-idf---term-frequency-inverse-document-frequency.md
new file mode 100644
index 0000000..f807609
--- /dev/null
+++ 
b/website/old_site_migration/old_site/users/basics/tf-idf---term-frequency-inverse-document-frequency.md
@@ -0,0 +1,21 @@
+---
+layout: default
+title: TF-IDF - Term Frequency-Inverse Document Frequency
+theme:
+    name: retro-mahout
+---
+
+{excerpt}TF-IDF is a weight measure often used in information retrieval and text
+mining. This weight is a statistical measure used to evaluate how important
+a word is to a document in a collection or corpus. The importance increases
+proportionally to the number of times a word appears in the document but is
+offset by the frequency of the word in the corpus.{excerpt} In other words,
+if a term appears often in a document but also appears often in the
+corpus as a whole, it will get a lower score. Examples are
+"the", "and" and "it", but depending on your source material there may be
+other words that are very common in the subject matter.
+
+
+ See Also:
+ * http://en.wikipedia.org/wiki/Tf%E2%80%93idf
+ * http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html
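
The weighting described in the TF-IDF page above can be sketched as follows. This is a minimal Python illustration that assumes one common variant (raw term count multiplied by the natural log of N/df); the page itself does not fix a formula, and the function name and toy corpus are made up for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in documents:
        df.update(set(doc))

    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw term counts within this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Toy corpus (an assumption for illustration).
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
weights = tf_idf(docs)
```

In this corpus "the" occurs in every document, so its weight is zero even though it is the most frequent term in the first document, whereas "mat" occurs in only one document and receives the highest weight there.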
