Repository: mahout
Updated Branches:
  refs/heads/website 9c0314528 -> c81fc8b72


http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_priority/sparkbindings/faq.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/sparkbindings/faq.md 
b/website/old_site_migration/needs_work_priority/sparkbindings/faq.md
deleted file mode 100644
index 9649e3b..0000000
--- a/website/old_site_migration/needs_work_priority/sparkbindings/faq.md
+++ /dev/null
@@ -1,52 +0,0 @@
----
-layout: default
-title: FAQ
-theme:
-    name: retro-mahout
----
-
-# FAQ for using Mahout with Spark
-
-**Q: Mahout Spark shell doesn't start; "ClassNotFound" problems or various 
classpath problems.**
-
-**A:** As of this writing, every reported problem starting the Mahout Spark shell has come down to a classpath issue of one kind or another. 
-
-If you are getting errors that look like method signature mismatches, you most probably have a mismatch between Mahout's Spark dependency 
-and the Spark version actually installed. (At the time of this writing HEAD depends on Spark 1.1.0, but check mahout/pom.xml.)
-
-Troubleshooting general classpath issues is pretty straightforward. Since 
Mahout is using Spark's installation 
-and its classpath as reported by Spark itself for Spark-related dependencies, 
it is important to make sure 
-the classpath is sane and is made available to Mahout:
-
-1. Check that Spark is the correct version (the same as in Mahout's poms), is compiled, and that SPARK_HOME is set.
-2. Check that Mahout is compiled and MAHOUT_HOME is set.
-3. Run `$SPARK_HOME/bin/compute-classpath.sh` and make sure it produces a sane result with no errors. 
-If it outputs something other than a straightforward classpath string, most likely Spark is not compiled or set up correctly (later Spark versions require 
-`sbt/sbt assembly` to be run; simply running `sbt/sbt publish-local` is no longer enough).
-4. Run `$MAHOUT_HOME/bin/mahout -spark classpath` and check that the path reported in step (3) is included.
-
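Steps 3 and 4 above can be checked mechanically. The following is only a sketch: `classpath_missing` is a hypothetical helper (not part of Mahout or Spark), and it assumes both commands print a single colon-separated classpath string on one line.

```shell
#!/usr/bin/env bash
# Sketch: report Spark classpath entries that Mahout's classpath lacks.
# classpath_missing is a hypothetical helper, not part of Mahout or Spark.

# Print each entry of colon-separated classpath $1 that is absent from $2.
classpath_missing() {
  local spark_cp="$1" mahout_cp="$2" entry
  local -a entries
  # Split on ':' into an array; this avoids glob expansion of entries like lib/*
  IFS=':' read -r -a entries <<< "$spark_cp"
  for entry in "${entries[@]}"; do
    [ -z "$entry" ] && continue
    case ":$mahout_cp:" in
      *":$entry:"*) ;;                 # entry present, nothing to report
      *) printf '%s\n' "$entry" ;;     # entry missing from Mahout's classpath
    esac
  done
}

# Run the real check only when both installations look usable (steps 1 and 2).
if [ -x "${SPARK_HOME:-}/bin/compute-classpath.sh" ] && [ -x "${MAHOUT_HOME:-}/bin/mahout" ]; then
  spark_cp="$("$SPARK_HOME/bin/compute-classpath.sh")"        # step 3
  mahout_cp="$("$MAHOUT_HOME/bin/mahout" -spark classpath)"   # step 4
  classpath_missing "$spark_cp" "$mahout_cp"
fi
```

If the script prints nothing, every entry from step 3 was found in the output of step 4; any printed path is a candidate cause of the classpath problem.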
-**Q: I am using the command line Mahout jobs that run on Spark or am writing 
my own application that uses 
-Mahout's Spark code. When I run the code on my cluster I get ClassNotFound or 
signature errors during serialization. 
-What's wrong?**
- 
-**A:** The Spark artifacts in the maven ecosystem may not match the exact 
binary you are running on your cluster. This may 
-cause class name or version mismatches. In this case you may wish 
-to build Spark yourself to guarantee that you are running exactly what you are 
building Mahout against. To do this follow these steps
-in order:
-
-1. Build Spark with maven, but **do not** use the "package" target as described on the Spark site. Build with the "clean install" target instead. 
-Something like: "mvn clean install -Dhadoop.version=1.2.1" or whatever your particular build options are. This will put the jars for Spark
-in the local maven cache.
-2. Deploy **your** Spark build to your cluster and test it there.
-3. Build Mahout. This will cause maven to pull the jars for Spark from the local maven cache and may resolve missing 
-or mis-identified classes.
-4. If you are building your own code, do so against the local builds of Spark and Mahout.
-
-**Q: The implicit SparkContext 'sc' does not work in the Mahout spark-shell.**
-
-**A:** In the Mahout spark-shell the SparkContext is called 'sdc', where the 
'd' stands for distributed. 
-
-
-
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_priority/sparkbindings/home.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/sparkbindings/home.md 
b/website/old_site_migration/needs_work_priority/sparkbindings/home.md
deleted file mode 100644
index 5075612..0000000
--- a/website/old_site_migration/needs_work_priority/sparkbindings/home.md
+++ /dev/null
@@ -1,101 +0,0 @@
----
-layout: default
-title: Spark Bindings
-theme:
-    name: retro-mahout
----
-
-# Scala & Spark Bindings:
-*Bringing algebraic semantics*
-
-## What is Scala & Spark Bindings?
-
-In short, Scala & Spark Bindings for Mahout is a Scala DSL and algebraic optimizer for expressions like this (an actual formula from **(d)spca**):
-        
-
-`\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]`
-
-bound to in-core and distributed computations (currently, on Apache Spark).
-
-
-Mahout Scala & Spark Bindings expression of the above:
-
-        val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
-
-The main idea is that a scientist writing algebraic expressions need not care about distributed 
-operation plans and works **entirely on the logical level**, just as he or she would with R.
-
-Another idea is decoupling the logical expression from the distributed back-end. As more back-ends are added, 
-this implies **"write once, run everywhere"**.
-
-The linear algebra side works with scalars, in-core vectors and matrices, and 
Mahout Distributed
-Row Matrices (DRMs).
-
-The ecosystem of operators is built in R's image, i.e. it follows R naming such as %*%, 
-colSums, nrow, length operating over vectors or matrices. 
-
-An important part of Spark Bindings is the expression optimizer. It looks at an expression as a whole 
-and figures out how it can be simplified and which physical operators should be picked. For example,
-there are currently about 5 different physical operators performing DRM-DRM multiplication,
-picked based on matrix geometry, distributed dataset partitioning, orientation etc. 
-If we count in DRM by in-core combinations, that would be another 4, i.e. 9 total -- all of it for just 
-the simple x %*% y logical notation.
-
-
-
-Please refer to the documentation for details.
-
-## Status
-
-This environment mostly addresses R-like linear algebra optimizations for 
-Spark, Flink and H2O.
-
-
-## Documentation
-
-* Scala and Spark bindings manual: 
[web](http://apache.github.io/mahout/doc/ScalaSparkBindings.html), 
[pdf](ScalaSparkBindings.pdf)
-* Overview blog on 0.10.x releases: 
[blog](http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html)
-
-## Distributed methods and solvers using Bindings
-
-* In-core ([ssvd]) and Distributed ([dssvd]) Stochastic SVD -- guinea pigs -- 
see the bindings manual
-* In-core ([spca]) and Distributed ([dspca]) Stochastic PCA -- guinea pigs -- 
see the bindings manual
-* Distributed thin QR decomposition ([dqrThin]) -- guinea pig -- see the 
bindings manual 
-* [Current list of 
algorithms](https://mahout.apache.org/users/basics/algorithms.html)
-
-[ssvd]: 
https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala
-[spca]: 
https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala
-[dssvd]: 
https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSSVD.scala
-[dspca]: 
https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSPCA.scala
-[dqrThin]: 
https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DQR.scala
-
-
-## Related history of note 
-
-* CLI and Driver for Spark version of item similarity -- 
[MAHOUT-1541](https://issues.apache.org/jira/browse/MAHOUT-1541)
-* Command line interface for generalizable Spark pipelines -- 
[MAHOUT-1569](https://issues.apache.org/jira/browse/MAHOUT-1569)
-* Cooccurrence Analysis / Item-based Recommendation -- 
[MAHOUT-1464](https://issues.apache.org/jira/browse/MAHOUT-1464)
-* Spark Bindings -- 
[MAHOUT-1346](https://issues.apache.org/jira/browse/MAHOUT-1346)
-* Scala Bindings -- 
[MAHOUT-1297](https://issues.apache.org/jira/browse/MAHOUT-1297)
-* Interactive Scala & Spark Bindings Shell & Script processor -- 
[MAHOUT-1489](https://issues.apache.org/jira/browse/MAHOUT-1489)
-* OLS tutorial using Mahout shell -- 
[MAHOUT-1542](https://issues.apache.org/jira/browse/MAHOUT-1542)
-* Full abstraction of DRM apis and algorithms from a distributed engine -- 
[MAHOUT-1529](https://issues.apache.org/jira/browse/MAHOUT-1529)
-* Port Naive Bayes -- 
[MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)
-
-## Work in progress 
-* Text-delimited files for input and output -- 
[MAHOUT-1568](https://issues.apache.org/jira/browse/MAHOUT-1568)
-<!-- * Weighted (Implicit Feedback) ALS -- 
[MAHOUT-1365](https://issues.apache.org/jira/browse/MAHOUT-1365) -->
-<!--* Data frame R-like bindings -- 
[MAHOUT-1490](https://issues.apache.org/jira/browse/MAHOUT-1490) -->
-
-* *Your issue here!*
-
-<!-- ## Stuff wanted: 
-* Data frame R-like bindings (similarly to linalg bindings)
-* Stat R-like bindings (perhaps we can just adapt to commons.math stat)
-* **BYODMs:** Bring Your Own Distributed Method on SparkBindings! 
-* In-core jBlas matrix adapter
-* In-core GPU matrix adapters -->
-
-
-
-  
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md
 
b/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md
deleted file mode 100644
index 3cdb8f7..0000000
--- 
a/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md
+++ /dev/null
@@ -1,199 +0,0 @@
----
-layout: default
-title: Playing with Mahout's Spark Shell
-theme:
-    name: retro-mahout
----
-# Playing with Mahout's Spark Shell
-
-This tutorial will show you how to play with Mahout's Scala DSL for linear algebra and its Spark shell. **Please keep in mind that this code is still in a very early experimental stage**.
-
-_(Edited for 0.10.2)_
-
-## Intro
-
-We'll use an excerpt of a publicly available [dataset about 
cereals](http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html). The dataset 
tells the protein, fat, carbohydrate and sugars (in milligrams) contained in a 
set of cereals, as well as a customer rating for the cereals. Our aim for this 
example is to fit a linear model which infers the customer rating from the 
ingredients.
-
-
-Name                    | protein | fat | carbo | sugars | rating
-:-----------------------|:--------|:----|:------|:-------|:---------
-Apple Cinnamon Cheerios | 2       | 2   | 10.5  | 10     | 29.509541
-Cap'n'Crunch            | 1       | 2   | 12    | 12     | 18.042851
-Cocoa Puffs             | 1       | 1   | 12    | 13     | 22.736446
-Froot Loops             | 2       | 1   | 11    | 13     | 32.207582
-Honey Graham Ohs        | 1       | 2   | 12    | 11     | 21.871292
-Wheaties Honey Gold     | 2       | 1   | 16    | 8      | 36.187559
-Cheerios                | 6       | 2   | 17    | 1      | 50.764999
-Clusters                | 3       | 2   | 13    | 7      | 40.400208
-Great Grains Pecan      | 3       | 3   | 13    | 4      | 45.811716
-
-
-## Installing Mahout & Spark on your local machine
-
-We describe how to do a quick toy setup of Spark & Mahout on your local 
machine, so that you can run this example and play with the shell. 
-
- 1. Download [Apache Spark 1.6.2](http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz) and unpack the archive file
- 1. Change to the directory where you unpacked Spark and type ```sbt/sbt assembly``` to build it
- 1. Create a directory for Mahout somewhere on your machine, change to there and check out the master branch of Apache Mahout from GitHub: ```git clone https://github.com/apache/mahout mahout```
- 1. Change to the ```mahout``` directory and build Mahout using ```mvn -DskipTests clean install```
- 
-## Starting Mahout's Spark shell
-
- 1. Go to the directory where you unpacked Spark and type ```sbin/start-all.sh``` to locally start Spark
- 1. Open a browser and point it to [http://localhost:8080/](http://localhost:8080/) to check whether Spark started successfully. Copy the URL of the Spark master at the top of the page (it starts with **spark://**)
- 1. Define the following environment variables: <pre class="codehilite">export MAHOUT_HOME=[directory into which you checked out Mahout]
-export SPARK_HOME=[directory where you unpacked Spark]
-export MASTER=[url of the Spark master]
-</pre>
- 1. Finally, change to the directory where you unpacked Mahout and type ```bin/mahout spark-shell```. 
-You should see the shell starting and get the prompt ```mahout> ```. Check the 
-[FAQ](http://mahout.apache.org/users/sparkbindings/faq.html) for further troubleshooting.
-
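Before launching the shell it can help to confirm that the three variables are in place. This is a minimal sketch, not part of Mahout: `check_mahout_env` is a hypothetical helper, and it only checks that `MASTER` has the **spark://** form described in the steps above.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check the three variables before starting the Mahout shell.
# check_mahout_env is a hypothetical helper, not part of Mahout.

# Usage: check_mahout_env MAHOUT_HOME SPARK_HOME MASTER
# Prints one line per problem and returns non-zero if any were found.
check_mahout_env() {
  local mahout_home="$1" spark_home="$2" master="$3" status=0
  if [ -z "$mahout_home" ]; then echo "MAHOUT_HOME is not set"; status=1; fi
  if [ -z "$spark_home" ]; then echo "SPARK_HOME is not set"; status=1; fi
  case "$master" in
    spark://*) ;;   # the form copied from the Spark master web page
    *) echo "MASTER does not start with spark://"; status=1 ;;
  esac
  return "$status"
}

# Check the current environment; report problems without aborting the session.
check_mahout_env "${MAHOUT_HOME:-}" "${SPARK_HOME:-}" "${MASTER:-}" || true
```

The function takes the values as arguments rather than reading the environment directly, so it can be tried against candidate values before exporting them.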
-## Implementation
-
-We'll use the shell to interactively play with the data and incrementally 
implement a simple [linear 
regression](https://en.wikipedia.org/wiki/Linear_regression) algorithm. Let's 
first load the dataset. Usually, we wouldn't need Mahout unless we processed a 
large dataset stored in a distributed filesystem. But for the sake of this 
example, we'll use our tiny toy dataset and "pretend" it was too big to fit 
onto a single machine.
-
-*Note: You can incrementally follow the example by copy-and-pasting the code 
into your running Mahout shell.*
-
-Mahout's linear algebra DSL has an abstraction called *DistributedRowMatrix 
(DRM)* which models a matrix that is partitioned by rows and stored in the 
memory of a cluster of machines. We use ```dense()``` to create a dense 
in-memory matrix from our toy dataset and use ```drmParallelize``` to load it 
into the cluster, "mimicking" a large, partitioned dataset.
-
-<div class="codehilite"><pre>
-val drmData = drmParallelize(dense(
-  (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
-  (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
-  (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
-  (2, 1, 11,   13, 32.207582),  // Froot Loops
-  (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
-  (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
-  (6, 2, 17,   1,  50.764999),  // Cheerios
-  (3, 2, 13,   7,  40.400208),  // Clusters
-  (3, 3, 13,   4,  45.811716)), // Great Grains Pecan
-  numPartitions = 2);
-</pre></div>
-
-Have a look at this matrix. The first four columns represent the ingredients 
-(our features) and the last column (the rating) is the target variable for 
-our regression. [Linear 
regression](https://en.wikipedia.org/wiki/Linear_regression) 
-assumes that the **target variable** `\(\mathbf{y}\)` is generated by the 
-linear combination of **the feature matrix** `\(\mathbf{X}\)` with the 
-**parameter vector** `\(\boldsymbol{\beta}\)` plus the
- **noise** `\(\boldsymbol{\varepsilon}\)`, summarized in the formula 
-`\(\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\)`. 
-Our goal is to find an estimate of the parameter vector 
-`\(\boldsymbol{\beta}\)` that explains the data very well.
-
-As a first step, we extract `\(\mathbf{X}\)` and `\(\mathbf{y}\)` from our data matrix. We get *X* by slicing: we take all rows (denoted by ```::```) and the first four columns, which have the ingredients in milligrams as content. Note that the result is again a DRM. The shell will not execute this code yet; it saves the history of operations and defers the execution until we really access a result. **Mahout's DSL automatically optimizes and parallelizes all operations on DRMs and runs them on Apache Spark.**
-
-<div class="codehilite"><pre>
-val drmX = drmData(::, 0 until 4)
-</pre></div>
-
-Next, we extract the target variable vector *y*, the fifth column of the data 
matrix. We assume this one fits into our driver machine, so we fetch it into 
memory using ```collect```:
-
-<div class="codehilite"><pre>
-val y = drmData.collect(::, 4)
-</pre></div>
-
-Now we are ready to think about a mathematical way to estimate the parameter 
vector *β*. A simple textbook approach is [ordinary least squares 
(OLS)](https://en.wikipedia.org/wiki/Ordinary_least_squares), which minimizes 
the sum of residual squares between the true target variable and the prediction 
of the target variable. In OLS, there is even a closed form expression for 
estimating `\(\boldsymbol{\beta}\)` as 
-`\(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}\)`.
-
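For reference, the closed form follows from a short standard derivation. The objective name `\(S\)` is notation introduced here for illustration; it does not appear in the original tutorial.

```latex
\[
S(\boldsymbol{\beta})=\left\Vert \mathbf{y}-\mathbf{X}\boldsymbol{\beta}\right\Vert _{2}^{2},
\qquad
\nabla S(\boldsymbol{\beta})=-2\,\mathbf{X}^{\top}\left(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\right)=\mathbf{0}
\;\Longrightarrow\;
\mathbf{X}^{\top}\mathbf{X}\,\boldsymbol{\beta}=\mathbf{X}^{\top}\mathbf{y}
\]
```

When `\(\mathbf{X}^{\top}\mathbf{X}\)` is invertible, multiplying both sides by its inverse gives the estimate stated above. This also shows why only the two products `\(\mathbf{X}^{\top}\mathbf{X}\)` and `\(\mathbf{X}^{\top}\mathbf{y}\)` are needed.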
-The first thing we compute for this is `\(\mathbf{X}^{\top}\mathbf{X}\)`. The code for doing this in Mahout's Scala DSL maps directly to the mathematical formula. The operation ```.t()``` transposes a matrix and, analogous to R, ```%*%``` denotes matrix multiplication.
-
-<div class="codehilite"><pre>
-val drmXtX = drmX.t %*% drmX
-</pre></div>
-
-The same is true for computing `\(\mathbf{X}^{\top}\mathbf{y}\)`. We can simply type the math in Scala expressions into the shell. Here, *X* lives in the cluster, while *y* is in the memory of the driver, and the result is a DRM again.
-<div class="codehilite"><pre>
-val drmXty = drmX.t %*% y
-</pre></div>
-
-We're nearly done. The next step we take is to fetch `\(\mathbf{X}^{\top}\mathbf{X}\)` and 
-`\(\mathbf{X}^{\top}\mathbf{y}\)` into the memory of our driver machine (we are targeting 
-feature matrices that are tall and skinny, 
-so we can assume that `\(\mathbf{X}^{\top}\mathbf{X}\)` is small enough 
-to fit in). Then, we provide them to an in-memory solver (Mahout provides 
-an analog to R's ```solve()``` for that) which computes ```beta```, our 
-OLS estimate of the parameter vector `\(\boldsymbol{\beta}\)`.
-
-<div class="codehilite"><pre>
-val XtX = drmXtX.collect
-val Xty = drmXty.collect(::, 0)
-
-val beta = solve(XtX, Xty)
-</pre></div>
-
-That's it! We have implemented a distributed linear regression algorithm 
-on Apache Spark. I hope you agree that we didn't have to worry a lot about 
-parallelization and distributed systems. The goal of Mahout's linear algebra 
-DSL is to abstract away the ugliness of programming a distributed system 
-as much as possible, while still retaining decent performance and 
-scalability.
-
-We can now check how well our model fits its training data. 
-First, we multiply the feature matrix `\(\mathbf{X}\)` by our estimate of 
-`\(\boldsymbol{\beta}\)`. Then, we look at the difference (via L2-norm) of 
-the target variable `\(\mathbf{y}\)` to the fitted target variable:
-
-<div class="codehilite"><pre>
-val yFitted = (drmX %*% beta).collect(::, 0)
-(y - yFitted).norm(2)
-</pre></div>
-
-We hope we have shown that Mahout's shell allows people to interactively and incrementally write algorithms. We have entered a lot of individual commands, one by one, until we got the desired results. We can now refactor a little by wrapping our statements into easy-to-use functions. The definition of functions follows standard Scala syntax. 
-
-We put all the commands for ordinary least squares into a function ```ols```. 
-
-<div class="codehilite"><pre>
-def ols(drmX: DrmLike[Int], y: Vector) = 
-  solve(drmX.t %*% drmX, drmX.t %*% y)(::, 0)
-
-</pre></div>
-
-Note that the DSL applies an implicit `collect` if coercion rules require an in-core argument. Hence, we can simply
-skip explicit `collect`s. 
-
-Next, we define a function ```goodnessOfFit``` that tells how well a model 
fits the target variable:
-
-<div class="codehilite"><pre>
-def goodnessOfFit(drmX: DrmLike[Int], beta: Vector, y: Vector) = {
-  val fittedY = (drmX %*% beta).collect(::, 0)
-  (y - fittedY).norm(2)
-}
-</pre></div>
-
-So far we have left out an important aspect of a standard linear regression 
-model. Usually there is a constant bias term added to the model. Without 
-that, our model always passes through the origin and we only learn the 
-slope. An easy way to add such a bias term to our model is to add a 
-column of ones to the feature matrix `\(\mathbf{X}\)`. 
-The corresponding weight in the parameter vector will then be the bias term.
-
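In matrix terms, a column of ones turns the bias into one more weight. A short sketch, where `\(\beta_{0}\)` (the bias) and `\(\mathbf{X}_{b}\)` (the augmented feature matrix) are names introduced here for illustration:

```latex
\[
\mathbf{y}
=\mathbf{X}\boldsymbol{\beta}+\beta_{0}\mathbf{1}+\boldsymbol{\varepsilon}
=\underbrace{\begin{bmatrix}\mathbf{X} & \mathbf{1}\end{bmatrix}}_{\mathbf{X}_{b}}
\begin{bmatrix}\boldsymbol{\beta}\\ \beta_{0}\end{bmatrix}
+\boldsymbol{\varepsilon}
\]
```

If the column of ones is placed on the right, as in this sketch, the bias is the last entry of the fitted parameter vector, and the same OLS formula applies unchanged with `\(\mathbf{X}_{b}\)` in place of `\(\mathbf{X}\)`.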
-Here is how we add a bias column:
-
-<div class="codehilite"><pre>
-val drmXwithBiasColumn = drmX cbind 1
-</pre></div>
-
-Now we can give the newly created DRM ```drmXwithBiasColumn``` to our model 
fitting method ```ols``` and see how well the resulting model fits the training 
data with ```goodnessOfFit```. You should see a large improvement in the result.
-
-<div class="codehilite"><pre>
-val betaWithBiasTerm = ols(drmXwithBiasColumn, y)
-goodnessOfFit(drmXwithBiasColumn, betaWithBiasTerm, y)
-</pre></div>
-
-As a further optimization, we can make use of the DSL's caching functionality. We use ```drmXwithBiasColumn``` repeatedly as input to a computation, so it might be beneficial to cache it in memory. This is achieved by calling ```checkpoint()```. In the end, we remove it from the cache with ```uncache()```:
-
-<div class="codehilite"><pre>
-val cachedDrmX = drmXwithBiasColumn.checkpoint()
-
-val betaWithBiasTerm = ols(cachedDrmX, y)
-val goodness = goodnessOfFit(cachedDrmX, betaWithBiasTerm, y)
-
-cachedDrmX.uncache()
-
-goodness
-</pre></div>
-
-
-Liked what you saw? Check out Mahout's overview for the [Scala and Spark bindings](https://mahout.apache.org/users/sparkbindings/home.html).
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md
 
b/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md
deleted file mode 100644
index 9df07da..0000000
--- 
a/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md
+++ /dev/null
@@ -1,57 +0,0 @@
----
-layout: default
-title: Wikipedia XML parser and Naive Bayes Example
-theme:
-    name: retro-mahout
----
-# Wikipedia XML parser and Naive Bayes Classifier Example
-
-## Introduction
-Mahout has an [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1] which will download a recent XML dump of the [English Wikipedia database](http://dumps.wikimedia.org/enwiki/latest/) (the entire dump, if desired). After running the classification script, you can use the [document classification script](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala) from the Mahout [spark-shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html) to vectorize and classify text from outside of the training and testing corpus using a model built on the Wikipedia dataset.  
-
-You can run this script to build and test a Naive Bayes classifier for option 
(1) 10 arbitrary countries or option (2) 2 countries (United States and United 
Kingdom).
-
-## Overview
-
-To run the example, simply execute the `$MAHOUT_HOME/examples/bin/classify-wikipedia.sh` script.
-
-By default the script is set to run on a medium-sized Wikipedia XML dump.  To run on the full set (the entire English Wikipedia) you can change the download by commenting out line 78 and uncommenting line 80 of [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1]. However, this is not recommended unless you have the resources to do so. *Be sure to clean your work directory when changing datasets (option 3).*
-
-The step-by-step process for creating a Naive Bayes classifier for the Wikipedia XML dump is very similar to that for [creating a 20 Newsgroups Classifier](http://mahout.apache.org/users/classification/twenty-newsgroups.html) [4].  The only difference is that instead of running `$ mahout seqdirectory` on the unzipped 20 Newsgroups file, you'll run `$ mahout seqwiki` on the unzipped Wikipedia XML dump.
-
-    $ mahout seqwiki 
-
-The above command launches `WikipediaToSequenceFile.java`, which accepts a text file of categories [3] and starts an MR job to parse each document in the XML file.  This process will seek to extract documents with a Wikipedia category tag which (exactly, if the `-exactMatchOnly` option is set) matches a line in the category file.  If no match is found and the `-all` option is set, the document will be dumped into an "unknown" category. The documents will then be written out as a `<Text,Text>` sequence file of the form (K:/category/document_title , V: document).
-
-There are 3 different example category files available in the /examples/src/test/resources
-directory:  country.txt, country10.txt and country2.txt.  You can edit these categories to extract a different corpus from the Wikipedia dataset.
-
-The CLI options for `seqwiki` are as follows:
-
-    --input          (-i)         input pathname String
-    --output         (-o)         the output pathname String
-    --categories     (-c)         the file containing the Wikipedia categories
-    --exactMatchOnly (-e)         if set, then the Wikipedia category must match
-                                    exactly instead of simply containing the category string
-    --all            (-all)       if set select all categories
-    --removeLabels   (-rl)        if set, remove [[Category:labels]] from document text after extracting label.
-
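As an illustration of how the options listed above combine, here is a dry-run sketch that only prints a `seqwiki` command line rather than launching the job. `build_seqwiki_cmd` and every path in it are hypothetical; only the `-i`, `-o`, `-c` and `-e` flags come from the option list.

```shell
#!/usr/bin/env bash
# Sketch: assemble a seqwiki invocation from the options listed above.
# build_seqwiki_cmd and all paths are hypothetical, for illustration only.

# Usage: build_seqwiki_cmd INPUT_XML OUTPUT_DIR CATEGORY_FILE [exact]
build_seqwiki_cmd() {
  local input="$1" output="$2" categories="$3" exact="${4:-}"
  local cmd="mahout seqwiki -i $input -o $output -c $categories"
  # -e requires the category to match exactly rather than as a substring
  if [ "$exact" = "exact" ]; then cmd="$cmd -e"; fi
  printf '%s\n' "$cmd"
}

# Dry run: print the command instead of starting the MapReduce job.
build_seqwiki_cmd /tmp/wiki/enwiki.xml /tmp/wiki/wikipediainput /tmp/wiki/country10.txt exact
```

The helper builds a plain string for display, so it is not safe for paths containing spaces; it is meant for inspecting the command before running it by hand.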
-
-After `seqwiki`, the script runs `seq2sparse`, `split`, `trainnb` and `testnb` 
as in the [step by step 20newsgroups 
example](http://mahout.apache.org/users/classification/twenty-newsgroups.html). 
 When all of the jobs have finished, a confusion matrix will be displayed.
-
-## Resources
-
-[1] 
[classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh)
-
-[2] [Document classification script for the Mahout Spark 
Shell](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala)
-
-[3] [Example category 
file](https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt)
-
-[4] [Step by step instructions for building a Naive Bayes classifier for 
20newsgroups from the command 
line](http://mahout.apache.org/users/classification/twenty-newsgroups.html)
-
-[5] [Mahout MapReduce Naive 
Bayes](http://mahout.apache.org/users/classification/bayesian.html)
-
-[6] [Mahout Spark Naive 
Bayes](http://mahout.apache.org/users/algorithms/spark-naive-bayes.html)
-
-[7] [Mahout Scala Spark and H2O 
Bindings](http://mahout.apache.org/users/sparkbindings/home.html)
-
