Repository: mahout
Updated Branches:
  refs/heads/website 9c0314528 -> c81fc8b72
http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_priority/sparkbindings/faq.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/sparkbindings/faq.md b/website/old_site_migration/needs_work_priority/sparkbindings/faq.md
deleted file mode 100644
index 9649e3b..0000000
--- a/website/old_site_migration/needs_work_priority/sparkbindings/faq.md
+++ /dev/null
@@ -1,52 +0,0 @@
----
-layout: default
-title: FAQ
-theme:
-    name: retro-mahout
----
-
-# FAQ for using Mahout with Spark
-
-**Q: Mahout Spark shell doesn't start; "ClassNotFound" problems or various classpath problems.**
-
-**A:** As of this writing, all reported problems starting the Spark shell in Mahout have revolved
-around classpath issues in one way or another.
-
-If you are getting errors that look like method signature mismatches, you most probably have a mismatch between Mahout's Spark dependency
-and the Spark version actually installed. (At the time of this writing the HEAD depends on Spark 1.1.0, but check mahout/pom.xml.)
-
-Troubleshooting general classpath issues is pretty straightforward. Since Mahout uses Spark's installation
-and its classpath, as reported by Spark itself, for Spark-related dependencies, it is important to make sure
-the classpath is sane and is made available to Mahout:
-
-1. Check that Spark is the correct version (the same as in Mahout's poms), is compiled, and that SPARK_HOME is set.
-2. Check that Mahout is compiled and MAHOUT_HOME is set.
-3. Run `$SPARK_HOME/bin/compute-classpath.sh` and make sure it produces a sane result with no errors.
-If it outputs something other than a straightforward classpath string, most likely Spark is not compiled or set up correctly (later Spark versions require
-`sbt/sbt assembly` to be run; simply running `sbt/sbt publish-local` is no longer enough).
-4. Run `$MAHOUT_HOME/bin/mahout -spark classpath` and check that the path reported in step (3) is included.
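Steps 3 and 4 above can be sketched as a small bash helper. This is only an illustration under stated assumptions: the function name `classpath_is_subset` is hypothetical, both commands are assumed to print a single colon-separated classpath string, and `compute-classpath.sh` is taken from the text above and may not exist in every Spark release.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every entry of the colon-separated
# classpath in $1 also appears as a whole entry in the classpath in $2.
classpath_is_subset() {
  local needle_cp="$1" haystack_cp="$2" entry
  local -a entries
  # Split on ':' without triggering glob expansion of entries such as lib/*.
  IFS=':' read -r -a entries <<< "$needle_cp"
  for entry in "${entries[@]}"; do
    [ -z "$entry" ] && continue
    case ":$haystack_cp:" in
      *":$entry:"*) ;;  # entry found, keep checking
      *) echo "missing from Mahout classpath: $entry" >&2; return 1 ;;
    esac
  done
  return 0
}

# Only runs when both installations are present (assumption: the two
# commands named in steps 3 and 4 each print one classpath line).
if [ -x "${SPARK_HOME:-}/bin/compute-classpath.sh" ] && [ -x "${MAHOUT_HOME:-}/bin/mahout" ]; then
  spark_cp="$("$SPARK_HOME/bin/compute-classpath.sh")"
  mahout_cp="$("$MAHOUT_HOME/bin/mahout" -spark classpath)"
  classpath_is_subset "$spark_cp" "$mahout_cp" && echo "Spark classpath is included"
fi
```

Entries are compared as whole path strings, so a jar that is reachable only through a differently spelled path (for example via a symlink) would be reported as missing.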
-
-**Q: I am using the command line Mahout jobs that run on Spark or am writing my own application that uses
-Mahout's Spark code. When I run the code on my cluster I get ClassNotFound or signature errors during serialization.
-What's wrong?**
-
-**A:** The Spark artifacts in the Maven ecosystem may not match the exact binary you are running on your cluster. This may
-cause class name or version mismatches. In this case you may wish
-to build Spark yourself to guarantee that you are running exactly what you are building Mahout against. To do this, follow these steps
-in order:
-
-1. Build Spark with Maven, but **do not** use the "package" target as described on the Spark site. Build with the "clean install" target instead.
-Something like: "mvn clean install -Dhadoop1.2.1" or whatever your particular build options are. This will put the jars for Spark
-in the local Maven cache.
-2. Deploy **your** Spark build to your cluster and test it there.
-3. Build Mahout. This will cause Maven to pull the jars for Spark from the local Maven cache and may resolve missing
-or mis-identified classes.
-4. If you are building your own code, do so against the local builds of Spark and Mahout.
-
-**Q: The implicit SparkContext 'sc' does not work in the Mahout spark-shell.**
-
-**A:** In the Mahout spark-shell the SparkContext is called 'sdc', where the 'd' stands for distributed.
-
-
-
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_priority/sparkbindings/home.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/sparkbindings/home.md b/website/old_site_migration/needs_work_priority/sparkbindings/home.md
deleted file mode 100644
index 5075612..0000000
--- a/website/old_site_migration/needs_work_priority/sparkbindings/home.md
+++ /dev/null
@@ -1,101 +0,0 @@
----
-layout: default
-title: Spark Bindings
-theme:
-    name: retro-mahout
----
-
-# Scala & Spark Bindings:
-*Bringing algebraic semantics*
-
-## What is Scala & Spark Bindings?
-
-In short, Scala & Spark Bindings for Mahout is a Scala DSL and algebraic optimizer for expressions like this one (an actual formula from **(d)spca**):
-
-
-`\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]`
-
-bound to in-core and distributed computations (currently, on Apache Spark).
-
-
-Mahout Scala & Spark Bindings expression of the above:
-
-    val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
-
-The main idea is that a scientist writing algebraic expressions does not need to care about distributed
-operation plans and works **entirely on the logical level**, just as he or she would with R.
-
-Another idea is decoupling the logical expression from the distributed back-end. As more back-ends are added,
-this implies **"write once, run everywhere"**.
-
-The linear algebra side works with scalars, in-core vectors and matrices, and Mahout Distributed
-Row Matrices (DRMs).
-
-The ecosystem of operators is built in R's image, i.e. it follows R naming such as %*%,
-colSums, nrow, length operating over vectors or matrices.
-
-An important part of Spark Bindings is the expression optimizer.
It looks at the expression as a whole
-and figures out how it can be simplified, and which physical operators should be picked. For example,
-there are currently about 5 different physical operators performing DRM-DRM multiplication,
-picked based on matrix geometry, distributed dataset partitioning, orientation etc.
-If we count in DRM by in-core combinations, that would be another 4, i.e. 9 total -- all of it for just
-the simple x %*% y logical notation.
-
-
-
-Please refer to the documentation for details.
-
-## Status
-
-This environment addresses mostly R-like Linear Algebra optimizations for
-Spark, Flink and H2O.
-
-
-## Documentation
-
-* Scala and Spark bindings manual: [web](http://apache.github.io/mahout/doc/ScalaSparkBindings.html), [pdf](ScalaSparkBindings.pdf)
-* Overview blog on 0.10.x releases: [blog](http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html)
-
-## Distributed methods and solvers using Bindings
-
-* In-core ([ssvd]) and Distributed ([dssvd]) Stochastic SVD -- guinea pigs -- see the bindings manual
-* In-core ([spca]) and Distributed ([dspca]) Stochastic PCA -- guinea pigs -- see the bindings manual
-* Distributed thin QR decomposition ([dqrThin]) -- guinea pig -- see the bindings manual
-* [Current list of algorithms](https://mahout.apache.org/users/basics/algorithms.html)
-
-[ssvd]: https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala
-[spca]: https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala
-[dssvd]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSSVD.scala
-[dspca]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSPCA.scala
-[dqrThin]: 
https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DQR.scala
-
-
-## Related history of note
-
-* CLI and Driver for Spark version of item similarity -- [MAHOUT-1541](https://issues.apache.org/jira/browse/MAHOUT-1541)
-* Command line interface for generalizable Spark pipelines -- [MAHOUT-1569](https://issues.apache.org/jira/browse/MAHOUT-1569)
-* Cooccurrence Analysis / Item-based Recommendation -- [MAHOUT-1464](https://issues.apache.org/jira/browse/MAHOUT-1464)
-* Spark Bindings -- [MAHOUT-1346](https://issues.apache.org/jira/browse/MAHOUT-1346)
-* Scala Bindings -- [MAHOUT-1297](https://issues.apache.org/jira/browse/MAHOUT-1297)
-* Interactive Scala & Spark Bindings Shell & Script processor -- [MAHOUT-1489](https://issues.apache.org/jira/browse/MAHOUT-1489)
-* OLS tutorial using Mahout shell -- [MAHOUT-1542](https://issues.apache.org/jira/browse/MAHOUT-1542)
-* Full abstraction of DRM apis and algorithms from a distributed engine -- [MAHOUT-1529](https://issues.apache.org/jira/browse/MAHOUT-1529)
-* Port Naive Bayes -- [MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)
-
-## Work in progress
-* Text-delimited files for input and output -- [MAHOUT-1568](https://issues.apache.org/jira/browse/MAHOUT-1568)
-<!-- * Weighted (Implicit Feedback) ALS -- [MAHOUT-1365](https://issues.apache.org/jira/browse/MAHOUT-1365) -->
-<!--* Data frame R-like bindings -- [MAHOUT-1490](https://issues.apache.org/jira/browse/MAHOUT-1490) -->
-
-* *Your issue here!*
-
-<!-- ## Stuff wanted:
-* Data frame R-like bindings (similarly to linalg bindings)
-* Stat R-like bindings (perhaps we can just adapt to commons.math stat)
-* **BYODMs:** Bring Your Own Distributed Method on SparkBindings!
-* In-core jBlas matrix adapter
-* In-core GPU matrix adapters -->
-
-
-
-
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md b/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md
deleted file mode 100644
index 3cdb8f7..0000000
--- a/website/old_site_migration/needs_work_priority/sparkbindings/play-with-shell.md
+++ /dev/null
@@ -1,199 +0,0 @@
----
-layout: default
-title: Perceptron and Winnow
-theme:
-    name: retro-mahout
----
-# Playing with Mahout's Spark Shell
-
-This tutorial will show you how to play with Mahout's Scala DSL for linear algebra and its Spark shell. **Please keep in mind that this code is still in a very early experimental stage**.
-
-_(Edited for 0.10.2)_
-
-## Intro
-
-We'll use an excerpt of a publicly available [dataset about cereals](http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html). The dataset lists the protein, fat, carbohydrate and sugar content (in milligrams) of a set of cereals, as well as a customer rating for each cereal. Our aim for this example is to fit a linear model which infers the customer rating from the ingredients.
-
-
-Name                    | protein | fat | carbo | sugars | rating
-:-----------------------|:--------|:----|:------|:-------|:---------
-Apple Cinnamon Cheerios | 2       | 2   | 10.5  | 10     | 29.509541
-Cap'n'Crunch            | 1       | 2   | 12    | 12     | 18.042851
-Cocoa Puffs             | 1       | 1   | 12    | 13     | 22.736446
-Froot Loops             | 2       | 1   | 11    | 13     | 32.207582
-Honey Graham Ohs        | 1       | 2   | 12    | 11     | 21.871292
-Wheaties Honey Gold     | 2       | 1   | 16    | 8      | 36.187559
-Cheerios                | 6       | 2   | 17    | 1      | 50.764999
-Clusters                | 3       | 2   | 13    | 7      | 40.400208
-Great Grains Pecan      | 3       | 3   | 13    | 4      | 45.811716
-
-
-## Installing Mahout & Spark on your local machine
-
-We describe how to do a quick toy setup of Spark & Mahout on your local machine, so that you can run this example and play with the shell.
-
- 1. Download [Apache Spark 1.6.2](http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz) and unpack the archive file
- 1. Change to the directory where you unpacked Spark and type ```sbt/sbt assembly``` to build it
- 1. Create a directory for Mahout somewhere on your machine, change to it and check out the master branch of Apache Mahout from GitHub: ```git clone https://github.com/apache/mahout mahout```
- 1. Change to the ```mahout``` directory and build Mahout using ```mvn -DskipTests clean install```
-
-## Starting Mahout's Spark shell
-
- 1. Go to the directory where you unpacked Spark and type ```sbin/start-all.sh``` to start Spark locally
- 1. Open a browser and point it to [http://localhost:8080/](http://localhost:8080/) to check whether Spark started successfully. Copy the URL of the Spark master at the top of the page (it starts with **spark://**)
- 1. Define the following environment variables: <pre class="codehilite">export MAHOUT_HOME=[directory into which you checked out Mahout]
-export SPARK_HOME=[directory where you unpacked Spark]
-export MASTER=[url of the Spark master]
-</pre>
- 1. 
Finally, change to the directory where you checked out Mahout and type ```bin/mahout spark-shell```.
-You should see the shell start and get the prompt ```mahout> ```. Check the
-[FAQ](http://mahout.apache.org/users/sparkbindings/faq.html) for further troubleshooting.
-
-## Implementation
-
-We'll use the shell to interactively play with the data and incrementally implement a simple [linear regression](https://en.wikipedia.org/wiki/Linear_regression) algorithm. Let's first load the dataset. Usually, we wouldn't need Mahout unless we processed a large dataset stored in a distributed filesystem. But for the sake of this example, we'll use our tiny toy dataset and "pretend" it is too big to fit onto a single machine.
-
-*Note: You can incrementally follow the example by copy-and-pasting the code into your running Mahout shell.*
-
-Mahout's linear algebra DSL has an abstraction called *DistributedRowMatrix (DRM)* which models a matrix that is partitioned by rows and stored in the memory of a cluster of machines. We use ```dense()``` to create a dense in-memory matrix from our toy dataset and use ```drmParallelize``` to load it into the cluster, "mimicking" a large, partitioned dataset.
-
-<div class="codehilite"><pre>
-val drmData = drmParallelize(dense(
-  (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
-  (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
-  (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
-  (2, 1, 11,   13, 32.207582),  // Froot Loops
-  (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
-  (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
-  (6, 2, 17,   1,  50.764999),  // Cheerios
-  (3, 2, 13,   7,  40.400208),  // Clusters
-  (3, 3, 13,   4,  45.811716)), // Great Grains Pecan
-  numPartitions = 2);
-</pre></div>
-
-Have a look at this matrix. The first four columns represent the ingredients
-(our features) and the last column (the rating) is the target variable for
-our regression.
[Linear regression](https://en.wikipedia.org/wiki/Linear_regression)
-assumes that the **target variable** `\(\mathbf{y}\)` is generated by the
-linear combination of **the feature matrix** `\(\mathbf{X}\)` with the
-**parameter vector** `\(\boldsymbol{\beta}\)` plus the
- **noise** `\(\boldsymbol{\varepsilon}\)`, summarized in the formula
-`\(\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\)`.
-Our goal is to find an estimate of the parameter vector
-`\(\boldsymbol{\beta}\)` that explains the data well.
-
-As a first step, we extract `\(\mathbf{X}\)` and `\(\mathbf{y}\)` from our data matrix. We get *X* by slicing: we take all rows (denoted by ```::```) and the first four columns, which contain the ingredients in milligrams. Note that the result is again a DRM. The shell will not execute this code yet; it saves the history of operations and defers the execution until we really access a result. **Mahout's DSL automatically optimizes and parallelizes all operations on DRMs and runs them on Apache Spark.**
-
-<div class="codehilite"><pre>
-val drmX = drmData(::, 0 until 4)
-</pre></div>
-
-Next, we extract the target variable vector *y*, the fifth column of the data matrix. We assume this one fits into the memory of our driver machine, so we fetch it using ```collect```:
-
-<div class="codehilite"><pre>
-val y = drmData.collect(::, 4)
-</pre></div>
-
-Now we are ready to think about a mathematical way to estimate the parameter vector *β*. A simple textbook approach is [ordinary least squares (OLS)](https://en.wikipedia.org/wiki/Ordinary_least_squares), which minimizes the sum of squared residuals between the true target variable and the predicted target variable. In OLS, there is even a closed-form expression for estimating `\(\boldsymbol{\beta}\)` as
-`\(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}\)`.
-
-The first thing we compute for this is `\(\mathbf{X}^{\top}\mathbf{X}\)`.
The code for doing this in Mahout's Scala DSL maps directly to the mathematical formula. The operation ```.t``` transposes a matrix and, analogous to R, ```%*%``` denotes matrix multiplication.
-
-<div class="codehilite"><pre>
-val drmXtX = drmX.t %*% drmX
-</pre></div>
-
-The same is true for computing `\(\mathbf{X}^{\top}\mathbf{y}\)`. We can simply type the math as Scala expressions into the shell. Here, *X* lives in the cluster, while *y* is in the memory of the driver, and the result is a DRM again.
-<div class="codehilite"><pre>
-val drmXty = drmX.t %*% y
-</pre></div>
-
-We're nearly done. The next step we take is to fetch `\(\mathbf{X}^{\top}\mathbf{X}\)` and
-`\(\mathbf{X}^{\top}\mathbf{y}\)` into the memory of our driver machine (we are targeting
-feature matrices that are tall and skinny,
-so we can assume that `\(\mathbf{X}^{\top}\mathbf{X}\)` is small enough
-to fit in). Then, we provide them to an in-memory solver (Mahout provides
-an analog to R's ```solve()``` for that) which computes ```beta```, our
-OLS estimate of the parameter vector `\(\boldsymbol{\beta}\)`.
-
-<div class="codehilite"><pre>
-val XtX = drmXtX.collect
-val Xty = drmXty.collect(::, 0)
-
-val beta = solve(XtX, Xty)
-</pre></div>
-
-That's it! We have implemented a distributed linear regression algorithm
-on Apache Spark. I hope you agree that we didn't have to worry a lot about
-parallelization and distributed systems. The goal of Mahout's linear algebra
-DSL is to abstract away the ugliness of programming a distributed system
-as much as possible, while still retaining decent performance and
-scalability.
-
-We can now check how well our model fits its training data.
-First, we multiply the feature matrix `\(\mathbf{X}\)` by our estimate of
-`\(\boldsymbol{\beta}\)`.
Then, we look at the difference (via the L2-norm) between
-the target variable `\(\mathbf{y}\)` and the fitted target variable:
-
-<div class="codehilite"><pre>
-val yFitted = (drmX %*% beta).collect(::, 0)
-(y - yFitted).norm(2)
-</pre></div>
-
-We hope we have shown that Mahout's shell allows people to write algorithms interactively and incrementally. We have entered a lot of individual commands, one by one, until we got the desired results. We can now refactor a little by wrapping our statements into easy-to-use functions. The definition of functions follows standard Scala syntax.
-
-We put all the commands for ordinary least squares into a function ```ols```.
-
-<div class="codehilite"><pre>
-def ols(drmX: DrmLike[Int], y: Vector) =
-  solve(drmX.t %*% drmX, drmX.t %*% y)(::, 0)
-
-</pre></div>
-
-Note that the DSL declares an implicit `collect` if coercion rules require an in-core argument. Hence, we can simply
-skip explicit `collect`s.
-
-Next, we define a function ```goodnessOfFit``` that tells how well a model fits the target variable:
-
-<div class="codehilite"><pre>
-def goodnessOfFit(drmX: DrmLike[Int], beta: Vector, y: Vector) = {
-  val fittedY = (drmX %*% beta).collect(::, 0)
-  (y - fittedY).norm(2)
-}
-</pre></div>
-
-So far we have left out an important aspect of a standard linear regression
-model. Usually there is a constant bias term added to the model. Without
-that, our model always passes through the origin and we can only learn its
-slope. An easy way to add such a bias term to our model is to add a
-column of ones to the feature matrix `\(\mathbf{X}\)`.
-The corresponding weight in the parameter vector will then be the bias term.
-
-Here is how we add a bias column:
-
-<div class="codehilite"><pre>
-val drmXwithBiasColumn = drmX cbind 1
-</pre></div>
-
-Now we can give the newly created DRM ```drmXwithBiasColumn``` to our model fitting method ```ols``` and see how well the resulting model fits the training data with ```goodnessOfFit```.
You should see a large improvement in the result.
-
-<div class="codehilite"><pre>
-val betaWithBiasTerm = ols(drmXwithBiasColumn, y)
-goodnessOfFit(drmXwithBiasColumn, betaWithBiasTerm, y)
-</pre></div>
-
-As a further optimization, we can make use of the DSL's caching functionality. We use ```drmXwithBiasColumn``` repeatedly as input to a computation, so it might be beneficial to cache it in memory. This is achieved by calling ```checkpoint()```. In the end, we remove it from the cache with ```uncache()```:
-
-<div class="codehilite"><pre>
-val cachedDrmX = drmXwithBiasColumn.checkpoint()
-
-val betaWithBiasTerm = ols(cachedDrmX, y)
-val goodness = goodnessOfFit(cachedDrmX, betaWithBiasTerm, y)
-
-cachedDrmX.uncache()
-
-goodness
-</pre></div>
-
-
-Liked what you saw? Check out Mahout's overview for the [Scala and Spark bindings](https://mahout.apache.org/users/sparkbindings/home.html).
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md b/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md
deleted file mode 100644
index 9df07da..0000000
--- a/website/old_site_migration/needs_work_priority/wikipedia-classifier-example.md
+++ /dev/null
@@ -1,57 +0,0 @@
----
-layout: default
-title: Wikipedia XML parser and Naive Bayes Example
-theme:
-    name: retro-mahout
----
-# Wikipedia XML parser and Naive Bayes Classifier Example
-
-## Introduction
-Mahout has an [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1] which will download a recent XML dump of the (entire if desired) [English Wikipedia database](http://dumps.wikimedia.org/enwiki/latest/).
After running the classification script, you can use the [document classification script](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala) from the Mahout [spark-shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html) to vectorize and classify text from outside of the training and testing corpus using a model built on the Wikipedia dataset.
-
-You can run this script to build and test a Naive Bayes classifier for option (1) 10 arbitrary countries or option (2) 2 countries (United States and United Kingdom).
-
-## Overview
-
-To run the example, simply execute the `$MAHOUT_HOME/examples/bin/classify-wikipedia.sh` script.
-
-By default the script is set to run on a medium-sized Wikipedia XML dump. To run on the full set (the entire English Wikipedia) you can change the download by commenting out line 78 and uncommenting line 80 of [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1]. However, this is not recommended unless you have the resources to do so. *Be sure to clean your work directory when changing datasets: option (3).*
-
-The step-by-step process for creating a Naive Bayes classifier for the Wikipedia XML dump is very similar to that for [creating a 20 Newsgroups Classifier](http://mahout.apache.org/users/classification/twenty-newsgroups.html) [4]. The only difference is that instead of running `mahout seqdirectory` on the unzipped 20 Newsgroups file, you'll run `mahout seqwiki` on the unzipped Wikipedia XML dump.
-
-    $ mahout seqwiki
-
-The above command launches `WikipediaToSequenceFile.java`, which accepts a text file of categories [3] and starts an MR job to parse each document in the XML file. This process will seek to extract documents with a Wikipedia category tag which (exactly, if the `--exactMatchOnly` option is set) matches a line in the category file.
If no match is found and the `-all` option is set, the document will be dumped into an "unknown" category. The documents will then be written out as a `<Text,Text>` sequence file of the form (K: /category/document_title, V: document).
-
-There are 3 different example category files available in the /examples/src/test/resources
-directory: country.txt, country10.txt and country2.txt. You can edit these categories to extract a different corpus from the Wikipedia dataset.
-
-The CLI options for `seqwiki` are as follows:
-
-    --input          (-i)         input pathname String
-    --output         (-o)         the output pathname String
-    --categories     (-c)         the file containing the Wikipedia categories
-    --exactMatchOnly (-e)         if set, then the Wikipedia category must match
-                                    exactly instead of simply containing the category string
-    --all            (-all)       if set select all categories
-    --removeLabels   (-rl)        if set, remove [[Category:labels]] from document text after extracting label.
-
-
-After `seqwiki`, the script runs `seq2sparse`, `split`, `trainnb` and `testnb` as in the [step by step 20newsgroups example](http://mahout.apache.org/users/classification/twenty-newsgroups.html). When all of the jobs have finished, a confusion matrix will be displayed.
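As a rough illustration of how the `seqwiki` options listed above fit together, the following bash sketch assembles a command line and prints it without running Mahout. The function name `build_seqwiki_cmd` and all paths are hypothetical placeholders; only the short flags `-i`, `-o`, `-c` and `-e` come from the option list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a `mahout seqwiki` command line built from
# an input dump, an output directory, a category file and an optional
# exact-match switch ("true" adds -e). Nothing is executed.
build_seqwiki_cmd() {
  local input="$1" output="$2" categories="$3" exact="${4:-false}"
  local -a cmd=(mahout seqwiki -i "$input" -o "$output" -c "$categories")
  if [ "$exact" = "true" ]; then
    cmd+=(-e)
  fi
  printf '%s\n' "${cmd[*]}"
}

# Placeholder paths, for illustration only.
build_seqwiki_cmd /tmp/wiki/enwiki.xml /tmp/wiki/seqfiles /tmp/wiki/country10.txt true
# prints: mahout seqwiki -i /tmp/wiki/enwiki.xml -o /tmp/wiki/seqfiles -c /tmp/wiki/country10.txt -e
```

Printing the command first makes it easy to review the chosen options before starting a long-running job on a large dump.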
-
-# Resources
-
-[1] [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh)
-
-[2] [Document classification script for the Mahout Spark Shell](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala)
-
-[3] [Example category file](https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt)
-
-[4] [Step by step instructions for building a Naive Bayes classifier for 20newsgroups from the command line](http://mahout.apache.org/users/classification/twenty-newsgroups.html)
-
-[5] [Mahout MapReduce Naive Bayes](http://mahout.apache.org/users/classification/bayesian.html)
-
-[6] [Mahout Spark Naive Bayes](http://mahout.apache.org/users/algorithms/spark-naive-bayes.html)
-
-[7] [Mahout Scala Spark and H2O Bindings](http://mahout.apache.org/users/sparkbindings/home.html)
-
