http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/algorithms/map-reduce/classification/partial-implementation.md
----------------------------------------------------------------------
diff --git a/website/docs/algorithms/map-reduce/classification/partial-implementation.md b/website/docs/algorithms/map-reduce/classification/partial-implementation.md
new file mode 100644
index 0000000..2a20ccb
--- /dev/null
+++ b/website/docs/algorithms/map-reduce/classification/partial-implementation.md
@@ -0,0 +1,146 @@
+---
+layout: default
+title: Partial Implementation
+theme:
+    name: retro-mahout
+---
+
+
+# Classifying with random forests
+
+<a name="PartialImplementation-Introduction"></a>
+# Introduction
+
+This quick start page shows how to build a decision forest using the
+partial implementation. This tutorial also explains how to use the decision
+forest to classify new data.
+Partial Decision Forests is a MapReduce implementation where each mapper
+builds a subset of the forest using only the data available in its
+partition. This allows building forests using large datasets as long as
+each partition can be loaded into memory.
+
+<a name="PartialImplementation-Steps"></a>
+# Steps
+<a name="PartialImplementation-Downloadthedata"></a>
+## Download the data
+* The current implementation is compatible with the UCI repository file
+format. In this example we'll use the NSL-KDD dataset because it's large
+enough to show the performance of the partial implementation.
+You can download the dataset here: http://nsl.cs.unb.ca/NSL-KDD/
+You can either download the full training set "KDDTrain+.ARFF", or a 20%
+subset "KDDTrain+_20Percent.ARFF" (we'll use the full dataset in this
+tutorial) and the test set "KDDTest+.ARFF".
+* Open the train and test files and remove all the lines that begin with
+'@'. All those lines are at the top of the files. Actually you can keep
+those lines somewhere, because they'll help us describe the dataset to
+Mahout.
+* Put the data in HDFS:
+
+        $HADOOP_HOME/bin/hadoop fs -mkdir testdata
+        $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+
+<a name="PartialImplementation-BuildtheJobfiles"></a>
+## Build the Job files
+* In $MAHOUT_HOME/ run:
+
+        mvn clean install -DskipTests
+
+<a name="PartialImplementation-Generateafiledescriptorforthedataset:"></a>
+## Generate a file descriptor for the dataset:
+Run the following command:
+
+    $HADOOP_HOME/bin/hadoop jar \
+      $MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar \
+      org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff \
+      -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
+
+The "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" string describes all the attributes
+of the data. In this case, it means 1 numerical (N) attribute, followed by
+3 categorical (C) attributes, and so on; L indicates the label. You can also
+use 'I' to ignore some attributes.
+
+<a name="PartialImplementation-Runtheexample"></a>
+## Run the example
+
+
+    $HADOOP_HOME/bin/hadoop jar \
+      $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar \
+      org.apache.mahout.classifier.df.mapreduce.BuildForest \
+      -Dmapred.max.split.size=1874231 -d testdata/KDDTrain+.arff \
+      -ds testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest
+
+This builds 100 trees (-t argument) using the partial implementation (-p).
+Each tree is built using 5 randomly selected attributes per node (-sl
+argument) and the example outputs the decision forest in the "nsl-forest"
+directory (-o).
+The number of partitions is controlled by the -Dmapred.max.split.size
+argument, which tells Hadoop the maximum size of each partition, in this
+case 1/10 of the size of the dataset. Thus 10 partitions will be used.
+IMPORTANT: using fewer partitions should give better classification results,
+but needs a lot of memory. So if the jobs are failing, try increasing the
+number of partitions.
+* The example outputs the build time and the oob error estimate:
+
+
+    10/03/13 17:57:29 INFO mapreduce.BuildForest: Build Time: 0h 7m 43s 582
+    10/03/13 17:57:33 INFO mapreduce.BuildForest: oob error estimate : 0.002325895231517865
+    10/03/13 17:57:33 INFO mapreduce.BuildForest: Storing the forest in: nsl-forest/forest.seq
+
+
+<a name="PartialImplementation-UsingtheDecisionForesttoClassifynewdata"></a>
+## Using the Decision Forest to Classify new data
+Run the following command:
+
+    $HADOOP_HOME/bin/hadoop jar \
+      $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar \
+      org.apache.mahout.classifier.df.mapreduce.TestForest \
+      -i testdata/KDDTest+.arff -ds testdata/KDDTrain+.info \
+      -m nsl-forest -a -mr -o predictions
+
+This will compute the predictions for the "KDDTest+.arff" dataset (-i argument)
+using the same data descriptor generated for the training dataset (-ds) and
+the decision forest built previously (-m). Optionally (if the test dataset
+contains the labels of the tuples) run the analyzer to compute the
+confusion matrix (-a), and you can also store the predictions in a text
+file or a directory of text files (-o). Passing the (-mr) parameter will use
+Hadoop to distribute the classification.
+
+* The example should output the classification time and the confusion
+matrix:
+
+
+    10/03/13 18:08:56 INFO mapreduce.TestForest: Classification Time: 0h 0m 6s 355
+    10/03/13 18:08:56 INFO mapreduce.TestForest:
+    =======================================================
+    Summary
+    -------------------------------------------------------
+    Correctly Classified Instances          :      17657       78.3224%
+    Incorrectly Classified Instances        :       4887       21.6776%
+    Total Classified Instances              :      22544
+
+    =======================================================
+    Confusion Matrix
+    -------------------------------------------------------
+    a       b       <--Classified as
+    9459    252      |  9711      a     = normal
+    4635    8198     |  12833     b     = anomaly
+    Default Category: unknown: 2
+
+
+If the input is a single file then the output will be a single text file;
+in the above example 'predictions' would be one single file. If the input
+is a directory containing, for example, two files 'a.data' and 'b.data', then
+the output will be a directory 'predictions' containing two files
+'a.data.out' and 'b.data.out'.
+
+<a name="PartialImplementation-KnownIssuesandlimitations"></a>
+## Known Issues and limitations
+The "Decision Forest" code is still "a work in progress"; many features are
+still missing. Here is a list of some known issues:
+* For now, the training does not support multiple input files. The input
+dataset must be one single file (this support will be available with the upcoming release).
+Classifying new data does support multiple
+input files.
+* The tree building is done when each mapper.close() method is called.
+Because the mappers don't refresh their state, the job can fail when the
+dataset is big and you try to build a large number of trees.
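The summary percentages in the sample TestForest output on this page can be re-derived from its confusion matrix, which is a handy sanity check when reading such logs. The following is a minimal Python sketch, assuming the matrix has been copied by hand into a nested list; the function name `summarize` is hypothetical and is not part of Mahout.

```python
def summarize(confusion):
    """Return (correct, incorrect, total, accuracy) for a square confusion matrix.

    confusion[i][j] = number of instances of true class i classified as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    incorrect = total - correct
    return correct, incorrect, total, correct / total


# Values copied from the sample output: rows are the true classes
# (normal, anomaly), columns are the predicted classes.
matrix = [[9459, 252],
          [4635, 8198]]

correct, incorrect, total, accuracy = summarize(matrix)
print(f"Correctly classified:   {correct} ({accuracy:.4%})")
print(f"Incorrectly classified: {incorrect} ({1 - accuracy:.4%})")
print(f"Total classified:       {total}")
```

The diagonal holds the correctly classified instances, so the accuracy is simply the diagonal sum divided by the grand total.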
http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/algorithms/map-reduce/classification/random-forests.md
----------------------------------------------------------------------
diff --git a/website/docs/algorithms/map-reduce/classification/random-forests.md b/website/docs/algorithms/map-reduce/classification/random-forests.md
new file mode 100644
index 0000000..c8b1a47
--- /dev/null
+++ b/website/docs/algorithms/map-reduce/classification/random-forests.md
@@ -0,0 +1,234 @@
+---
+layout: default
+title: Random Forests
+theme:
+    name: retro-mahout
+---
+
+<a name="RandomForests-HowtogrowaDecisionTree"></a>
+### How to grow a Decision Tree
+
+Source: \[3\]
+
+    LearnUnprunedTree(X, Y)
+
+    Input:  X, a matrix of R rows and M columns where X_ij = the value of the
+            j'th attribute in the i'th input datapoint. Each column consists
+            of either all real values or all categorical values.
+    Input:  Y, a vector of R elements, where Y_i = the output class of the
+            i'th datapoint. The Y_i values are categorical.
+    Output: an unpruned decision tree
+
+    If all records in X have identical values in all their attributes (this
+    includes the case where R < 2), return a Leaf Node predicting the
+    majority output, breaking ties randomly.
+    If all values in Y are the same, return a Leaf Node predicting this value
+    as the output.
+    Else
+        select m variables at random out of the M variables
+        For j = 1 .. m
+            If j'th attribute is categorical
+                IG_j = IG(Y|X_j)     (see Information Gain)
+            Else (j'th attribute is real-valued)
+                IG_j = IG*(Y|X_j)    (see Information Gain)
+        Let j* = argmax_j IG_j       (this is the splitting attribute we'll use)
+        If j* is categorical then
+            For each value v of the j*'th attribute
+                Let X^v = subset of rows of X in which X_ij* = v.
+                Let Y^v = corresponding subset of Y
+                Let Child^v = LearnUnprunedTree(X^v, Y^v)
+            Return a decision tree node, splitting on the j*'th attribute. The
+            number of children equals the number of values of the j*'th
+            attribute, and the v'th child is Child^v
+        Else j* is real-valued and let t be the best split threshold
+            Let X^LO = subset of rows of X in which X_ij* < t.
+            Let Y^LO = corresponding subset of Y
+            Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
+            Let X^HI = subset of rows of X in which X_ij* >= t.
+            Let Y^HI = corresponding subset of Y
+            Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
+            Return a decision tree node, splitting on the j*'th attribute. It
+            has two children corresponding to whether the j*'th attribute is
+            below or above the given threshold.
+
+*Note*: There are alternatives to Information Gain for splitting nodes.
+
+
+<a name="RandomForests-Informationgain"></a>
+### Information gain
+
+Source: \[3\]
+
+#### Nominal attributes
+
+Suppose X can have one of m values V<sub>1</sub>, V<sub>2</sub>, ..., V<sub>m</sub>, with
+P(X=V<sub>1</sub>)=p<sub>1</sub>, P(X=V<sub>2</sub>)=p<sub>2</sub>, ..., P(X=V<sub>m</sub>)=p<sub>m</sub>.
+
+`\[H(X) = -\sum_{j=1}^{m} p_j \log_2 p_j \quad \text{(the entropy of } X\text{)}\]`
+
+H(Y\|X=v) is the entropy of Y among only those records in which X has value
+v.
+
+`\[H(Y \mid X) = \sum_{j} p_j \, H(Y \mid X=V_j)\]`
+
+`\[IG(Y \mid X) = H(Y) - H(Y \mid X)\]`
+
+#### Real-valued attributes
+
+Suppose X is real-valued. Define
+
+`\[H(Y \mid X:t) = H(Y \mid X<t)\,P(X<t) + H(Y \mid X \ge t)\,P(X \ge t)\]`
+
+`\[IG(Y \mid X:t) = H(Y) - H(Y \mid X:t)\]`
+
+`\[IG^{*}(Y \mid X) = \max_{t} IG(Y \mid X:t)\]`
+
+<a name="RandomForests-HowtogrowaRandomForest"></a>
+### How to grow a Random Forest
+
+Source: \[1\]
+
+Each tree is grown as follows:
+1. If the number of cases in the training set is *N*, sample *N* cases at
+random, but with replacement, from the original data. This sample will be
+the training set for growing the tree.
+1. If there are *M* input variables, a number *m << M* is specified such
+that at each node, *m* variables are selected at random out of the *M* and
+the best split on these *m* is used to split the node. The value of *m* is
+held constant during the forest growing.
+1. Each tree is grown to the largest extent possible. There is no pruning.
+
+<a name="RandomForests-RandomForestparameters"></a>
+### Random Forest parameters
+
+Source: \[2\]
+
+Random Forests are easy to use: the only two parameters a user of the
+technique has to determine are the number of trees to be used and the
+number of variables (*m*) to be randomly selected from the available set of
+variables.
+Breiman's recommendations are to pick a large number of trees, as well as
+the square root of the number of variables for *m*.
+
+
+<a name="RandomForests-Howtopredictthelabelofacase"></a>
+### How to predict the label of a case
+
+    Classify(node, V)
+
+    Input:  node from the decision tree; if node.attribute = j then the
+            split is done on the j'th attribute
+    Input:  V, a vector of M columns where V_j = the value of the j'th
+            attribute
+    Output: label of V
+
+    If node is a Leaf then
+        Return the value predicted by node
+    Else
+        Let j = node.attribute
+        If j is categorical then
+            Let v = V_j
+            Let child^v = child node corresponding to the attribute's value v
+            Return Classify(child^v, V)
+        Else j is real-valued
+            Let t = node.threshold (split threshold)
+            If V_j < t then
+                Let child^LO = child node corresponding to (< t)
+                Return Classify(child^LO, V)
+            Else
+                Let child^HI = child node corresponding to (>= t)
+                Return Classify(child^HI, V)
+
+
+<a name="RandomForests-Theoutofbag(oob)errorestimation"></a>
+### The out of bag (oob) error estimation
+
+Source: \[1\]
+
+In random forests, there is no need for cross-validation or a separate test
+set to get an unbiased estimate of the test set error. It is estimated
+internally, during the run, as follows:
+* Each tree is constructed using a different bootstrap sample from the
+original data. About one-third of the cases are left out of the bootstrap
+sample and not used in the construction of the _kth_ tree.
+* Put each case left out in the construction of the _kth_ tree down the
+_kth_ tree to get a classification. In this way, a test set classification
+is obtained for each case in about one-third of the trees. At the end of
+the run, take *j* to be the class that got most of the votes every time
+case *n* was _oob_. The proportion of times that *j* is not equal to the
+true class of *n* averaged over all cases is the _oob error estimate_. This
+has proven to be unbiased in many tests.
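The "about one-third" figure follows from sampling with replacement: a given case is missed by all *N* draws with probability (1 - 1/N)^N, which approaches 1/e (roughly 0.368) as *N* grows. The following is a minimal Python sketch that checks this empirically using only the standard library. The function name `oob_fraction`, the sample size, and the number of repetitions are illustrative choices and are not taken from Mahout.

```python
import math
import random


def oob_fraction(n_cases, rng):
    """Draw one bootstrap sample of size n_cases and return the share of cases left out."""
    # Set of case indices that were picked at least once by the bootstrap.
    drawn = {rng.randrange(n_cases) for _ in range(n_cases)}
    return 1 - len(drawn) / n_cases


rng = random.Random(42)  # fixed seed so that runs are repeatable
n_cases = 10_000
fractions = [oob_fraction(n_cases, rng) for _ in range(20)]  # one value per "tree"
mean_oob = sum(fractions) / len(fractions)

print(f"mean out-of-bag fraction: {mean_oob:.3f}")
print(f"theoretical limit 1/e:    {1 / math.e:.3f}")
```

Each case therefore ends up out-of-bag for roughly a third of the trees, which is what lets the forest score every case on trees that never saw it.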
+
+<a name="RandomForests-OtherRFuses"></a>
+### Other RF uses
+
+Source: \[1\]
+* variable importance
+* gini importance
+* proximities
+* scaling
+* prototypes
+* missing values replacement for the training set
+* missing values replacement for the test set
+* detecting mislabeled cases
+* detecting outliers
+* detecting novelties
+* unsupervised learning
+* balancing prediction error
+
+Please refer to \[1\] for a detailed description.
+
+<a name="RandomForests-References"></a>
+### References
+
+\[1\] Random Forests - Classification Description
+[http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)
+
+\[2\] B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention
+and Profitability by Using Random Forests and Regression Forests
+Techniques," Working Papers of Faculty of
+Economics and Business Administration, Ghent University, Belgium 04/282,
+Ghent University, Faculty of Economics and
+Business Administration.
+Available online: [http://ideas.repec.org/p/rug/rugwps/04-282.html](http://ideas.repec.org/p/rug/rugwps/04-282.html)
+
+\[3\] Decision Trees - Andrew W. Moore
+[http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)
+
+\[4\] Information Gain - Andrew W. Moore
+[http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)
http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/algorithms/map-reduce/classification/restricted-boltzmann-machines.md
----------------------------------------------------------------------
diff --git a/website/docs/algorithms/map-reduce/classification/restricted-boltzmann-machines.md b/website/docs/algorithms/map-reduce/classification/restricted-boltzmann-machines.md
new file mode 100644
index 0000000..0aa8641
--- /dev/null
+++ b/website/docs/algorithms/map-reduce/classification/restricted-boltzmann-machines.md
@@ -0,0 +1,49 @@
+---
+layout: default
+title: Restricted Boltzmann Machines
+theme:
+    name: retro-mahout
+---
+
+NOTE: This implementation is a Work-In-Progress, at least till September,
+2010.
+
+The JIRA issue is [here](https://issues.apache.org/jira/browse/MAHOUT-375).
+
+<a name="RestrictedBoltzmannMachines-BoltzmannMachines"></a>
+### Boltzmann Machines
+Boltzmann Machines are a type of stochastic neural network that closely
+resembles physical processes. They define a network of units with an overall
+energy that is evolved over a period of time, until it reaches thermal
+equilibrium.
+
+However, the convergence speed of Boltzmann machines that have
+unconstrained connectivity is low.
+
+<a name="RestrictedBoltzmannMachines-RestrictedBoltzmannMachines"></a>
+### Restricted Boltzmann Machines
+Restricted Boltzmann Machines are a variant that is 'restricted' in the
+sense that connections between hidden units of a single layer are _not_
+allowed. In addition, stacking multiple RBMs is also feasible, with the
+activities of the hidden units forming the base for a higher-level RBM. The
+combination of these two features renders RBMs highly usable for
+parallelization.
+
+In the Netflix Prize, RBMs offered distinctly orthogonal predictions to
+SVD and k-NN approaches, and contributed immensely to the final solution.
+
+<a name="RestrictedBoltzmannMachines-RBM'sinApacheMahout"></a>
+### RBMs in Apache Mahout
+An implementation of Restricted Boltzmann Machines is being developed for
+Apache Mahout as a Google Summer of Code 2010 project. A recommender
+interface will also be provided. The key aims of the implementation are:
+1. Accurate - should replicate known results, including those of the Netflix
+Prize
+1. Fast - the implementation uses Map-Reduce, hence it should be fast
+1. Scale - should scale to large datasets, with a design whose critical
+parts don't need a dependency between the amount of memory on your cluster
+systems and the size of your dataset
+
+You can view the patch as it develops [here](http://github.com/sisirkoppaka/mahout-rbm/compare/trunk...rbm).
http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/algorithms/map-reduce/classification/support-vector-machines.md
----------------------------------------------------------------------
diff --git a/website/docs/algorithms/map-reduce/classification/support-vector-machines.md b/website/docs/algorithms/map-reduce/classification/support-vector-machines.md
new file mode 100644
index 0000000..6d1b9df
--- /dev/null
+++ b/website/docs/algorithms/map-reduce/classification/support-vector-machines.md
@@ -0,0 +1,43 @@
+---
+layout: default
+title: Support Vector Machines
+theme:
+    name: retro-mahout
+---
+
+<a name="SupportVectorMachines-SupportVectorMachines"></a>
+# Support Vector Machines
+
+As with Naive Bayes, Support Vector Machines (or SVMs for short) can be used
+to solve the task of assigning objects to classes. However, the way this
+task is solved is completely different from the setting in Naive Bayes.
+
+Each object is considered to be a point in _n_-dimensional feature space,
+_n_ being the number of features used to describe the objects numerically.
+In addition each object is assigned a binary label; let us assume the
+labels are "positive" and "negative". During learning, the algorithm tries
+to find a hyperplane in that space that perfectly separates positive from
+negative objects.
+It is trivial to think of settings where this might very well be
+impossible. To remedy this situation, objects can be assigned so-called
+slack terms that punish mistakes made during learning appropriately. That
+way, the algorithm is forced to find the hyperplane that causes the least
+number of mistakes.
+
+Another way to overcome the problem of there being no linear hyperplane to
+separate positive from negative objects is to simply project each feature
+vector into a higher-dimensional feature space and search for a linear
+separating hyperplane in that new space. Usually the main problem with
+learning in high-dimensional feature spaces is the so-called curse of
+dimensionality. That is, there are fewer learning examples available than
+free parameters to tune. In the case of SVMs this problem is less
+detrimental, as SVMs impose additional structural constraints on their
+solutions. Each separating hyperplane needs to have a maximal margin to all
+training examples. In addition, that way, the solution may be based on the
+information encoded in only very few examples.
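The slack terms and the margin requirement described above can be made concrete with the standard soft-margin objective for a linear separator, 0.5 * ||w||^2 + C * (sum of slacks). The page does not spell this formula out, so treat it as an assumption. The following is a minimal Python sketch; the function name `soft_margin_objective`, the weights, and the toy points are hypothetical, and nothing here is taken from Mahout code.

```python
def soft_margin_objective(w, b, points, labels, c=1.0):
    """Return (objective, slacks) for the linear separator w.x + b.

    Assumes the usual soft-margin form: 0.5 * ||w||^2 + c * sum(slacks),
    where each slack is max(0, 1 - y * (w.x + b)).
    """
    slacks = []
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Slack is zero when the point lies on the correct side, outside the margin.
        slacks.append(max(0.0, 1.0 - y * score))
    margin_term = 0.5 * sum(wi * wi for wi in w)
    return margin_term + c * sum(slacks), slacks


points = [(2.0, 2.0), (1.0, 3.0), (-1.0, -2.0), (0.5, 0.0)]
labels = [1, 1, -1, -1]  # +1 = "positive", -1 = "negative"

objective, slacks = soft_margin_objective(w=(0.5, 0.5), b=0.0,
                                          points=points, labels=labels)
print("slack per point:", slacks)
print("objective:", objective)
```

Only the last point, which sits on the wrong side of the hyperplane, receives a non-zero slack. This illustrates how a solution can end up depending on only a few examples.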
+
+<a name="SupportVectorMachines-Strategyforparallelization"></a>
+## Strategy for parallelization
+
+<a name="SupportVectorMachines-Designofpackages"></a>
+## Design of packages
http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/algorithms/map-reduce/index.md
----------------------------------------------------------------------
diff --git a/website/docs/algorithms/map-reduce/index.md b/website/docs/algorithms/map-reduce/index.md
new file mode 100644
index 0000000..0e55a79
--- /dev/null
+++ b/website/docs/algorithms/map-reduce/index.md
@@ -0,0 +1,42 @@
+---
+layout: page
+title: Deprecated Map Reduce Algorithms
+theme:
+    name: mahout2
+---
+
+### Classification
+
+[Bayesian](classification/bayesian.html)
+
+[Class Discovery](classification/class-discovery.html)
+
+[Classifying Your Data](classification/classifyingyourdata.html)
+
+[Collocations](classification/collocations.html)
+
+[Gaussian Discriminative Analysis](classification/gaussian-discriminative-analysis.html)
+
+[Hidden Markov Models](classification/hidden-markov-models.html)
+
+[Independent Component Analysis](classification/independent-component-analysis.html)
+
+[Locally Weighted Linear Regression](classification/locally-weighted-linear-regression.html)
+
+[Logistic Regression](classification/logistic-regression.html)
+
+[Mahout Collections](classification/mahout-collections.html)
+
+[Multilayer Perceptron](classification/mlp.html)
+
+[Naive Bayes](classification/naivebayes.html)
+
+[Neural Network](classification/neural-network.html)
+
+[Partial Implementation](classification/partial-implementation.html)
+
+[Random Forests](classification/random-forests.html)
+
+[Restricted Boltzmann Machines](classification/restricted-boltzmann-machines.html)
+
+[Support Vector Machines](classification/support-vector-machines.html)
http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/distributed/spark-bindings.md
---------------------------------------------------------------------- diff --git a/website/docs/distributed/spark-bindings.md b/website/docs/distributed/spark-bindings.md deleted file mode 100644 index 97f00c9..0000000 --- a/website/docs/distributed/spark-bindings.md +++ /dev/null @@ -1,101 +0,0 @@ ---- -layout: page -title: Mahout Samsara Spark Bindings -theme: - name: mahout2 ---- - -# Scala & Spark Bindings: -*Bringing algebraic semantics* - -## What is Scala & Spark Bindings? - -In short, Scala & Spark Bindings for Mahout is Scala DSL and algebraic optimizer of something like this (actual formula from **(d)spca**) - - -`\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]` - -bound to in-core and distributed computations (currently, on Apache Spark). - - -Mahout Scala & Spark Bindings expression of the above: - - val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi) - -The main idea is that a scientist writing algebraic expressions cannot care less of distributed -operation plans and works **entirely on the logical level** just like he or she would do with R. - -Another idea is decoupling logical expression from distributed back-end. As more back-ends are added, -this implies **"write once, run everywhere"**. - -The linear algebra side works with scalars, in-core vectors and matrices, and Mahout Distributed -Row Matrices (DRMs). - -The ecosystem of operators is built in the R's image, i.e. it follows R naming such as %*%, -colSums, nrow, length operating over vectors or matices. - -Important part of Spark Bindings is expression optimizer. It looks at expression as a whole -and figures out how it can be simplified, and which physical operators should be picked. For example, -there are currently about 5 different physical operators performing DRM-DRM multiplication -picked based on matrix geometry, distributed dataset partitioning, orientation etc. 
-If we count in DRM by in-core combinations, that would be another 4, i.e. 9 total -- all of it for just -simple x %*% y logical notation. - - - -Please refer to the documentation for details. - -## Status - -This environment addresses mostly R-like Linear Algebra optmizations for -Spark, Flink and H20. - - -## Documentation - -* Scala and Spark bindings manual: [web](http://apache.github.io/mahout/doc/ScalaSparkBindings.html), [pdf](ScalaSparkBindings.pdf) -* Overview blog on 0.10.x releases: [blog](http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html) - -## Distributed methods and solvers using Bindings - -* In-core ([ssvd]) and Distributed ([dssvd]) Stochastic SVD -- guinea pigs -- see the bindings manual -* In-core ([spca]) and Distributed ([dspca]) Stochastic PCA -- guinea pigs -- see the bindings manual -* Distributed thin QR decomposition ([dqrThin]) -- guinea pig -- see the bindings manual -* [Current list of algorithms](https://mahout.apache.org/users/basics/algorithms.html) - -[ssvd]: https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala -[spca]: https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala -[dssvd]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSSVD.scala -[dspca]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSPCA.scala -[dqrThin]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DQR.scala - - -## Related history of note - -* CLI and Driver for Spark version of item similarity -- [MAHOUT-1541](https://issues.apache.org/jira/browse/MAHOUT-1541) -* Command line interface for generalizable Spark pipelines -- [MAHOUT-1569](https://issues.apache.org/jira/browse/MAHOUT-1569) -* Cooccurrence 
Analysis / Item-based Recommendation -- [MAHOUT-1464](https://issues.apache.org/jira/browse/MAHOUT-1464) -* Spark Bindings -- [MAHOUT-1346](https://issues.apache.org/jira/browse/MAHOUT-1346) -* Scala Bindings -- [MAHOUT-1297](https://issues.apache.org/jira/browse/MAHOUT-1297) -* Interactive Scala & Spark Bindings Shell & Script processor -- [MAHOUT-1489](https://issues.apache.org/jira/browse/MAHOUT-1489) -* OLS tutorial using Mahout shell -- [MAHOUT-1542](https://issues.apache.org/jira/browse/MAHOUT-1542) -* Full abstraction of DRM apis and algorithms from a distributed engine -- [MAHOUT-1529](https://issues.apache.org/jira/browse/MAHOUT-1529) -* Port Naive Bayes -- [MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493) - -## Work in progress -* Text-delimited files for input and output -- [MAHOUT-1568](https://issues.apache.org/jira/browse/MAHOUT-1568) -<!-- * Weighted (Implicit Feedback) ALS -- [MAHOUT-1365](https://issues.apache.org/jira/browse/MAHOUT-1365) --> -<!--* Data frame R-like bindings -- [MAHOUT-1490](https://issues.apache.org/jira/browse/MAHOUT-1490) --> - -* *Your issue here!* - -<!-- ## Stuff wanted: -* Data frame R-like bindings (similarly to linalg bindings) -* Stat R-like bindings (perhaps we can just adapt to commons.math stat) -* **BYODMs:** Bring Your Own Distributed Method on SparkBindings! -* In-core jBlas matrix adapter -* In-core GPU matrix adapters --> - - - - \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx ---------------------------------------------------------------------- diff --git a/website/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx b/website/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx new file mode 100644 index 0000000..ec1de04 Binary files /dev/null and b/website/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx differ
