http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/collocations.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/collocations.md b/website/old_site_migration/completed/classification/collocations.md new file mode 100644 index 0000000..f107850 --- /dev/null +++ b/website/old_site_migration/completed/classification/collocations.md @@ -0,0 +1,385 @@ +--- +layout: default +title: Collocations +theme: + name: retro-mahout +--- + + + +<a name="Collocations-CollocationsinMahout"></a> +# Collocations in Mahout + +A collocation is defined as a sequence of words or terms which co-occur +more often than would be expected by chance. Statistically relevant +combinations of terms identify additional lexical units which can be +treated as features in a vector-based representation of a text. A detailed +discussion of collocations can be found on [Wikipedia](http://en.wikipedia.org/wiki/Collocation). + +See there for a more detailed discussion of collocations in the [Reuters example](http://comments.gmane.org/gmane.comp.apache.mahout.user/5685). + +<a name="Collocations-Log-LikelihoodbasedCollocationIdentification"></a> +## Theory behind implementation: Log-Likelihood based Collocation Identification + +Mahout provides an implementation of a collocation identification algorithm +which scores collocations using log-likelihood ratio. The log-likelihood +score indicates the relative usefulness of a collocation with regards other +term combinations in the text. Collocations with the highest scores in a +particular corpus will generally be more useful as features. + +Calculating the LLR is very straightforward and is described concisely in +[Ted Dunning's blog post](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html) +. Ted describes the series of counts reqired to calculate the LLR for two +events A and B in order to determine if they co-occur more often than pure +chance. These counts include the number of times the events co-occur (k11), +the number of times the events occur without each other (k12 and k21), and +the number of times anything occurs. These counts are summarized in the +following table: + +<table> +<tr><td> </td><td> Event A </td><td> Everything but Event A </td></tr> +<tr><td> Event B </td><td> A and B together (k11) </td><td> B but not A (k12) </td></tr> +<tr><td> Everything but Event B </td><td> A but not B (k21) </td><td> Neither B nor A (k22) </td></tr> +</table> + +For the purposes of collocation identification, it is useful to begin by +thinking in word pairs, bigrams. In this case the leading or head term from +the pair corresponds to A from the table above, B corresponds to the +trailing or tail term, while neither B nor A is the total number of word +pairs in the corpus less those containing B, A or both B and A. + +Given the word pair of 'oscillation overthruster', the Log-Likelihood ratio +is computed by looking at the number of occurences of that word pair in the +corpus, the number of word pairs that begin with 'oscillation' but end with +something other than 'overthruster', the number of word pairs that end with +'overthruster' begin with something other than 'oscillation' and the number +of word pairs in the corpus that contain neither 'oscillation' and +overthruster. + +This can be extended from bigrams to trigrams, 4-grams and beyond. In these +cases, the current algorithm uses the first token of the ngram as the head +of the ngram and the remaining n-1 tokens from the ngram, the n-1gram as it +were, as the tail. Given the trigram 'hong kong cavaliers', 'hong' is +treated as the head while 'kong cavaliers' is treated as the tail. Future +versions of this algorithm will allow for variations in which tokens of the +ngram are treated as the head and tail. + +Beyond ngrams, it is often useful to inspect cases where individual words +occur around other interesting features of the text such as sentence +boundaries. + +<a name="Collocations-GeneratingNGrams"></a> +## Generating NGrams + +The tools that the collocation identification algorithm are embeeded within +either consume tokenized text as input or provide the ability to specify an +implementation of the Lucene Analyzer class perform tokenization in order +to form ngrams. The tokens are passed through a Lucene ShingleFilter to +produce NGrams of the desired length. + +Given the text "Alice was beginning to get very tired" as an example, +Lucene's StandardAnalyzer produces the tokens 'alice', 'beginning', 'get', +'very' and 'tired', while the ShingleFilter with a max NGram size set to 3 +produces the shingles 'alice beginning', 'alice beginning get', 'beginning +get', 'beginning get very', 'get very', 'get very tired' and 'very tired'. +Note that both bigrams and trigrams are produced here. A future enhancement +to the existing algorithm would involve limiting the output to a particular +gram size as opposed to solely specifiying a max ngram size. + +<a name="Collocations-RunningtheCollocationIdentificationAlgorithm."></a> +## Running the Collocation Identification Algorithm. + +There are a couple ways to run the llr-based collocation algorithm in +mahout + +<a name="Collocations-Whencreatingvectorsfromasequencefile"></a> +### When creating vectors from a sequence file + +The llr collocation identifier is integrated into the process that is used +to create vectors from sequence files of text keys and values. Collocations +are generated when the --maxNGramSize (-ng) option is not specified and +defaults to 2 or is set to a number of 2 or greater. The --minLLR option +can be used to control the cutoff that prevents collocations below the +specified LLR score from being emitted, and the --minSupport argument can +be used to filter out collocations that appear below a certain number of +times. + + + bin/mahout seq2sparse + + Usage: + [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize <chunkSize> + --output <output> --input <input> --minDF <minDF> + --maxDFPercent<maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR> + --numReducers <numReducers> --maxNGramSize <ngramSize> --overwrite --help + --sequentialAccessVector] + Options + + --minSupport (-s) minSupport (Optional) Minimum Support. Default Value: 2 + + --analyzerName (-a) analyzerName The class name of the analyzer + + --chunkSize (-chunk) chunkSize The chunkSize in MegaBytes. 100-10000MB + + --output (-o) output The output directory + + --input (-i) input Input dir containing the documents in sequence file format + + --minDF (-md) minDF The minimum document frequency. Default is 1 + + --maxDFPercent (-x) maxDFPercent The max percentage of docs for the DF. Can be used to remove + really high frequency terms. Expressed as an + integer between 0 and 100. Default is 99. + + --weight (-wt) weight The kind of weight to use. Currently TF + or TFIDF + + --norm (-n) norm The norm to use, expressed as either a + float or "INF" if you want to use the + Infinite norm. Must be greater orequal + to 0. The default is not to normalize + + --minLLR (-ml) minLLR (Optional)The minimum Log Likelihood + Ratio(Float) Default is 1.0 + + --numReducers (-nr) numReducers (Optional) Number of reduce tasks. + Default Value: 1 + + --maxNGramSize (-ng) ngramSize (Optional) The maximum size of ngrams to + create (2 = bigrams, 3 = trigrams, etc) + Default Value:2 + + --overwrite (-w) If set, overwrite the output directory + --help (-h) Print out help + --sequentialAccessVector (-seq) (Optional) Whether output vectors should + be SequentialAccessVectors If set true + else false + + +<a name="Collocations-CollocDriver"></a> +### CollocDriver + + + bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver + + Usage: + [--input <input> --output <output> --maxNGramSize <ngramSize> --overwrite + --minSupport <minSupport> --minLLR <minLLR> --numReducers <numReducers> + --analyzerName <analyzerName> --preprocess --unigram --help] + + Options + + --input (-i) input The Path for input files. + + --output (-o) output The Path write output to + + --maxNGramSize (-ng) ngramSize (Optional) The maximum size of ngramsto + create (2 = bigrams, 3 = trigrams,etc) + Default Value:2 + + --overwrite (-w) If set, overwrite the outputdirectory + + --minSupport (-s) minSupport (Optional) Minimum Support. Default + Value: 2 + + --minLLR (-ml) minLLR (Optional)The minimum Log Likelihood + Ratio(Float) Default is 1.0 + + --numReducers (-nr) numReducers (Optional) Number of reduce tasks. + Default Value: 1 + + --analyzerName (-a) analyzerName The class name of the analyzer + + --preprocess (-p) If set, input is SequenceFile<Text,Text> + where the value is the document, which + will be tokenized using the specified + analyzer. + + --unigram (-u) If set, unigrams will be emitted inthe + final output alongside collocations + + --help (-h) Print out help + + +<a name="Collocations-Algorithmdetails"></a> +## Algorithm details + +This section describes the implementation of the collocation identification +algorithm in terms of the map-reduce phases that are used to generate +ngrams and count the frequencies required to perform the log-likelihood +calculation. Unless otherwise noted, classes that are indicated in +CamelCase can be found in the mahout-utils module under the package +org.apache.mahout.utils.nlp.collocations.llr + +The algorithm is implemented in two map-reduce passes: + +<a name="Collocations-Pass1:CollocDriver.generateCollocations(...)"></a> +### Pass 1: CollocDriver.generateCollocations(...) + +Generates NGrams and counts frequencies for ngrams, head and tail subgrams. + +<a name="Collocations-Map:CollocMapper"></a> +#### Map: CollocMapper + +Input k: Text (documentId), v: StringTuple (tokens) + +Each call to the mapper passes in the full set of tokens for the +corresponding document using a StringTuple. The ShingleFilter is run across +these tokens to produce ngrams of the desired length. ngrams and +frequencies are collected across the entire document. + +Once this is done, ngrams are split into head and tail portions. A key of type GramKey is generated which is used later to join ngrams with their heads and tails in the reducer phase. The GramKey is a composite key made up of a string n-gram fragement as the primary key and a secondary key used for grouping and sorting in the reduce phase. The secondary key will either be EMPTY in the case where we are collecting either the head or tail of an ngram as the value or it will contain the byte[](.html) + form of the ngram when collecting an ngram as the value. + + + head_key(EMPTY) -> (head subgram, head frequency) + + head_key(ngram) -> (ngram, ngram frequency) + + tail_key(EMPTY) -> (tail subgram, tail frequency) + + tail_key(ngram) -> (ngram, ngram frequency) + + +subgram and ngram values are packaged in Gram objects. + +For each ngram found, the Count.NGRAM_TOTAL counter is incremented. When +the pass is complete, this counter will hold the total number of ngrams +encountered in the input which is used as a part of the LLR calculation. + +Output k: GramKey (head or tail subgram), v: Gram (head, tail or ngram with +frequency) + +<a name="Collocations-Combiner:CollocCombiner"></a> +#### Combiner: CollocCombiner + +Input k: GramKey, v:Gram (as above) + +This phase merges the counts for unique ngrams or ngram fragments across +multiple documents. The combiner treats the entire GramKey as the key and +as such, identical tuples from separate documents are passed into a single +call to the combiner's reduce method, their frequencies are summed and a +single tuple is passed out via the collector. + +Output k: GramKey, v:Gram + +<a name="Collocations-Reduce:CollocReducer"></a> +#### Reduce: CollocReducer + +Input k: GramKey, v: Gram (as above) + +The CollocReducer employs the Hadoop secondary sort strategy to avoid +caching ngram tuples in memory in order to calculate total ngram and +subgram frequencies. The GramKeyPartitioner ensures that tuples with the +same primary key are sent to the same reducer while the +GramKeyGroupComparator ensures that iterator provided by the reduce method +first returns the subgram and then returns ngram values grouped by ngram. +This eliminates the need to cache the values returned by the iterator in +order to calculate total frequencies for both subgrams and ngrams. There +input will consist of multiple frequencies for each (subgram_key, subgram) +or (subgram_key, ngram) tuple; one from each map task executed in which the +particular subgram was found. +The input will be traversed in the following order: + + + (head subgram, frequency 1) + (head subgram, frequency 2) + ... + (head subgram, frequency N) + (ngram 1, frequency 1) + (ngram 1, frequency 2) + ... + (ngram 1, frequency N) + (ngram 2, frequency 1) + (ngram 2, frequency 2) + ... + (ngram 2, frequency N) + ... + (ngram N, frequency 1) + (ngram N, frequency 2) + ... + (ngram N, frequency N) + + +Where all of the ngrams above share the same head. Data is presented in the +same manner for the tail subgrams. + +As the values for a subgram or ngram are traversed, frequencies are +accumulated. Once all values for a subgram or ngram are processed the +resulting key/value pairs are passed to the collector as long as the ngram +frequency is equal to or greater than the specified minSupport. When an +ngram is skipped in this way the Skipped.LESS_THAN_MIN_SUPPORT counter to +be incremented. + +Pairs are passed to the collector in the following format: + + + ngram, ngram frequency -> subgram subgram frequency + + +In this manner, the output becomes an unsorted version of the following: + + + ngram 1, frequency -> ngram 1 head, head frequency + ngram 1, frequency -> ngram 1 tail, tail frequency + ngram 2, frequency -> ngram 2 head, head frequency + ngram 2, frequency -> ngram 2 tail, tail frequency + ngram N, frequency -> ngram N head, head frequency + ngram N, frequency -> ngram N tail, tail frequency + + +Output is in the format k:Gram (ngram, frequency), v:Gram (subgram, +frequency) + +<a name="Collocations-Pass2:CollocDriver.computeNGramsPruneByLLR(...)"></a> +### Pass 2: CollocDriver.computeNGramsPruneByLLR(...) + +Pass 1 has calculated full frequencies for ngrams and subgrams, Pass 2 +performs the LLR calculation. + +<a name="Collocations-MapPhase:IdentityMapper(org.apache.hadoop.mapred.lib.IdentityMapper)"></a> +#### Map Phase: IdentityMapper (org.apache.hadoop.mapred.lib.IdentityMapper) + +This phase is a no-op. The data is passed through unchanged. The rest of +the work for llr calculation is done in the reduce phase. + +<a name="Collocations-ReducePhase:LLRReducer"></a> +#### Reduce Phase: LLRReducer + +Input is k:Gram, v:Gram (as above) + +This phase receives the head and tail subgrams and their frequencies for +each ngram (with frequency) produced for the input: + + + ngram 1, frequency -> ngram 1 head, frequency; ngram 1 tail, frequency + ngram 2, frequency -> ngram 2 head, frequency; ngram 2 tail, frequency + ... + ngram 1, frequency -> ngram N head, frequency; ngram N tail, frequency + + +It also reads the full ngram count obtained from the first pass, passed in +as a configuration option. The parameters to the llr calculation are +calculated as follows: + +k11 = f_n +k12 = f_h - f_n +k21 = f_t - f_n +k22 = N - ((f_h + f_t) - f_n) + +Where f_n is the ngram frequency, f_h and f_t the frequency of head and +tail and N is the total number of ngrams. + +Tokens with a llr below that of the specified minimum llr are dropped and +the Skipped.LESS_THAN_MIN_LLR counter is incremented. + +Output is k: Text (ngram), v: DoubleWritable (llr score) + +<a name="Collocations-Unigrampass-through."></a> +### Unigram pass-through. + +By default in seq2sparse, or if the -u option is provided to the +CollocDriver, unigrams (single tokens) will be passed through the job and +each token's frequency will be calculated. As with ngrams, unigrams are +subject to filtering with minSupport and minLLR. +
http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/gaussian-discriminative-analysis.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/gaussian-discriminative-analysis.md b/website/old_site_migration/completed/classification/gaussian-discriminative-analysis.md new file mode 100644 index 0000000..e8a54af --- /dev/null +++ b/website/old_site_migration/completed/classification/gaussian-discriminative-analysis.md @@ -0,0 +1,20 @@ +--- +layout: default +title: Gaussian Discriminative Analysis +theme: + name: retro-mahout +--- + +<a name="GaussianDiscriminativeAnalysis-GaussianDiscriminativeAnalysis"></a> +# Gaussian Discriminative Analysis + +Gaussian Discriminative Analysis is a tool for multigroup classification +based on extending linear discriminant analysis. The paper on the approach +is located at http://citeseer.ist.psu.edu/4617.html (note, for some reason +the paper is backwards, in that page 1 is at the end) + +<a name="GaussianDiscriminativeAnalysis-Parallelizationstrategy"></a> +## Parallelization strategy + +<a name="GaussianDiscriminativeAnalysis-Designofpackages"></a> +## Design of packages http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/hidden-markov-models.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/hidden-markov-models.md b/website/old_site_migration/completed/classification/hidden-markov-models.md new file mode 100644 index 0000000..7321493 --- /dev/null +++ b/website/old_site_migration/completed/classification/hidden-markov-models.md @@ -0,0 +1,102 @@ +--- +layout: default +title: Hidden Markov Models +theme: + name: retro-mahout +--- + +# Hidden Markov Models + +<a name="HiddenMarkovModels-IntroductionandUsage"></a> +## Introduction and Usage + +Hidden Markov Models are used in multiple areas of Machine Learning, such +as speech recognition, handwritten letter recognition or natural language +processing. + +<a name="HiddenMarkovModels-FormalDefinition"></a> +## Formal Definition + +A Hidden Markov Model (HMM) is a statistical model of a process consisting +of two (in our case discrete) random variables O and Y, which change their +state sequentially. The variable Y with states \{y_1, ... , y_n\} is called +the "hidden variable", since its state is not directly observable. The +state of Y changes sequentially with a so called - in our case first-order +- Markov Property. This means, that the state change probability of Y only +depends on its current state and does not change in time. Formally we +write: P(Y(t+1)=y_i|Y(0)...Y(t)) = P(Y(t+1)=y_i|Y(t)) = P(Y(2)=y_i|Y(1)). +The variable O with states \{o_1, ... , o_m\} is called the "observable +variable", since its state can be directly observed. O does not have a +Markov Property, but its state probability depends statically on the +current state of Y. + +Formally, an HMM is defined as a tuple M=(n,m,P,A,B), where n is the number of hidden states, m is the number of observable states, P is an n-dimensional vector containing initial hidden state probabilities, A is the nxn-dimensional "transition matrix" containing the transition probabilities such that A\[i,j\](i,j\.html) +=P(Y(t)=y_i|Y(t-1)=y_j) and B is the mxn-dimensional "emission matrix" +containing the observation probabilities such that B\[i,j\]= +P(O=o_i|Y=y_j). + +<a name="HiddenMarkovModels-Problems"></a> +## Problems + +Rabiner \[1\](1\.html) + defined three main problems for HMM models: + +1. Evaluation: Given a sequence O of observations and a model M, what is +the probability P(O|M) that sequence O was generated by model M. The +Evaluation problem can be efficiently solved using the Forward algorithm +2. Decoding: Given a sequence O of observations and a model M, what is +the most likely sequence Y*=argmax(Y) P(O|M,Y) of hidden variables to +generate this sequence. The Decoding problem can be efficiently solved +using the Viterbi algorithm. +3. Learning: Given a sequence O of observations, what is the most likely +model M*=argmax(M)P(O|M) to generate this sequence. The Learning problem +can be efficiently solved using the Baum-Welch algorithm. + +<a name="HiddenMarkovModels-Example"></a> +## Example + +To build a Hidden Markov Model and use it to build some predictions, try a simple example like this: + +Create an input file to train the model. Here we have a sequence drawn from the set of states 0, 1, 2, and 3, separated by space characters. + + $ echo "0 1 2 2 2 1 1 0 0 3 3 3 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 2 2 2 3 3 3 3 3 3 2 3 2 3 2 3 2 1 3 0 0 0 1 0 1 0 2 1 2 1 2 1 2 3 3 3 3 2 2 3 2 1 1 0" > hmm-input + +Now run the baumwelch job to train your model, after first setting MAHOUT_LOCAL to true, to use your local file system. + + $ export MAHOUT_LOCAL=true + $ $MAHOUT_HOME/bin/mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4 -e .0001 -m 1000 + +Output like the following should appear in the console. + + Initial probabilities: + 0 1 2 + 1.0 0.0 3.5659361683006626E-251 + Transition matrix: + 0 1 2 + 0 6.098919959130616E-5 0.9997275322964165 2.1147850399214744E-4 + 1 7.404648706054873E-37 0.9086408633885092 0.09135913661149081 + 2 0.2284374545687356 7.01786289571088E-11 0.7715625453610858 + Emission matrix: + 0 1 2 3 + 0 0.9999997858591223 2.0536163836449762E-39 2.1414087769942127E-7 1.052441093535389E-27 + 1 7.495656581383351E-34 0.2241269055449904 0.4510889999455847 0.32478409450942497 + 2 0.815051477991782 0.18494852200821799 8.465660634827592E-33 2.8603899591778015E-36 + 14/03/22 09:52:21 INFO driver.MahoutDriver: Program took 180 ms (Minutes: 0.003) + +The model trained with the input set now is in the file 'hmm-model', which we can use to build a predicted sequence. + + $ $MAHOUT_HOME/bin/mahout hmmpredict -m hmm-model -o hmm-predictions -l 10 + +To see the predictions: + + $ cat hmm-predictions + 0 1 3 3 2 2 2 2 1 2 + + +<a name="HiddenMarkovModels-Resources"></a> +## Resources + +\[1\] + Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models +and selected applications in speech recognition". Proceedings of the IEEE +77 (2): 257-286. doi:10.1109/5.18626. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/independent-component-analysis.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/independent-component-analysis.md b/website/old_site_migration/completed/classification/independent-component-analysis.md new file mode 100644 index 0000000..6035b54 --- /dev/null +++ b/website/old_site_migration/completed/classification/independent-component-analysis.md @@ -0,0 +1,17 @@ +--- +layout: default +title: Independent Component Analysis +theme: + name: retro-mahout +--- + +<a name="IndependentComponentAnalysis-IndependentComponentAnalysis"></a> +# Independent Component Analysis + +See also: Principal Component Analysis. + +<a name="IndependentComponentAnalysis-Parallelizationstrategy"></a> +## Parallelization strategy + +<a name="IndependentComponentAnalysis-Designofpackages"></a> +## Design of packages http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/locally-weighted-linear-regression.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/locally-weighted-linear-regression.md b/website/old_site_migration/completed/classification/locally-weighted-linear-regression.md new file mode 100644 index 0000000..7b23d85 --- /dev/null +++ b/website/old_site_migration/completed/classification/locally-weighted-linear-regression.md @@ -0,0 +1,25 @@ +--- +layout: default +title: Locally Weighted Linear Regression +theme: + name: retro-mahout +--- + +<a name="LocallyWeightedLinearRegression-LocallyWeightedLinearRegression"></a> +# Locally Weighted Linear Regression + +Model-based methods, such as SVM, Naive Bayes and the mixture of Gaussians, +use the data to build a parameterized model. After training, the model is +used for predictions and the data are generally discarded. In contrast, +"memory-based" methods are non-parametric approaches that explicitly retain +the training data, and use it each time a prediction needs to be made. +Locally weighted regression (LWR) is a memory-based method that performs a +regression around a point of interest using only training data that are +"local" to that point. Source: +http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/cohn96a-html/node7.html + +<a name="LocallyWeightedLinearRegression-Strategyforparallelregression"></a> +## Strategy for parallel regression + +<a name="LocallyWeightedLinearRegression-Designofpackages"></a> +## Design of packages http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/logistic-regression.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/logistic-regression.md b/website/old_site_migration/completed/classification/logistic-regression.md new file mode 100644 index 0000000..b066fda --- /dev/null +++ b/website/old_site_migration/completed/classification/logistic-regression.md @@ -0,0 +1,129 @@ +--- +layout: default +title: Logistic Regression +theme: + name: retro-mahout +--- + +<a name="LogisticRegression-LogisticRegression(SGD)"></a> +# Logistic Regression (SGD) + +Logistic regression is a model used for prediction of the probability of +occurrence of an event. It makes use of several predictor variables that +may be either numerical or categories. + +Logistic regression is the standard industry workhorse that underlies many +production fraud detection and advertising quality and targeting products. +The Mahout implementation uses Stochastic Gradient Descent (SGD) to all +large training sets to be used. + +For a more detailed analysis of the approach, have a look at the [thesis of +Paul Komarek](http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en) [1]. + +See MAHOUT-228 for the main JIRA issue for SGD. + +A more detailed overview of the Mahout Linear Regression classifier and [detailed discription of building a Logistic Regression classifier](http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/) for the classic [Iris flower dataset](http://en.wikipedia.org/wiki/Iris_flower_data_set) is also available [2]. + +An example of training a Logistic Regression classifier for the [UCI Bank Marketing Dataset](http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing) can be found [on the Mahout website](http://mahout.apache.org/users/classification/bankmarketing-example.html) [3]. + +An example of training and testing a Logistic Regression document classifier for the classic [20 newsgroups corpus](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) [4] is also available. + +<a name="LogisticRegression-Parallelizationstrategy"></a> +## Parallelization strategy + +The bad news is that SGD is an inherently sequential algorithm. The good +news is that it is blazingly fast and thus it is not a problem for Mahout's +implementation to handle training sets of tens of millions of examples. +With the down-sampling typical in many data-sets, this is equivalent to a +dataset with billions of raw training examples. + +The SGD system in Mahout is an online learning algorithm which means that +you can learn models in an incremental fashion and that you can do +performance testing as your system runs. Often this means that you can +stop training when a model reaches a target level of performance. The SGD +framework includes classes to do on-line evaluation using cross validation +(the CrossFoldLearner) and an evolutionary system to do learning +hyper-parameter optimization on the fly (the AdaptiveLogisticRegression). +The AdaptiveLogisticRegression system makes heavy use of threads to +increase machine utilization. The way it works is that it runs 20 +CrossFoldLearners in separate threads, each with slightly different +learning parameters. As better settings are found, these new settings are +propagating to the other learners. + +<a name="LogisticRegression-Designofpackages"></a> +## Design of packages + +There are three packages that are used in Mahout's SGD system. These +include + +* The vector encoding package (found in org.apache.mahout.vectorizer.encoders) + +* The SGD learning package (found in org.apache.mahout.classifier.sgd) + +* The evolutionary optimization system (found in org.apache.mahout.ep) + +<a name="LogisticRegression-Featurevectorencoding"></a> +## Feature vector encoding + +Because the SGD algorithms need to have fixed length feature vectors and +because it is a pain to build a dictionary ahead of time, most SGD +applications use the hashed feature vector encoding system that is rooted +at FeatureVectorEncoder. + +The basic idea is that you create a vector, typically a +RandomAccessSparseVector, and then you use various feature encoders to +progressively add features to that vector. The size of the vector should +be large enough to avoid feature collisions as features are hashed. + +There are specialized encoders for a variety of data types. You can +normally encode either a string representation of the value you want to +encode or you can encode a byte level representation to avoid string +conversion. In the case of ContinuousValueEncoder and +ConstantValueEncoder, it is also possible to encode a null value and pass +the real value in as a weight. This avoids numerical parsing entirely in +case you are getting your training data from a system like Avro. + +Here is a class diagram for the encoders package: + + + +<a name="LogisticRegression-SGDLearning"></a> +## SGD Learning + +For the simplest applications, you can construct an +OnlineLogisticRegression and be off and running. Typically, though, it is +nice to have running estimates of performance on held out data. To do +that, you should use a CrossFoldLearner which keeps a stable of five (by +default) OnlineLogisticRegression objects. Each time you pass a training +example to a CrossFoldLearner, it passes this example to all but one of its +children as training and passes the example to the last child to evaluate +current performance. The children are used for evaluation in a round-robin +fashion so, if you are using the default 5 way split, all of the children +get 80% of the training data for training and get 20% of the data for +evaluation. + +To avoid the pesky need to configure learning rates, regularization +parameters and annealing schedules, you can use the +AdaptiveLogisticRegression. This class maintains a pool of +CrossFoldLearners and adapts learning rates and regularization on the fly +so that you don't have to. + +Here is a class diagram for the classifiers.sgd package. As you can see, +the number of twiddlable knobs is pretty large. For some examples, see the +[TrainNewsGroups](https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainNewsGroups.java) example code. + + + +## References + +[1] [Thesis of +Paul Komarek](http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en) + +[2] [An Introduction To Mahout's Logistic Regression SGD Classifier](http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/) + +## Examples + +[3] [SGD Bank Marketing Example](http://mahout.apache.org/users/classification/bankmarketing-example.html) + +[4] [SGD 20 newsgroups classification](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) + http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/mahout-collections.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/mahout-collections.md b/website/old_site_migration/completed/classification/mahout-collections.md new file mode 100644 index 0000000..99f22f6 --- /dev/null +++ b/website/old_site_migration/completed/classification/mahout-collections.md @@ -0,0 +1,60 @@ +--- +layout: default +title: mahout-collections +theme: + name: retro-mahout +--- + +# Mahout collections + +<a name="mahout-collections-Introduction"></a> +## Introduction + +The Mahout Collections library is a set of container classes that address +some limitations of the standard collections in Java. [This presentation](http://domino.research.ibm.com/comm/research_people.nsf/pages/sevitsky.pubs.html/$FILE/oopsla08%20memory-efficient%20java%20slides.pdf) + describes a number of performance problems with the standard collections. + +Mahout collections addresses two of the more glaring: the lack of support +for primitive types and the lack of open hashing. + +<a name="mahout-collections-PrimitiveTypes"></a> +## Primitive Types + +The most visible feature of Mahout Collections is the large collection of +primitive type collections. Given Java's asymmetrical support for the +primitive types, the only efficient way to handle them is with many +classes. So, there are ArrayList-like containers for all of the primitive +types, and hash maps for all the useful combinations of primitive type and +object keys and values. + +These classes do not, in general, implement interfaces from *java.util*. +Even when the *java.util* interfaces could be type-compatible, they tend +to include requirements that are not consistent with efficient use of +primitive types. + +<a name="mahout-collections-OpenAddressing"></a> +# Open Addressing + +All of the sets and maps in Mahout Collections are open-addressed hash +tables. Open addressing has a much smaller memory footprint than chaining. +Since the purpose of these collections is to avoid the memory cost of +autoboxing, open addressing is a consistent design choice. + +<a name="mahout-collections-Sets"></a> +## Sets + +Mahout Collections includes open hash sets. Unlike *java.util*, a set is +not a recycled hash table; the sets are separately implemented and do not +have any additional storage usage for unused keys. + +<a name="mahout-collections-CreditwhereCreditisdue"></a> +# Credit where Credit is due + +The implementation of Mahout Collections is derived from [Cern Colt](http://acs.lbl.gov/~hoschek/colt/) +. + + + + + + http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/mlp.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/mlp.md b/website/old_site_migration/completed/classification/mlp.md new file mode 100644 index 0000000..a98f033 --- /dev/null +++ b/website/old_site_migration/completed/classification/mlp.md @@ -0,0 +1,172 @@ +--- +layout: default +title: Multilayer Perceptron +theme: + name: retro-mahout +--- + +Multilayer Perceptron +===================== + +A multilayer perceptron is a biologically inspired feed-forward network that can +be trained to represent a nonlinear mapping between input and output data. It +consists of multiple layers, each containing multiple artificial neuron units and +can be used for classification and regression tasks in a supervised learning approach. + +Command line usage +------------------ + +The MLP implementation is currently located in the MapReduce-Legacy package. It +can be used with the following commands: + + +# model training + $ bin/mahout org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron +# model usage + $ bin/mahout org.apache.mahout.classifier.mlp.RunMultilayerPerceptron + + +To train and use the model, a number of parameters can be specified. Parameters without default values have to be specified by the user. Consider that not all parameters can be used both for training and running the model. We give an example of the usage below. + +### Parameters + +| Command | Default | Description | Type | +|:---------|---------:|:-------------|:---------| +| --input -i | | Path to the input data (currently, only .csv-files are allowed) | | +| --skipHeader -sh | false | Skip first row of the input file (corresponds to the csv headers)| | +|--update -u | false | Whether the model should be updated incrementally with every new training instance. If this parameter is not given, the model is trained from scratch. | training | +| --labels -labels | | Instance labels separated by whitespaces. | training | +| --model -mo | | Location where the model will be stored / is stored (if the specified location has an existing model, it will update the model through incremental learning). | | +| --layerSize -ls | | Number of units per layer, including input, hidden and ouput layers. This parameter specifies the topology of the network (see [this image][mlp] for an example specified by `-ls 4 8 3`). | training | +| --squashingFunction -sf| Sigmoid | The squashing function to use for the units. Currently only the sigmoid fucntion is available. | training | +| --learningRate -l | 0.5 | The learning rate that is used for weight updates. | training | +| --momemtumWeight -m | 0.1 | The momentum weight that is used for gradient descent. Must be in the range between 0 ... 1.0 | training | +| --regularizationWeight -r | 0 | Regularization value for the weight vector. Must be in the range between 0 ... 0.1 | training | +| --format -f | csv | Input file format. Currently only csv is supported. | | +|--columnRange -cr | | Range of the columns to use from the input file, starting with 0 (i.e. `-cr 0 5` for including the first six columns only) | testing | +| --output -o | | Path to store the labeled results from running the model. | testing | + +Example usage +------------- + +In this example, we will train a multilayer perceptron for classification on the iris data set. The iris flower data set contains data of three flower species where each datapoint consists of four features. +The dimensions of the data set are given through some flower parameters (sepal length, sepal width, ...). All samples contain a label that indicates the flower species they belong to. + +### Training + +To train our multilayer perceptron model from the command line, we call the following command + + + $ bin/mahout org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron \ + -i ./mrlegacy/src/test/resources/iris.csv -sh \ + -labels setosa versicolor virginica \ + -mo /tmp/model.model -ls 4 8 3 -l 0.2 -m 0.35 -r 0.0001 + + +The individual parameters are explained in the following. + +- `-i ./mrlegacy/src/test/resources/iris.csv` use the iris data set as input data +- `-sh` since the file `iris.csv` contains a header row, this row needs to be skipped +- `-labels setosa versicolor virginica` we specify, which class labels should be learnt (which are the flower species in this case) +- `-mo /tmp/model.model` specify where to store the model file +- `-ls 4 8 3` we specify the structure and depth of our layers. The actual network structure can be seen in the figure below. +- `-l 0.2` we set the learning rate to `0.2` +- `-m 0.35` momemtum weight is set to `0.35` +- `-r 0.0001` regularization weight is set to `0.0001` + +| | | +|---|---| +| The picture shows the architecture defined by the above command. The topolgy of the network is completely defined through the number of layers and units because in this implementation of the MLP every unit is fully connected to the units of the next and previous layer. Bias units are added automatically. | ![Multilayer perceptron network][mlp] | + +[mlp]: mlperceptron_structure.png "Architecture of a three-layer MLP" +### Testing + +To test / run the multilayer perceptron classification on the trained model, we can use the following command + + + $ bin/mahout org.apache.mahout.classifier.mlp.RunMultilayerPerceptron \ + -i ./mrlegacy/src/test/resources/iris.csv -sh -cr 0 3 \ + -mo /tmp/model.model -o /tmp/labelResult.txt + + +The individual parameters are explained in the following. + +- `-i ./mrlegacy/src/test/resources/iris.csv` use the iris data set as input data +- `-sh` since the file `iris.csv` contains a header row, this row needs to be skipped +- `-cr 0 3` we specify the column range of the input file +- `-mo /tmp/model.model` specify where the model file is stored +- `-o /tmp/labelResult.txt` specify where the labeled output file will be stored + +Implementation +-------------- + +The Multilayer Perceptron implementation is based on a more general Neural Network class. Command line support was added later on and provides a simple usage of the MLP as shown in the example. It is implemented to run on a single machine using stochastic gradient descent where the weights are updated using one datapoint at a time, resulting in a weight update of the form: +$$ \vec{w}^{(t + 1)} = \vec{w}^{(t)} - n \Delta E_n(\vec{w}^{(t)}) $$ + +where *a* is the activation of the unit. It is not yet possible to change the learning to more advanced methods using adaptive learning rates yet. + +The number of layers and units per layer can be specified manually and determines the whole topology with each unit being fully connected to the previous layer. A bias unit is automatically added to the input of every layer. +Currently, the logistic sigmoid is used as a squashing function in every hidden and output layer. It is of the form: + +$$ \frac{1}{1 + exp(-a)} $$ + +The command line version **does not perform iterations** which leads to bad results on small datasets. Another restriction is, that the CLI version of the MLP only supports classification, since the labels have to be given explicitly when executing on the command line. + +A learned model can be stored and updated with new training instanced using the `--update` flag. Output of classification reults is saved as a .txt-file and only consists of the assigned labels. Apart from the command-line interface, it is possible to construct and compile more specialized neural networks using the API and interfaces in the mrlegacy package. + + +Theoretical Background +------------------------- + +The *multilayer perceptron* was inspired by the biological structure of the brain where multiple neurons are connected and form columns and layers. Perceptual input enters this network through our sensory organs and is then further processed into higher levels. +The term multilayer perceptron is a little misleading since the *perceptron* is a special case of a single *artificial neuron* that can be used for simple computations [\[1\]][1]. The difference is that the perceptron uses a discontinous nonlinearity while for the MLP neurons that are implemented in mahout it is important to use continous nonlinearities. This is necessary for the implemented learning algorithm, where the error is propagated back from the output layer to the input layer and the weights of the connections are changed according to their contribution to the overall error. This algorithm is called backpropagation and uses gradient descent to update the weights. To compute the gradients we need continous nonlinearities. But let's start from the beginning! + +The first layer of the MLP represents the input and has no other purpose than routing the input to every connected unit in a feed-forward fashion. Following layers are called hidden layers and the last layer serves the special purpose to determine the output. The activation of a unit *u* in a hidden layer is computed through a weighted sum of all inputs, resulting in +$$ a_j = \sum_{i=1}^{D} w_{ji}^{(l)} x_i + w_{j0}^{(l)} $$ +This computes the activation *a* for neuron *j* where *w* is the weight from neuron *i* to neuron *j* in layer *l*. The last part, where *i = 0* is called the bias and can be used as an offset, independent from the input. + +The activation is then transformed by the aforementioned differentiable, nonlinear *activation function* and serves as the input to the next layer. The activation function is usually chosen from the family of sigmoidal functions such as *tanh* or *logistic sigmoidal* [\[2\]][2]. Often sigmoidal and logistic sigmoidal are used synonymous. Another word for the activation function is *squashing function* since the s-shape of this function class *squashes* the input. + +For different units or layers, different activation functions can be used to obtain different behaviors. Especially in the output layer, the activation function can be chosen to obtain the output value *y*, depending on the learning problem: +$$ y_k = \sigma (a_k) $$ + +If the learning problem is a linear regression task, sigma can be chosen to be the identity function. In case of classification problems, the choice of the squashing functions depends on the exact task at hand and often softmax activation functions are used. + +The equation for a MLP with three layers (one input, one hidden and one output) is then given by + +$$ y_k(\vec{x}, \vec{w}) = h \left( \sum_{j=1}^{M} w_{kj}^{(2)} h \left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right) $$ + +where *h* indicates the respective squashing function that is used in the units of a layer. *M* and *D* specify the number of incoming connections to a unit and we can see that the input to the first layer (hidden layer) is just the original input *x* whereas the input into the second layer (output layer) is the transformed output of layer one. The output *y* of unit *k* is therefore given by the above equation and depends on the input *x* and the weight vector *w*. This shows us, that the parameter that we can optimize during learning is *w* since we can not do anything about the input *x*. To facilitate the following steps, we can include the bias-terms into the weight vector and correct for the indices by adding another dimension with the value 1 to the input vector. The bias is a constant factor that is added to the weighted sum and that serves as a scaling factor of the nonlinear transformation. Including it into the weight vector leads to: + +$$ y_k(\vec{x}, \vec{w}) = h \left( \sum_{j=0}^{M} w_{kj}^{(2)} h \left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right) $$ + +The previous paragraphs described how the MLP transforms a given input into some output using a combination of different nonlinear functions. Of course what we really want is to learn the structure of our data so that we can feed data with unknown labels into the network and get the estimated target labels *t*. To achieve this, we have to train our network. In this context, training means optimizing some function such that the error between the real labels *y* and the network-output *t* becomes smallest. We have seen in the previous pragraph, that our only knob to change is the weight vector *w*, making the function to be optimized a function of *w*. For simplicitly and because it is widely used, we choose the so called *sum-of-squares* error function as an example that is given by + +$$ E(\vec{w}) = \frac{1}{2} \sum_{n=1}^N \left( y(\vec{x}_n, \vec{w}) - t_n \right)^2 $$ + +The goal is to minimize this function and thereby increase the performance of our model. A common method to achieve this is to use gradient descent and the so called technique of *backpropagation* where the goal is to compute the contribution of every unit to the overall error and changing the weight according to this contribution and into the direction of the gradient of the error function at this particular unit. In the following we try to give a short overview of the model training with gradient descent and backpropagation. A more detailed example can be found in [\[3\]][3] where much of this information is taken from. + +The problem with minimizing the error function is that the error can only be computed at the output layers where we get *t*, but we want to update all the weights of all the units. Therefore we use the technique of backpropagation to propagate the error, that we first compute at the output layer, back to the units of the previous layers. For this approach we also need to compute the gradients of the activation function. + +Weights are then updated with a small step in the direction of the negative gradient, regulated by the learning rate *n* such that we arrive at the formula for weight update: + +$$ \vec{w}^{(t + 1)} = \vec{w}^{(t)} - n \Delta E(\vec{w}^{(t)}) $$ + +A momentum weight can be set as a parameter of the gradient descent method to increase the probability of finding better local or global optima of the error function. + + + + + +[1]: http://en.wikipedia.org/wiki/Perceptron "The perceptron in wikipedia" +[2]: http://en.wikipedia.org/wiki/Sigmoid_function "Sigmoid function on wikipedia" +[3]: http://research.microsoft.com/en-us/um/people/cmbishop/prml/ "Christopher M. Bishop: Pattern Recognition and Machine Learning, Springer 2009" + +References + +\[1\] http://en.wikipedia.org/wiki/Perceptron + +\[2\] http://en.wikipedia.org/wiki/Sigmoid_function + +\[3\] [Christopher M. Bishop: Pattern Recognition and Machine Learning, Springer 2009](http://research.microsoft.com/en-us/um/people/cmbishop/prml/) + http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/naivebayes.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/naivebayes.md b/website/old_site_migration/completed/classification/naivebayes.md new file mode 100644 index 0000000..a697653 --- /dev/null +++ b/website/old_site_migration/completed/classification/naivebayes.md @@ -0,0 +1,45 @@ +--- +layout: default +title: NaiveBayes +theme: + name: retro-mahout +--- + +<a name="NaiveBayes-NaiveBayes"></a> +# Naive Bayes + +Naive Bayes is an algorithm that can be used to classify objects into +usually binary categories. It is one of the most common learning algorithms +in spam filters. Despite its simplicity and rather naive assumptions it has +proven to work surprisingly well in practice. + +Before applying the algorithm, the objects to be classified need to be +represented by numerical features. In the case of e-mail spam each feature +might indicate whether some specific word is present or absent in the mail +to classify. The algorithm comes in two phases: Learning and application. +During learning, a set of feature vectors is given to the algorithm, each +vector labeled with the class the object it represents, belongs to. From +that it is deduced which combination of features appears with high +probability in spam messages. Given this information, during application +one can easily compute the probability of a new message being either spam +or not. + +The algorithm does make several assumptions, that are not true for most +datasets, but make computations easier. The worst probably being, that all +features of an objects are considered independent. In practice, that means, +given the phrase "Statue of Liberty" was already found in a text, does not +influence the probability of seeing the phrase "New York" as well. + +<a name="NaiveBayes-StrategyforaparallelNaiveBayes"></a> +## Strategy for a parallel Naive Bayes + +See [https://issues.apache.org/jira/browse/MAHOUT-9](https://issues.apache.org/jira/browse/MAHOUT-9) +. + + +<a name="NaiveBayes-Examples"></a> +## Examples + +[20Newsgroups](20newsgroups.html) + - Example code showing how to train and use the Naive Bayes classifier +using the 20 Newsgroups data available at [http://people.csail.mit.edu/jrennie/20Newsgroups/] http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/neural-network.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/neural-network.md b/website/old_site_migration/completed/classification/neural-network.md new file mode 100644 index 0000000..7180656 --- /dev/null +++ b/website/old_site_migration/completed/classification/neural-network.md @@ -0,0 +1,22 @@ +--- +layout: default +title: Neural Network +theme: + name: retro-mahout +--- + +<a name="NeuralNetwork-NeuralNetworks"></a> +# Neural Networks + +Neural Networks are a means for classifying multi dimensional objects. We +concentrate on implementing back propagation networks with one hidden layer +as these networks have been covered by the [2006 NIPS map reduce paper](http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf) +. Those networks are capable of learning not only linear separating hyper +planes but arbitrary decision boundaries. + +<a name="NeuralNetwork-Strategyforparallelbackpropagationnetwork"></a> +## Strategy for parallel backpropagation network + + +<a name="NeuralNetwork-Designofimplementation"></a> +## Design of implementation http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/partial-implementation.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/partial-implementation.md b/website/old_site_migration/completed/classification/partial-implementation.md new file mode 100644 index 0000000..2a20ccb --- /dev/null +++ b/website/old_site_migration/completed/classification/partial-implementation.md @@ -0,0 +1,146 @@ +--- +layout: default +title: Partial Implementation +theme: + name: retro-mahout +--- + + +# Classifying with random forests + +<a name="PartialImplementation-Introduction"></a> +# Introduction + +This quick start page shows how to build a decision forest using the +partial implementation. This tutorial also explains how to use the decision +forest to classify new data. +Partial Decision Forests is a mapreduce implementation where each mapper +builds a subset of the forest using only the data available in its +partition. This allows building forests using large datasets as long as +each partition can be loaded in-memory. + +<a name="PartialImplementation-Steps"></a> +# Steps +<a name="PartialImplementation-Downloadthedata"></a> +## Download the data +* The current implementation is compatible with the UCI repository file +format. In this example we'll use the NSL-KDD dataset because its large +enough to show the performances of the partial implementation. +You can download the dataset here http://nsl.cs.unb.ca/NSL-KDD/ +You can either download the full training set "KDDTrain+.ARFF", or a 20% +subset "KDDTrain+_20Percent.ARFF" (we'll use the full dataset in this +tutorial) and the test set "KDDTest+.ARFF". +* Open the train and test files and remove all the lines that begin with +'@'. All those lines are at the top of the files. Actually you can keep +those lines somewhere, because they'll help us describe the dataset to +Mahout +* Put the data in HDFS: {code} +$HADOOP_HOME/bin/hadoop fs -mkdir testdata +$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata{code} + +<a name="PartialImplementation-BuildtheJobfiles"></a> +## Build the Job files +* In $MAHOUT_HOME/ run: {code}mvn clean install -DskipTests{code} + +<a name="PartialImplementation-Generateafiledescriptorforthedataset:"></a> +## Generate a file descriptor for the dataset: +run the following command: + + $HADOOP_HOME/bin/hadoop jar +$MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar +org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff +-f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L + +The "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" string describes all the attributes +of the data. In this cases, it means 1 numerical(N) attribute, followed by +3 Categorical(C) attributes, ...L indicates the label. You can also use 'I' +to ignore some attributes + +<a name="PartialImplementation-Runtheexample"></a> +## Run the example + + + $HADOOP_HOME/bin/hadoop jar +$MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar +org.apache.mahout.classifier.df.mapreduce.BuildForest +-Dmapred.max.split.size=1874231 -d testdata/KDDTrain+.arff -ds +testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest + +which builds 100 trees (-t argument) using the partial implementation (-p). +Each tree is built using 5 random selected attribute per node (-sl +argument) and the example outputs the decision tree in the "nsl-forest" +directory (-o). +The number of partitions is controlled by the -Dmapred.max.split.size +argument that indicates to Hadoop the max. size of each partition, in this +case 1/10 of the size of the dataset. Thus 10 partitions will be used. +IMPORTANT: using less partitions should give better classification results, +but needs a lot of memory. So if the Jobs are failing, try increasing the +number of partitions. +* The example outputs the Build Time and the oob error estimation + + + 10/03/13 17:57:29 INFO mapreduce.BuildForest: Build Time: 0h 7m 43s 582 + 10/03/13 17:57:33 INFO mapreduce.BuildForest: oob error estimate : +0.002325895231517865 + 10/03/13 17:57:33 INFO mapreduce.BuildForest: Storing the forest in: +nsl-forest/forest.seq + + +<a name="PartialImplementation-UsingtheDecisionForesttoClassifynewdata"></a> +## Using the Decision Forest to Classify new data +run the following command: + + $HADOOP_HOME/bin/hadoop jar +$MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar +org.apache.mahout.classifier.df.mapreduce.TestForest -i +nsl-kdd/KDDTest+.arff -ds nsl-kdd/KDDTrain+.info -m nsl-forest -a -mr -o +predictions + +This will compute the predictions of "KDDTest+.arff" dataset (-i argument) +using the same data descriptor generated for the training dataset (-ds) and +the decision forest built previously (-m). Optionally (if the test dataset +contains the labels of the tuples) run the analyzer to compute the +confusion matrix (-a), and you can also store the predictions in a text +file or a directory of text files(-o). Passing the (-mr) parameter will use +Hadoop to distribute the classification. + +* The example should output the classification time and the confusion +matrix + + + 10/03/13 18:08:56 INFO mapreduce.TestForest: Classification Time: 0h 0m 6s +355 + 10/03/13 18:08:56 INFO mapreduce.TestForest: +======================================================= + Summary + ------------------------------------------------------- + Correctly Classified Instances : 17657 78.3224% + Incorrectly Classified Instances : 4887 21.6776% + Total Classified Instances : 22544 + + ======================================================= + Confusion Matrix + ------------------------------------------------------- + a b <--Classified as + 9459 252 | 9711 a = normal + 4635 8198 | 12833 b = anomaly + Default Category: unknown: 2 + + +If the input is a single file then the output will be a single text file, +in the above example 'predictions' would be one single file. If the input +if a directory containing for example two files 'a.data' and 'b.data', then +the output will be a directory 'predictions' containing two files +'a.data.out' and 'b.data.out' + +<a name="PartialImplementation-KnownIssuesandlimitations"></a> +## Known Issues and limitations +The "Decision Forest" code is still "a work in progress", many features are +still missing. Here is a list of some known issues: +* For now, the training does not support multiple input files. The input +dataset must be one single file (this support will be available with the upcoming release). +Classifying new data does support multiple +input files. +* The tree building is done when each mapper.close() method is called. +Because the mappers don't refresh their state, the job can fail when the +dataset is big and you try to build a large number of trees. http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/random-forests.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/random-forests.md b/website/old_site_migration/completed/classification/random-forests.md new file mode 100644 index 0000000..c8b1a47 --- /dev/null +++ b/website/old_site_migration/completed/classification/random-forests.md @@ -0,0 +1,234 @@ +--- +layout: default +title: Random Forests +theme: + name: retro-mahout +--- + +<a name="RandomForests-HowtogrowaDecisionTree"></a> +### How to grow a Decision Tree + +source : \[3\](3\.html) + +LearnUnprunedTree(*X*,*Y*) + +Input: *X* a matrix of *R* rows and *M* columns where *X{*}{*}{~}ij{~}* = +the value of the *j*'th attribute in the *i*'th input datapoint. Each +column consists of either all real values or all categorical values. +Input: *Y* a vector of *R* elements, where *Y{*}{*}{~}i{~}* = the output +class of the *i*'th datapoint. The *Y{*}{*}{~}i{~}* values are categorical. +Output: An Unpruned decision tree + + +If all records in *X* have identical values in all their attributes (this +includes the case where *R<2*), return a Leaf Node predicting the majority +output, breaking ties randomly. This case also includes +If all values in *Y* are the same, return a Leaf Node predicting this value +as the output +Else + select *m* variables at random out of the *M* variables + For *j* = 1 .. *m* + If *j*'th attribute is +categorical +* +IG{*}{*}{~}j{~}* = IG(*Y*\|*X{*}{*}{~}j{~}*) (see Information +Gain) + Else (*j*'th attribute is +real-valued) +* +IG{*}{*}{~}j{~}* = IG*(*Y*\|*X{*}{*}{~}j{~}*) (see Information Gain) + Let *j\** = argmax{~}j~ *IG{*}{*}{~}j{~}* (this is the +splitting attribute we'll use) + If *j\** is categorical then + For each value *v* of the *j*'th +attribute + Let +*X{*}{*}{^}v{^}* = subset of rows of *X* in which *X{*}{*}{~}ij{~}* = *v*. +Let *Y{*}{*}{^}v{^}* = corresponding subset of *Y* + Let *Child{*}{*}{^}v{^}* = +LearnUnprunedTree(*X{*}{*}{^}v{^}*,*Y{*}{*}{^}v{^}*) + Return a decision tree node, +splitting on *j*'th attribute. The number of children equals the number of +values of the *j*'th attribute, and the *v*'th child is +*Child{*}{*}{^}v{^}* + Else *j\** is real-valued and let *t* be the best split +threshold + Let *X{*}{*}{^}LO{^}* = subset +of rows of *X* in which *X{*}{*}{~}ij{~}* *<= t*. Let *Y{*}{*}{^}LO{^}* = +corresponding subset of *Y* + Let *Child{*}{*}{^}LO{^}* = +LearnUnprunedTree(*X{*}{*}{^}LO{^}*,*Y{*}{*}{^}LO{^}*) + Let *X{*}{*}{^}HI{^}* = subset of rows of *X* +in which *X{*}{*}{~}ij{~}* *> t*. Let *Y{*}{*}{^}HI{^}* = corresponding +subset of *Y* + Let *Child{*}{*}{^}HI{^}* = +LearnUnprunedTree(*X{*}{*}{^}HI{^}*,*Y{*}{*}{^}HI{^}*) + Return a decision tree node, splitting on +*j*'th attribute. It has two children corresponding to whether the *j*'th +attribute is above or below the given threshold. + +*Note*: There are alternatives to Information Gain for splitting nodes + + +<a name="RandomForests-Informationgain"></a> +### Information gain + +source : \[3\](3\.html) +1. h4. nominal attributes + +suppose X can have one of m values V{~}1~,V{~}2~,...,V{~}m~ +P(X=V{~}1~)=p{~}1~, P(X=V{~}2~)=p{~}2~,...,P(X=V{~}m~)=p{~}m~ + +H(X)= \-sum{~}j=1{~}{^}m^ p{~}j~ log{~}2~ p{~}j~ (The entropy of X) +H(Y\|X=v) = the entropy of Y among only those records in which X has value +v +H(Y\|X) = sum{~}j~ p{~}j~ H(Y\|X=v{~}j~) +IG(Y\|X) = H(Y) - H(Y\|X) +1. h4. real-valued attributes + +suppose X is real valued +define IG(Y\|X:t) as H(Y) - H(Y\|X:t) +define H(Y\|X:t) = H(Y\|X<t) P(X<t) + H(Y\|X>=t) P(X>=t) +define IG*(Y\|X) = max{~}t~ IG(Y\|X:t) + +<a name="RandomForests-HowtogrowaRandomForest"></a> +### How to grow a Random Forest + +source : \[1\](1\.html) + +Each tree is grown as follows: +1. if the number of cases in the training set is *N*, sample *N* cases at +random \-but with replacement, from the original data. This sample will be +the training set for the growing tree. +1. if there are *M* input variables, a number *m << M* is specified such +that at each node, *m* variables are selected at random out of the *M* and +the best split on these *m* is used to split the node. The value of *m* is +held constant during the forest growing. +1. each tree is grown to its large extent possible. There is no pruning. + +<a name="RandomForests-RandomForestparameters"></a> +### Random Forest parameters + +source : \[2\](2\.html) +Random Forests are easy to use, the only 2 parameters a user of the +technique has to determine are the number of trees to be used and the +number of variables (*m*) to be randomly selected from the available set of +variables. +Breinman's recommendations are to pick a large number of trees, as well as +the square root of the number of variables for *m*. + + +<a name="RandomForests-Howtopredictthelabelofacase"></a> +### How to predict the label of a case + +Classify(*node*,*V*) + Input: *node* from the decision tree, if *node.attribute += j* then the split is done on the *j*'th attribute + + Input: *V* a vector of *M* columns where +*V{*}{*}{~}j{~}* = the value of the *j*'th attribute. + Output: label of *V* + + If *node* is a Leaf then + Return the value predicted +by *node* + + Else + Let *j = +node.attribute* + If *j* is +categorical then + + +Let *v* = *V{*}{*}{~}j{~}* + + +Let *child{*}{*}{^}v{^}* = child node corresponding to the attribute's +value *v* + + Return Classify(*child{*}{*}{^}v{^}*,*V*) + + Else *j* is +real-valued + + +Let *t = node.threshold* (split threshold) + + If Vj < t then + + Let *child{*}{*}{^}LO{^}* = child +node corresponding to (*<t*) + + Return +Classify(*child{*}{*}{^}LO{^}*,*V*) + + +Else + + Let *child{*}{*}{^}HI{^}* = +child node corresponding to (*>=t*) + + Return +Classify(*child{*}{*}{^}HI{^}*,*V*) + + +<a name="RandomForests-Theoutofbag(oob)errorestimation"></a> +### The out of bag (oob) error estimation + +source : \[1\](1\.html) + +in random forests, there is no need for cross-validation or a separate test +set to get an unbiased estimate of the test set error. It is estimated +internally, during the run, as follows: +* each tree is constructed using a different bootstrap sample from the +original data. About one-third of the cases left of the bootstrap sample +and not used in the construction of the _kth_ tree. +* put each case left out in the construction of the _kth_ tree down the +_kth{_}tree to get a classification. In this way, a test set classification +is obtained for each case in about one-thrid of the trees. At the end of +the run, take *j* to be the class that got most of the the votes every time +case *n* was _oob_. The proportion of times that *j* is not equal to the +true class of *n* averaged over all cases is the _oob error estimate_. This +has proven to be unbiased in many tests. + +<a name="RandomForests-OtherRFuses"></a> +### Other RF uses + +source : \[1\](1\.html) +* variable importance +* gini importance +* proximities +* scaling +* prototypes +* missing values replacement for the training set +* missing values replacement for the test set +* detecting mislabeled cases +* detecting outliers +* detecting novelties +* unsupervised learning +* balancing prediction error +Please refer to \[1\](1\.html) + for a detailed description + +<a name="RandomForests-References"></a> +### References + +\[1\](1\.html) + Random Forests - Classification Description + [http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm) +\[2\](2\.html) + B. Larivi�re & D. Van Den Poel, 2004. "Predicting Customer Retention +and Profitability by Using Random Forests and Regression Forests +Techniques," + Working Papers of Faculty of +Economics and Business Administration, Ghent University, Belgium 04/282, +Ghent University, + Faculty of Economics and +Business Administration. + Available online : [http://ideas.repec.org/p/rug/rugwps/04-282.html](http://ideas.repec.org/p/rug/rugwps/04-282.html) +\[3\](3\.html) + Decision Trees - Andrew W. Moore\[4\] + http://www.cs.cmu.edu/~awm/tutorials\[1\](1\.html) +\[4\](4\.html) + Information Gain - Andrew W. Moore + [http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials) http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/restricted-boltzmann-machines.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/restricted-boltzmann-machines.md b/website/old_site_migration/completed/classification/restricted-boltzmann-machines.md new file mode 100644 index 0000000..0aa8641 --- /dev/null +++ b/website/old_site_migration/completed/classification/restricted-boltzmann-machines.md @@ -0,0 +1,49 @@ +--- +layout: default +title: Restricted Boltzmann Machines +theme: + name: retro-mahout +--- + +NOTE: This implementation is a Work-In-Progress, at least till September, +2010. + +The JIRA issue is [here](https://issues.apache.org/jira/browse/MAHOUT-375) +. + +<a name="RestrictedBoltzmannMachines-BoltzmannMachines"></a> +### Boltzmann Machines +Boltzmann Machines are a type of stochastic neural networks that closely +resemble physical processes. They define a network of units with an overall +energy that is evolved over a period of time, until it reaches thermal +equilibrium. + +However, the convergence speed of Boltzmann machines that have +unconstrained connectivity is low. + +<a name="RestrictedBoltzmannMachines-RestrictedBoltzmannMachines"></a> +### Restricted Boltzmann Machines +Restricted Boltzmann Machines are a variant, that are 'restricted' in the +sense that connections between hidden units of a single layer are _not_ +allowed. In addition, stacking multiple RBM's is also feasible, with the +activities of the hidden units forming the base for a higher-level RBM. The +combination of these two features renders RBM's highly usable for +parallelization. + +In the Netflix Prize, RBM's offered distinctly orthogonal predictions to +SVD and k-NN approaches, and contributed immensely to the final solution. + +<a name="RestrictedBoltzmannMachines-RBM'sinApacheMahout"></a> +### RBM's in Apache Mahout +An implementation of Restricted Boltzmann Machines is being developed for +Apache Mahout as a Google Summer of Code 2010 project. A recommender +interface will also be provided. The key aims of the implementation are: +1. Accurate - should replicate known results, including those of the Netflix +Prize +1. Fast - The implementation uses Map-Reduce, hence, it should be fast +1. Scale - Should scale to large datasets, with a design whose critical +parts don't need a dependency between the amount of memory on your cluster +systems and the size of your dataset + +You can view the patch as it develops [here](http://github.com/sisirkoppaka/mahout-rbm/compare/trunk...rbm) +. http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/classification/support-vector-machines.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/classification/support-vector-machines.md b/website/old_site_migration/completed/classification/support-vector-machines.md new file mode 100644 index 0000000..6d1b9df --- /dev/null +++ b/website/old_site_migration/completed/classification/support-vector-machines.md @@ -0,0 +1,43 @@ +--- +layout: default +title: Support Vector Machines +theme: + name: retro-mahout +--- + +<a name="SupportVectorMachines-SupportVectorMachines"></a> +# Support Vector Machines + +As with Naive Bayes, Support Vector Machines (or SVMs in short) can be used +to solve the task of assigning objects to classes. However, the way this +task is solved is completely different to the setting in Naive Bayes. + +Each object is considered to be a point in _n_ dimensional feature space, +_n_ being the number of features used to describe the objects numerically. +In addition each object is assigned a binary label, let us assume the +labels are "positive" and "negative". During learning, the algorithm tries +to find a hyperplane in that space, that perfectly separates positive from +negative objects. +It is trivial to think of settings where this might very well be +impossible. To remedy this situation, objects can be assigned so called +slack terms, that punish mistakes made during learning appropriately. That +way, the algorithm is forced to find the hyperplane that causes the least +number of mistakes. + +Another way to overcome the problem of there being no linear hyperplane to +separate positive from negative objects is to simply project each feature +vector into an higher dimensional feature space and search for a linear +separating hyperplane in that new space. Usually the main problem with +learning in high dimensional feature spaces is the so called curse of +dimensionality. That is, there are fewer learning examples available than +free parameters to tune. In the case of SVMs this problem is less +detrimental, as SVMs impose additional structural constraints on their +solutions. Each separating hyperplane needs to have a maximal margin to all +training examples. In addition, that way, the solution may be based on the +information encoded in only very few examples. + +<a name="SupportVectorMachines-Strategyforparallelization"></a> +## Strategy for parallelization + +<a name="SupportVectorMachines-Designofpackages"></a> +## Design of packages http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/environment/h2o-internals.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/environment/h2o-internals.md b/website/old_site_migration/completed/environment/h2o-internals.md new file mode 100644 index 0000000..c72a7ae --- /dev/null +++ b/website/old_site_migration/completed/environment/h2o-internals.md @@ -0,0 +1,51 @@ +--- +layout: default +title: +theme: + name: retro-mahout +--- + +# Introduction + +This document provides an overview of how the Mahout Samsara environment is implemented over the H2O backend engine. The document is aimed at Mahout developers, to give a high level description of the design so that one can explore the code inside `h2o/` with some context. + +## H2O Overview + +H2O is a distributed scalable machine learning system. Internal architecture of H2O has a distributed math engine (h2o-core) and a separate layer on top for algorithms and UI. The Mahout integration requires only the math engine (h2o-core). + +## H2O Data Model + +The data model of the H2O math engine is a distributed columnar store (of primarily numbers, but also strings). A column of numbers is called a Vector, which is broken into Chunks (of a few thousand elements). Chunks are distributed across the cluster based on a deterministic hash. Therefore, any member of the cluster knows where a particular Chunk of a Vector is homed. Each Chunk is separately compressed in memory and elements are individually decompressed on the fly upon access with purely register operations (thereby achieving high memory throughput). An ordered set of similarly partitioned Vecs are composed into a Frame. A Frame is therefore a large two dimensional table of numbers. All elements of a logical row in the Frame are guaranteed to be homed in the same server of the cluster. Generally speaking, H2O works well on "tall skinny" data, i.e, lots of rows (100s of millions) and modest number of columns (10s of thousands). + + +## Mahout DRM + +The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large matrix of numbers in-memory in a cluster by distributing logical rows among servers. Mahout's scala DSL provides an abstract API on DRMs for backend engines to provide implementations of this API. Examples are the Spark and H2O backend engines. Each engine has it's own design of mapping the abstract API onto its data model and provides implementations for algebraic operators over that mapping. + + +## H2O Environment Engine + +The H2O backend implements the abstract DRM as an H2O Frame. Each logical column in the DRM is an H2O Vector. All elements of a logical DRM row are guaranteed to be homed on the same server. A set of rows stored on a server are presented as a read-only virtual in-core Matrix (i.e BlockMatrix) for the closure method in the `mapBlock(...)` API. + +H2O provides a flexible execution framework called `MRTask`. The `MRTask` framework typically executes over a Frame (or even a Vector), supports various types of map() methods, can optionally modify the Frame or Vector (though this never happens in the Mahout integration), and optionally create a new Vector or set of Vectors (to combine them into a new Frame, and consequently a new DRM). + + +## Source Layout + +Within mahout.git, the top level directory, `h2o/` holds all the source code related to the H2O backend engine. Part of the code (that interfaces with the rest of the Mahout componenets) is in Scala, and part of the code (that interfaces with h2o-core and implements algebraic operators) is in Java. Here is a brief overview of what functionality can be found where within `h2o/`. + + h2o/ - top level directory containing all H2O related code + + h2o/src/main/java/org/apache/mahout/h2obindings/ops/*.java - Physical operator code for the various DSL algebra + + h2o/src/main/java/org/apache/mahout/h2obindings/drm/*.java - DRM backing (onto Frame) and Broadcast implementation + + h2o/src/main/java/org/apache/mahout/h2obindings/H2OHdfs.java - Read / Write between DRM (Frame) and files on HDFS + + h2o/src/main/java/org/apache/mahout/h2obindings/H2OBlockMatrix.java - A vertical block matrix of DRM presented as a virtual copy-on-write in-core Matrix. Used in mapBlock() API + + h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java - A collection of various functionality and helpers. For e.g, convert between in-core Matrix and DRM, various summary statistics on DRM/Frame. + + h2o/src/main/scala/org/apache/mahout/h2obindings/H2OEngine.scala - DSL operator graph evaluator and various abstract API implementations for a distributed engine + + h2o/src/main/scala/org/apache/mahout/h2obindings/* - Various abstract API implementations ("glue work") \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/environment/spark-internals.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/environment/spark-internals.md b/website/old_site_migration/completed/environment/spark-internals.md new file mode 100644 index 0000000..f5d72a4 --- /dev/null +++ b/website/old_site_migration/completed/environment/spark-internals.md @@ -0,0 +1,25 @@ +--- +layout: default +title: +theme: + name: retro-mahout +--- + +# Introduction + +This document provides an overview of how the Mahout Scala DSL (distributed algebraic operators) is implemented over the Spark back end engine. The document is aimed at Mahout developers, to give a high level description of the design. + +## Spark Overview + +## Spark Data Model + + +## Mahout DRM + +Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large matrix of numbers in-memory in a cluster by distributing logical rows among servers. The DSL provides an abstract API on DRMs for backend engines to provide implementations of this API. Examples are Spark and H2O backend engines. Each engine has its own design of mapping the abstract API onto its data model and provide implementations for algebraic operators over that mapping. + + +## Spark DSL Engine + + +## Source Layout http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/flinkbindings/flink-internals.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/completed/flinkbindings/flink-internals.md b/website/old_site_migration/completed/flinkbindings/flink-internals.md new file mode 100644 index 0000000..8c8145a --- /dev/null +++ b/website/old_site_migration/completed/flinkbindings/flink-internals.md @@ -0,0 +1,50 @@ +--- +layout: default +title: +theme: + name: retro-mahout +--- + +#Introduction + +This document provides an overview of how the Mahout Samsara environment is implemented over the Apache Flink backend engine. This document gives an overview of the code layout for the Flink backend engine, the source code for which can be found under /flink directory in the Mahout codebase. + +Apache Flink is a distributed big data streaming engine that supports both Streaming and Batch interfaces. Batch processing is an extension of Flinkâs Stream processing engine. + +The Mahout Flink integration presently supports Flinkâs batch processing capabilities leveraging the DataSet API. + +The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large matrix of numbers in-memory in a cluster by distributing logical rows among servers. Mahout's scala DSL provides an abstract API on DRMs for backend engines to provide implementations of this API. An example is the Spark backend engine. Each engine has it's own design of mapping the abstract API onto its data model and provides implementations for algebraic operators over that mapping. + +#Flink Overview + +Apache Flink is an open source, distributed Stream and Batch Processing Framework. At it's core, Flink is a Stream Processing engine and Batch processing is an extension of Stream Processing. + +Flink includes several APIs for building applications with the Flink Engine: + + <ol> +<li><b>DataSet API</b> for Batch data in Java, Scala and Python</li> +<li><b>DataStream API</b> for Stream Processing in Java and Scala</li> +<li><b>Table API</b> with SQL-like regular expression language in Java and Scala</li> +<li><b>Gelly</b> Graph Processing API in Java and Scala</li> +<li><b>CEP API</b>, a complex event processing library</li> +<li><b>FlinkML</b>, a Machine Learning library</li> +</ol> +#Flink Environment Engine + +The Flink backend implements the abstract DRM as a Flink DataSet. A Flink job runs in the context of an ExecutionEnvironment (from the Flink Batch processing API). + +#Source Layout + +Within mahout.git, the top level directory, flink/ holds all the source code for the Flink backend engine. Sections of code that interface with the rest of the Mahout components are in Scala, and sections of the code that interface with Flink DataSet API and implement algebraic operators are in Java. Here is a brief overview of what functionality can be found within flink/ folder. + +flink/ - top level directory containing all Flink related code + +flink/src/main/scala/org/apache/mahout/flinkbindings/blas/*.scala - Physical operator code for the Samsara DSL algebra + +flink/src/main/scala/org/apache/mahout/flinkbindings/drm/*.scala - Flink Dataset DRM and broadcast implementation + +flink/src/main/scala/org/apache/mahout/flinkbindings/io/*.scala - Read / Write between DRMDataSet and files on HDFS + +flink/src/main/scala/org/apache/mahout/flinkbindings/FlinkEngine.scala - DSL operator graph evaluator and various abstract API implementations for a distributed engine. + +
