http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/clustering/streaming-k-means.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/clustering/streaming-k-means.md b/website-old/docs/algorithms/map-reduce/clustering/streaming-k-means.md
new file mode 100644
index 0000000..599921f
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/clustering/streaming-k-means.md
@@ -0,0 +1,174 @@
+---
+layout: algorithm
+title: (Deprecated) Streaming KMeans
+theme:
+  name: retro-mahout
+---
+
+# *StreamingKMeans* algorithm
+
+The *StreamingKMeans* algorithm is a variant of Algorithm 1 from [Shindler et al][1] and consists of two steps:
+
+ 1. Streaming step
+ 2. BallKMeans step.
+
+The streaming step is a randomized algorithm that makes one pass through the data and
+produces as many centroids as it determines is optimal. This step can be viewed as
+a preparatory dimensionality reduction. If the size of the data stream is *n* and the
+expected number of clusters is *k*, the streaming step will produce roughly *k\*log(n)*
+clusters that will be passed on to the BallKMeans step, which will further reduce the
+number of clusters down to *k*. BallKMeans is a randomized Lloyd-type algorithm that
+has been studied in detail; see [Ostrovsky et al][2].
+
+## Streaming step
+
+---
+
+### Overview
+
+The streaming step is a derivative of the streaming
+portion of Algorithm 1 in [Shindler et al][1]. The main difference between the two is that
+Algorithm 1 of [Shindler et al][1] assumes
+knowledge of the size of the data stream and uses it to set a key parameter
+for the algorithm. More precisely, the initial *distanceCutoff* (defined below), which is
+denoted by *f* in [Shindler et al][1], is set to *1/(k(1+log(n)))*. The *distanceCutoff* influences the number of clusters that the algorithm
+will produce.
+In contrast, the Mahout implementation does not require knowledge of the size of the
+data stream. Instead, it dynamically re-evaluates the parameters that depend on the size
+of the data stream at runtime as more and more data is processed. In particular,
+the parameter *numClusters* (defined below) changes its value as the data is processed.
+
+###Parameters
+
+ - **numClusters** (int): Conceptually, *numClusters* represents the algorithm's guess at the optimal
+number of clusters it is shooting for. In particular, *numClusters* will increase at run
+time as more and more data is processed. Note that *numClusters* is not the number of clusters that the algorithm will produce. Also, *numClusters* should not be set to the final number of clusters that we expect to receive as the output of *StreamingKMeans*.
+ - **distanceCutoff** (double): a parameter representing the value of the distance between a point and
+its closest centroid after which
+the new point will definitely be assigned to a new cluster. *distanceCutoff* can be thought
+of as an estimate of the variable *f* from Shindler et al. The default initial value for
+*distanceCutoff* is *1.0/numClusters* and *distanceCutoff* grows as a geometric progression with
+common ratio *beta* (see below).
+ - **beta** (double): a constant parameter that controls the growth of *distanceCutoff*. If the initial setting of *distanceCutoff* is *d0*, *distanceCutoff* will grow as the geometric progression with initial term *d0* and common ratio *beta*. The default value for *beta* is 1.3.
+ - **clusterLogFactor** (double): a constant parameter such that *clusterLogFactor* *log(numProcessedPoints)* is the runtime estimate of the number of clusters to be produced by the streaming step. If the final number of clusters (that we expect *StreamingKMeans* to output) is *k*, *clusterLogFactor* can be set to *k*. + - **clusterOvershoot** (double): a constant multiplicative slack factor that slows down the collapsing of clusters. The default value is 2. + + +###Algorithm + +The algorithm processes the data one-by-one and makes only one pass through the data. +The first point from the data stream will form the centroid of the first cluster (this designation may change as more points are processed). Suppose there are *r* clusters at one point and a new point *p* is being processed. The new point can either be added to one of the existing *r* clusters or become a new cluster. To decide: + + - let *c* be the closest cluster to point *p* + - let *d* be the distance between *c* and *p* + - if *d > distanceCutoff*, create a new cluster from *p* (*p* is too far away from the clusters to be part of any one of them) + - else (*d <= distanceCutoff*), create a new cluster with probability *d / distanceCutoff* (the probability of creating a new cluster increases as *d* increases). + +There will be either *r* or *r+1* clusters after processing a new point. + +As the number of clusters increases, it will go over the *clusterOvershoot \* numClusters* limit (*numClusters* represents a recommendation for the number of clusters that the streaming step should aim for and *clusterOvershoot* is the slack). To decrease the number of clusters the existing clusters +are treated as data points and are re-clustered (collapsed). This tends to make the number of clusters go down. If the number of clusters is still too high, *distanceCutoff* is increased. + +## BallKMeans step +--- +### Overview +The algorithm is a Lloyd-type algorithm that takes a set of weighted vectors and returns k centroids, see [Ostrovsky et al][2] for details. The algorithm has two stages: + + 1. Seeding + 2. Ball k-means + +The seeding stage is an initial guess of where the centroids should be. The initial guess is improved using the ball k-means stage. + +### Parameters + +* **numClusters** (int): the number k of centroids to return. The algorithm will return exactly this number of centroids. + +* **maxNumIterations** (int): After seeding, the iterative clustering procedure will be run at most *maxNumIterations* times. 1 or 2 iterations are recommended. Increasing beyond this will increase the accuracy of the result at the expense of runtime. Each successive iteration yields diminishing returns in lowering the cost. + +* **trimFraction** (double): Outliers are ignored when computing the center of mass for a cluster. For any datapoint *x*, let *c* be the nearest centroid. Let *d* be the minimum distance from *c* to another centroid. If the distance from *x* to *c* is greater than *trimFraction \* d*, then *x* is considered an outlier during that iteration of ball k-means. The default is 9/10. In [Ostrovsky et al][2], the authors use *trimFraction* = 1/3, but this does not mean that 1/3 is optimal in practice. + +* **kMeansPlusPlusInit** (boolean): If true, the seeding method is k-means++. If false, the seeding method is to select points uniformly at random. The default is true. + +* **correctWeights** (boolean): If *correctWeights* is true, outliers will be considered when calculating the weight of centroids. The default is true. 
Note that outliers are not considered when calculating the position of centroids. + +* **testProbability** (double): If *testProbability* is *p* (0 < *p* < 1), the data (of size n) is partitioned into a test set (of size *p\*n*) and a training set (of size *(1-p)\*n*). If 0, no test set is created (the entire data set is used for both training and testing). The default is 0.1 if *numRuns* > 1. If *numRuns* = 1, then no test set should be created (since it is only used to compare the cost between different runs). + +* **numRuns** (int): This is the number of runs to perform. The solution of lowest cost is returned. The default is 1 run. + +###Algorithm +The algorithm can be instructed to take multiple independent runs (using the *numRuns* parameter) and the algorithm will select the best solution (i.e., the one with the lowest cost). In practice, one run is sufficient to find a good solution. + +Each run operates as follows: a seeding procedure is used to select k centroids, and then ball k-means is run iteratively to refine the solution. + +The seeding procedure can be set to either 'uniformly at random' or 'k-means++' using *kMeansPlusPlusInit* boolean variable. Seeding with k-means++ involves more computation but offers better results in practice. + +Each iteration of ball k-means runs as follows: + +1. Clusters are formed by assigning each datapoint to the nearest centroid +2. The centers of mass of the trimmed clusters (see *trimFraction* parameter above) become the new centroids + +The data may be partitioned into a test set and a training set (see *testProbability*). The seeding procedure and ball k-means run on the training set. The cost is computed on the test set. + + +##Usage of *StreamingKMeans* + + bin/mahout streamingkmeans + -i <input> + -o <output> + -ow + -k <k> + -km <estimatedNumMapClusters> + -e <estimatedDistanceCutoff> + -mi <maxNumIterations> + -tf <trimFraction> + -ri + -iw + -testp <testProbability> + -nbkm <numBallKMeansRuns> + -dm <distanceMeasure> + -sc <searcherClass> + -np <numProjections> + -s <searchSize> + -rskm + -xm <method> + -h + --tempDir <tempDir> + --startPhase <startPhase> + --endPhase <endPhase> + + +###Details on Job-Specific Options: + + * `--input (-i) <input>`: Path to job input directory. + * `--output (-o) <output>`: The directory pathname for output. + * `--overwrite (-ow)`: If present, overwrite the output directory before running job. + * `--numClusters (-k) <k>`: The k in k-Means. Approximately this many clusters will be generated. + * `--estimatedNumMapClusters (-km) <estimatedNumMapClusters>`: The estimated number of clusters to use for the Map phase of the job when running StreamingKMeans. This should be around k \* log(n), where k is the final number of clusters and n is the total number of data points to cluster. + * `--estimatedDistanceCutoff (-e) <estimatedDistanceCutoff>`: The initial estimated distance cutoff between two points for forming new clusters. If no value is given, it's estimated from the data set + * `--maxNumIterations (-mi) <maxNumIterations>`: The maximum number of iterations to run for the BallKMeans algorithm used by the reducer. If no value is given, defaults to 10. + * `--trimFraction (-tf) <trimFraction>`: The 'ball' aspect of ball k-means means that only the closest points to the centroid will actually be used for updating. The fraction of the points to be used is those points whose distance to the center is within trimFraction \* distance to the closest other center. If no value is given, defaults to 0.9. 
+ * `--randomInit` (`-ri`) Whether to use k-means++ initialization or random initialization of the seed centroids. Essentially, k-means++ provides better clusters, but takes longer, whereas random initialization takes less time, but produces worse clusters, and tends to fail more often and needs multiple runs to compare to k-means++. If set, uses the random initialization. + * `--ignoreWeights (-iw)`: Whether to correct the weights of the centroids after the clustering is done. The weights end up being wrong because of the trimFraction and possible train/test splits. In some cases, especially in a pipeline, having an accurate count of the weights is useful. If set, ignores the final weights. + * `--testProbability (-testp) <testProbability>`: A double value between 0 and 1 that represents the percentage of points to be used for 'testing' different clustering runs in the final BallKMeans step. If no value is given, defaults to 0.1 + * `--numBallKMeansRuns (-nbkm) <numBallKMeansRuns>`: Number of BallKMeans runs to use at the end to try to cluster the points. If no value is given, defaults to 4 + * `--distanceMeasure (-dm) <distanceMeasure>`: The classname of the DistanceMeasure. Default is SquaredEuclidean. + * `--searcherClass (-sc) <searcherClass>`: The type of searcher to be used when performing nearest neighbor searches. Defaults to ProjectionSearch. + * `--numProjections (-np) <numProjections>`: The number of projections considered in estimating the distances between vectors. Only used when the distance measure requested is either ProjectionSearch or FastProjectionSearch. If no value is given, defaults to 3. + * `--searchSize (-s) <searchSize>`: In more efficient searches (non BruteSearch), not all distances are calculated for determining the nearest neighbors. The number of elements whose distances from the query vector is actually computer is proportional to searchSize. If no value is given, defaults to 1. + * `--reduceStreamingKMeans (-rskm)`: There might be too many intermediate clusters from the mapper to fit into memory, so the reducer can run another pass of StreamingKMeans to collapse them down to a fewer clusters. + * `--method (-xm)` method The execution method to use: sequential or mapreduce. Default is mapreduce. + * `-- help (-h)`: Print out help + * `--tempDir <tempDir>`: Intermediate output directory. + * `--startPhase <startPhase>` First phase to run. + * `--endPhase <endPhase>` Last phase to run. + + +##References + +1. [M. Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large Datasets][1] +2. [R. Ostrovsky, Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of Lloyd-Type Methods for the k-means Problem][2] + + +[1]: http://nips.cc/Conferences/2011/Program/event.php?ID=2989 "M. Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large Datasets" + +[2]: http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf "R. Ostrovsky, Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of Lloyd-Type Methods for the k-means Problem"
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/index.md ---------------------------------------------------------------------- diff --git a/website-old/docs/algorithms/map-reduce/index.md b/website-old/docs/algorithms/map-reduce/index.md new file mode 100644 index 0000000..14d099a --- /dev/null +++ b/website-old/docs/algorithms/map-reduce/index.md @@ -0,0 +1,42 @@ +--- +layout: algorithm +title: (Deprecated) Deprecated Map Reduce Algorithms +theme: + name: mahout2 +--- + +### Classification + +[Bayesian](classification/bayesian.html) + +[Class Discovery](classification/class-discovery.html) + +[Classifying Your Data](classification/classifyingyourdata.html) + +[Collocations](classification/collocations.html) + +[Gaussian Discriminative Analysis](classification/gaussian-discriminative-analysis.html) + +[Hidden Markov Models](classification/hidden-markov-models.html) + +[Independent Component Analysis](classification/independent-component-analysis.html) + +[Locally Weighted Linear Regression](classification/locally-weighted-linear-regression.html) + +[Logistic Regression](classification/logistic-regression.html) + +[Mahout Collections](classification/mahout-collections.html) + +[Multilayer Perceptron](classification/mlp.html) + +[Naive Bayes](classification/naivebayes.html) + +[Neural Network](classification/neural-network.html) + +[Partial Implementation](classification/partial-implementation.html) + +[Random Forrests](classification/random-forrests.html) + +[Restricted Boltzman Machines](classification/restricted-boltzman-machines.html) + +[Support Vector Machines](classification/support-vector-machines.html) http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/preprocessors/AsFactor.md ---------------------------------------------------------------------- diff --git a/website-old/docs/algorithms/preprocessors/AsFactor.md b/website-old/docs/algorithms/preprocessors/AsFactor.md new file mode 100644 index 0000000..255f982 --- /dev/null +++ b/website-old/docs/algorithms/preprocessors/AsFactor.md @@ -0,0 +1,35 @@ +--- +layout: algorithm +title: AsFactor +theme: + name: mahout2 +--- + + +### About + +The `AsFactor` preprocessor is used to turn the integer values of the columns into sparse vectors where the value is 1 + at the index that corresponds to the 'category' of that column. This is also known as "One Hot Encoding" in many other + packages. + + +### Parameters + +`AsFactor` takes no parameters. + +### Example + +```scala +import org.apache.mahout.math.algorithms.preprocessing.AsFactor + +val A = drmParallelize(dense( + (3, 2, 1, 2), + (0, 0, 0, 0), + (1, 1, 1, 1)), numPartitions = 2) + +// 0 -> 2, 3 -> 5, 6 -> 9 +val factorizer: AsFactorModel = new AsFactor().fit(A) + +val factoredA = factorizer.transform(A) +``` + http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/preprocessors/MeanCenter.md ---------------------------------------------------------------------- diff --git a/website-old/docs/algorithms/preprocessors/MeanCenter.md b/website-old/docs/algorithms/preprocessors/MeanCenter.md new file mode 100644 index 0000000..771b6a3 --- /dev/null +++ b/website-old/docs/algorithms/preprocessors/MeanCenter.md @@ -0,0 +1,30 @@ +--- +layout: algorithm +title: MeanCenter +theme: + name: mahout2 +--- + +### About + +`MeanCenter` centers values about the column mean. 
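+
+Concretely, for a column \(j\) with mean \(\bar{x}_{j}\), each entry is transformed as
+
+<center>\(x'_{ij} = x_{ij} - \bar{x}_{j}\)</center>
+
+so that every column of the output has mean zero. (This is the standard definition of mean centering, stated here for clarity; the formula is not excerpted from the Mahout source.)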
+
+### Parameters
+
+### Example
+
+```scala
+import org.apache.mahout.math.algorithms.preprocessing.MeanCenter
+
+val A = drmParallelize(dense(
+      (1, 1, -2),
+      (2, 5, 2),
+      (3, 9, 0)), numPartitions = 2)
+
+val scaler: MeanCenterModel = new MeanCenter().fit(A)
+
+val centeredA = scaler.transform(A)
+```
+
+
+
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/preprocessors/StandardScaler.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/preprocessors/StandardScaler.md b/website-old/docs/algorithms/preprocessors/StandardScaler.md
new file mode 100644
index 0000000..5b33709
--- /dev/null
+++ b/website-old/docs/algorithms/preprocessors/StandardScaler.md
@@ -0,0 +1,44 @@
+---
+layout: algorithm
+title: StandardScaler
+theme:
+  name: mahout2
+---
+
+### About
+
+`StandardScaler` centers the values of each column about the column mean, and scales them to unit variance.
+
+#### Relation to the `scale` function in R-base
+The `StandardScaler` is the equivalent of the R-base function [`scale`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html) with
+one notable tweak. R's `scale` function (indeed all of R) calculates standard deviation with 1 degree of freedom; Mahout
+(like many other statistical packages aimed at larger data sets) does not make this adjustment. On larger datasets the difference
+is trivial, but when testing the function on smaller datasets the practitioner may be confused by the discrepancy.
+
+To verify this function against R on an arbitrary matrix, use the following form in R to "undo" the degrees of freedom correction.
+```R
+N <- nrow(x)
+scale(x, scale = apply(x, 2, sd) * sqrt((N - 1) / N))
+```
+
+### Parameters
+
+`StandardScaler` takes no parameters at this time.
+
+### Example
+
+
+```scala
+import org.apache.mahout.math.algorithms.preprocessing.StandardScaler
+
+val A = drmParallelize(dense(
+      (1, 1, 5),
+      (2, 5, -15),
+      (3, 9, -2)), numPartitions = 2)
+
+val scaler: StandardScalerModel = new StandardScaler().fit(A)
+
+val scaledA = scaler.transform(A)
+```
+
+
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/preprocessors/index.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/preprocessors/index.md b/website-old/docs/algorithms/preprocessors/index.md
new file mode 100644
index 0000000..204e06c
--- /dev/null
+++ b/website-old/docs/algorithms/preprocessors/index.md
@@ -0,0 +1,10 @@
+---
+layout: algorithm
+title: Preprocessors
+theme:
+  name: mahout2
+---
+
+[AsFactor](AsFactor.html) - For "one-hot-encoding"
+
+[StandardScaler](StandardScaler.html) - For mean centering and unit variance
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/reccomenders/cco.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/reccomenders/cco.md b/website-old/docs/algorithms/reccomenders/cco.md
new file mode 100644
index 0000000..4a6f3f8
--- /dev/null
+++ b/website-old/docs/algorithms/reccomenders/cco.md
@@ -0,0 +1,435 @@
+---
+layout: algorithm
+title: Intro to Cooccurrence Recommenders with Spark
+theme:
+  name: retro-mahout
+---
+
+Mahout provides several important building blocks for creating recommendations using Spark.
*spark-itemsimilarity* can +be used to create "other people also liked these things" type recommendations and paired with a search engine can +personalize recommendations for individual users. *spark-rowsimilarity* can provide non-personalized content based +recommendations and when paired with a search engine can be used to personalize content based recommendations. + +## References + +1. A free ebook, which talks about the general idea: [Practical Machine Learning](https://www.mapr.com/practical-machine-learning) +2. A slide deck, which talks about mixing actions or other indicators: [Creating a Unified Recommender](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/) +3. Two blog posts: [What's New in Recommenders: part #1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/) +and [What's New in Recommenders: part #2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/) +3. A post describing the loglikelihood ratio: [Surprise and Coinsidense](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html) LLR is used to reduce noise in the data while keeping the calculations O(n) complexity. + +Below are the command line jobs but the drivers and associated code can also be customized and accessed from the Scala APIs. + +## 1. spark-itemsimilarity +*spark-itemsimilarity* is the Spark counterpart of the of the Mahout mapreduce job called *itemsimilarity*. It takes in elements of interactions, which have userID, itemID, and optionally a value. It will produce one of more indicator matrices created by comparing every user's interactions with every other user. The indicator matrix is an item x item matrix where the values are log-likelihood ratio strengths. For the legacy mapreduce version, there were several possible similarity measures but these are being deprecated in favor of LLR because in practice it performs the best. + +Mahout's mapreduce version of itemsimilarity takes a text file that is expected to have user and item IDs that conform to +Mahout's ID requirements--they are non-negative integers that can be viewed as row and column numbers in a matrix. + +*spark-itemsimilarity* also extends the notion of cooccurrence to cross-cooccurrence, in other words the Spark version will +account for multi-modal interactions and create indicator matrices allowing the use of much more data in +creating recommendations or similar item lists. People try to do this by mixing different actions and giving them weights. +For instance they might say an item-view is 0.2 of an item purchase. In practice this is often not helpful. Spark-itemsimilarity's +cross-cooccurrence is a more principled way to handle this case. In effect it scrubs secondary actions with the action you want +to recommend. + + + spark-itemsimilarity Mahout 1.0 + Usage: spark-itemsimilarity [options] + + Disconnected from the target VM, address: '127.0.0.1:64676', transport: 'socket' + Input, output options + -i <value> | --input <value> + Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required) + -i2 <value> | --input2 <value> + Secondary input path for cross-similarity calculation, same restrictions as "--input" (optional). Default: empty. + -o <value> | --output <value> + Path for output, any local or HDFS supported URI (required) + + Algorithm control options: + -mppu <value> | --maxPrefs <value> + Max number of preferences to consider per user (optional). 
Default: 500 + -m <value> | --maxSimilaritiesPerItem <value> + Limit the number of similarities per item to this number (optional). Default: 100 + + Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure. + + Input text file schema options: + -id <value> | --inDelim <value> + Input delimiter character (optional). Default: "[,\t]" + -f1 <value> | --filter1 <value> + String (or regex) whose presence indicates a datum for the primary item set (optional). Default: no filter, all data is used + -f2 <value> | --filter2 <value> + String (or regex) whose presence indicates a datum for the secondary item set (optional). If not present no secondary dataset is collected + -rc <value> | --rowIDColumn <value> + Column number (0 based Int) containing the row ID string (optional). Default: 0 + -ic <value> | --itemIDColumn <value> + Column number (0 based Int) containing the item ID string (optional). Default: 1 + -fc <value> | --filterColumn <value> + Column number (0 based Int) containing the filter string (optional). Default: -1 for no filter + + Using all defaults the input is expected of the form: "userID<tab>itemId" or "userID<tab>itemID<tab>any-text..." and all rows will be used + + File discovery options: + -r | --recursive + Searched the -i path recursively for files that match --filenamePattern (optional), Default: false + -fp <value> | --filenamePattern <value> + Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory + + Output text file schema options: + -rd <value> | --rowKeyDelim <value> + Separates the rowID key from the vector values list (optional). Default: "\t" + -cd <value> | --columnIdStrengthDelim <value> + Separates column IDs from their values in the vector values list (optional). Default: ":" + -td <value> | --elementDelim <value> + Separates vector element values in the values list (optional). Default: " " + -os | --omitStrength + Do not write the strength to the output files (optional), Default: false. + This option is used to output indexable data for creating a search engine recommender. + + Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..." + + Spark config options: + -ma <value> | --master <value> + Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]" + -sem <value> | --sparkExecutorMem <value> + Max Java heap available as "executor memory" on each node (optional). Default: 4g + -rs <value> | --randomSeed <value> + + -h | --help + prints this usage text + +This looks daunting but defaults to simple fairly sane values to take exactly the same input as legacy code and is pretty flexible. It allows the user to point to a single text file, a directory full of files, or a tree of directories to be traversed recursively. The files included can be specified with either a regex-style pattern or filename. The schema for the file is defined by column numbers, which map to the important bits of data including IDs and values. The files can even contain filters, which allow unneeded rows to be discarded or used for cross-cooccurrence calculations. + +See ItemSimilarityDriver.scala in Mahout's spark module if you want to customize the code. + +### Defaults in the _**spark-itemsimilarity**_ CLI + +If all defaults are used the input can be as simple as: + + userID1,itemID1 + userID2,itemID2 + ... 
+ +With the command line: + + + bash$ mahout spark-itemsimilarity --input in-file --output out-dir + + +This will use the "local" Spark context and will output the standard text version of a DRM + + itemID1<tab>itemID2:value2<space>itemID10:value10... + +### <a name="multiple-actions">How To Use Multiple User Actions</a> + +Often we record various actions the user takes for later analytics. These can now be used to make recommendations. +The idea of a recommender is to recommend the action you want the user to make. For an ecom app this might be +a purchase action. It is usually not a good idea to just treat other actions the same as the action you want to recommend. +For instance a view of an item does not indicate the same intent as a purchase and if you just mixed the two together you +might even make worse recommendations. It is tempting though since there are so many more views than purchases. With *spark-itemsimilarity* +we can now use both actions. Mahout will use cross-action cooccurrence analysis to limit the views to ones that do predict purchases. +We do this by treating the primary action (purchase) as data for the indicator matrix and use the secondary action (view) +to calculate the cross-cooccurrence indicator matrix. + +*spark-itemsimilarity* can read separate actions from separate files or from a mixed action log by filtering certain lines. For a mixed +action log of the form: + + u1,purchase,iphone + u1,purchase,ipad + u2,purchase,nexus + u2,purchase,galaxy + u3,purchase,surface + u4,purchase,iphone + u4,purchase,galaxy + u1,view,iphone + u1,view,ipad + u1,view,nexus + u1,view,galaxy + u2,view,iphone + u2,view,ipad + u2,view,nexus + u2,view,galaxy + u3,view,surface + u3,view,nexus + u4,view,iphone + u4,view,ipad + u4,view,galaxy + +### Command Line + + +Use the following options: + + bash$ mahout spark-itemsimilarity \ + --input in-file \ # where to look for data + --output out-path \ # root dir for output + --master masterUrl \ # URL of the Spark master server + --filter1 purchase \ # word that flags input for the primary action + --filter2 view \ # word that flags input for the secondary action + --itemIDPosition 2 \ # column that has the item ID + --rowIDPosition 0 \ # column that has the user ID + --filterPosition 1 # column that has the filter word + + + +### Output + +The output of the job will be the standard text version of two Mahout DRMs. 
This is a case where we are calculating +cross-cooccurrence so a primary indicator matrix and cross-cooccurrence indicator matrix will be created + + out-path + |-- similarity-matrix - TDF part files + \-- cross-similarity-matrix - TDF part-files + +The indicator matrix will contain the lines: + + galaxy\tnexus:1.7260924347106847 + ipad\tiphone:1.7260924347106847 + nexus\tgalaxy:1.7260924347106847 + iphone\tipad:1.7260924347106847 + surface + +The cross-cooccurrence indicator matrix will contain: + + iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847 + ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897 + nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897 + galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847 + surface\tsurface:4.498681156950466 nexus:0.6795961471815897 + +**Note:** You can run this multiple times to use more than two actions or you can use the underlying +SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any number of cross-cooccurrence indicators. + +### Log File Input + +A common method of storing data is in log files. If they are written using some delimiter they can be consumed directly by spark-itemsimilarity. For instance input of the form: + + 2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone + 2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad + 2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus + 2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy + 2014-06-23 14:46:53.115\tu3\tpurchase\trandom text\tsurface + 2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tiphone + 2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tgalaxy + 2014-06-23 14:46:53.115\tu1\tview\trandom text\tiphone + 2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad + 2014-06-23 14:46:53.115\tu1\tview\trandom text\tnexus + 2014-06-23 14:46:53.115\tu1\tview\trandom text\tgalaxy + 2014-06-23 14:46:53.115\tu2\tview\trandom text\tiphone + 2014-06-23 14:46:53.115\tu2\tview\trandom text\tipad + 2014-06-23 14:46:53.115\tu2\tview\trandom text\tnexus + 2014-06-23 14:46:53.115\tu2\tview\trandom text\tgalaxy + 2014-06-23 14:46:53.115\tu3\tview\trandom text\tsurface + 2014-06-23 14:46:53.115\tu3\tview\trandom text\tnexus + 2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone + 2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad + 2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy + +Can be parsed with the following CLI and run on the cluster producing the same output as the above example. + + bash$ mahout spark-itemsimilarity \ + --input in-file \ + --output out-path \ + --master spark://sparkmaster:4044 \ + --filter1 purchase \ + --filter2 view \ + --inDelim "\t" \ + --itemIDPosition 4 \ + --rowIDPosition 1 \ + --filterPosition 2 + +## 2. spark-rowsimilarity + +*spark-rowsimilarity* is the companion to *spark-itemsimilarity* the primary difference is that it takes a text file version of +a matrix of sparse vectors with optional application specific IDs and it finds similar rows rather than items (columns). Its use is +not limited to collaborative filtering. The input is in text-delimited form where there are three delimiters used. By +default it reads (rowID<tab>columnID1:strength1<space>columnID2:strength2...) 
Since this job only supports LLR similarity, + which does not use the input strengths, they may be omitted in the input. It writes +(rowID<tab>rowID1:strength1<space>rowID2:strength2...) +The output is sorted by strength descending. The output can be interpreted as a row ID from the primary input followed +by a list of the most similar rows. + +The command line interface is: + + spark-rowsimilarity Mahout 1.0 + Usage: spark-rowsimilarity [options] + + Input, output options + -i <value> | --input <value> + Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required) + -o <value> | --output <value> + Path for output, any local or HDFS supported URI (required) + + Algorithm control options: + -mo <value> | --maxObservations <value> + Max number of observations to consider per row (optional). Default: 500 + -m <value> | --maxSimilaritiesPerRow <value> + Limit the number of similarities per item to this number (optional). Default: 100 + + Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure. + Disconnected from the target VM, address: '127.0.0.1:49162', transport: 'socket' + + Output text file schema options: + -rd <value> | --rowKeyDelim <value> + Separates the rowID key from the vector values list (optional). Default: "\t" + -cd <value> | --columnIdStrengthDelim <value> + Separates column IDs from their values in the vector values list (optional). Default: ":" + -td <value> | --elementDelim <value> + Separates vector element values in the values list (optional). Default: " " + -os | --omitStrength + Do not write the strength to the output files (optional), Default: false. + This option is used to output indexable data for creating a search engine recommender. + + Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..." + + File discovery options: + -r | --recursive + Searched the -i path recursively for files that match --filenamePattern (optional), Default: false + -fp <value> | --filenamePattern <value> + Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory + + Spark config options: + -ma <value> | --master <value> + Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]" + -sem <value> | --sparkExecutorMem <value> + Max Java heap available as "executor memory" on each node (optional). Default: 4g + -rs <value> | --randomSeed <value> + + -h | --help + prints this usage text + +See RowSimilarityDriver.scala in Mahout's spark module if you want to customize the code. + +# 3. Using *spark-rowsimilarity* with Text Data + +Another use case for *spark-rowsimilarity* is in finding similar textual content. For instance given the tags associated with +a blog post, + which other posts have similar tags. In this case the columns are tags and the rows are posts. Since LLR is +the only similarity method supported this is not the optimal way to determine general "bag-of-words" document similarity. +LLR is used more as a quality filter than as a similarity measure. However *spark-rowsimilarity* will produce +lists of similar docs for every doc if input is docs with lists of terms. The Apache [Lucene](http://lucene.apache.org) project provides several methods of [analyzing and tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description) documents. 
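+
+As a sketch, once each document has been reduced to a row of tag or term tokens (docID<tab>list-of-terms), it can be run through the CLI described above; the file names here are placeholders:
+
+    bash$ mahout spark-rowsimilarity \
+        --input doc-terms-file \
+        --output doc-similarity-out-path \
+        --master spark://sparkmaster:4044 \
+        --omitStrength
+
+The output is then a list of the most similar documents for every document, in the same text-delimited form described at the start of this section.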
+ +# <a name="unified-recommender">4. Creating a Unified Recommender</a> + +Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can build a unified cooccurrence and content based + recommender that can be used in both or either mode depending on indicators available and the history available at +runtime for a user. + +## Requirements + +1. Mahout 0.10.0 or later +2. Hadoop +3. Spark, the correct version for your version of Mahout and Hadoop +4. A search engine like Solr or Elasticsearch + +## Indicators + +Indicators come in 3 types + +1. **Cooccurrence**: calculated with *spark-itemsimilarity* from user actions +2. **Content**: calculated from item metadata or content using *spark-rowsimilarity* +3. **Intrinsic**: assigned to items as metadata. Can be anything that describes the item. + +The query for recommendations will be a mix of values meant to match one of your indicators. The query can be constructed +from user history and values derived from context (category being viewed for instance) or special precalculated data +(popularity rank for instance). This blending of indicators allows for creating many flavors or recommendations to fit +a very wide variety of circumstances. + +With the right mix of indicators developers can construct a single query that works for completely new items and new users +while working well for items with lots of interactions and users with many recorded actions. In other words by adding in content and intrinsic +indicators developers can create a solution for the "cold-start" problem that gracefully improves with more user history +and as items have more interactions. It is also possible to create a completely content-based recommender that personalizes +recommendations. + +## Example with 3 Indicators + +You will need to decide how you store user action data so they can be processed by the item and row similarity jobs and +this is most easily done by using text files as described above. The data that is processed by these jobs is considered the +training data. You will need some amount of user history in your recs query. It is typical to use the most recent user history +but need not be exactly what is in the training set, which may include a greater volume of historical data. Keeping the user +history for query purposes could be done with a database by storing it in a users table. In the example above the two +collaborative filtering actions are "purchase" and "view", but let's also add tags (taken from catalog categories or other +descriptive metadata). + +We will need to create 1 cooccurrence indicator from the primary action (purchase) 1 cross-action cooccurrence indicator +from the secondary action (view) +and 1 content indicator (tags). We'll have to run *spark-itemsimilarity* once and *spark-rowsimilarity* once. + +We have described how to create the collaborative filtering indicator and cross-cooccurrence indicator for purchase and view (the [How to use Multiple User +Actions](#multiple-actions) section) but tags will be a slightly different process. We want to use the fact that +certain items have tags similar to the ones associated with a user's purchases. This is not a collaborative filtering indicator +but rather a "content" or "metadata" type indicator since you are not using other users' history, only the +individual that you are making recs for. This means that this method will make recommendations for items that have +no collaborative filtering data, as happens with new items in a catalog. 
New items may have tags assigned but no one + has purchased or viewed them yet. In the final query we will mix all 3 indicators. + +## Content Indicator + +To create a content-indicator we'll make use of the fact that the user has purchased items with certain tags. We want to find +items with the most similar tags. Notice that other users' behavior is not considered--only other item's tags. This defines a +content or metadata indicator. They are used when you want to find items that are similar to other items by using their +content or metadata, not by which users interacted with them. + +For this we need input of the form: + + itemID<tab>list-of-tags + ... + +The full collection will look like the tags column from a catalog DB. For our ecom example it might be: + + 3459860b<tab>men long-sleeve chambray clothing casual + 9446577d<tab>women tops chambray clothing casual + ... + +We'll use *spark-rowimilairity* because we are looking for similar rows, which encode items in this case. As with the +collaborative filtering indicator and cross-cooccurrence indicator we use the --omitStrength option. The strengths created are +probabilistic log-likelihood ratios and so are used to filter unimportant similarities. Once the filtering or downsampling +is finished we no longer need the strengths. We will get an indicator matrix of the form: + + itemID<tab>list-of-item IDs + ... + +This is a content indicator since it has found other items with similar content or metadata. + + 3459860b<tab>3459860b 3459860b 6749860c 5959860a 3434860a 3477860a + 9446577d<tab>9446577d 9496577d 0943577d 8346577d 9442277d 9446577e + ... + +We now have three indicators, two collaborative filtering type and one content type. + +## Unified Recommender Query + +The actual form of the query for recommendations will vary depending on your search engine but the intent is the same. +For a given user, map their history of an action or content to the correct indicator field and perform an OR'd query. + +We have 3 indicators, these are indexed by the search engine into 3 fields, we'll call them "purchase", "view", and "tags". +We take the user's history that corresponds to each indicator and create a query of the form: + + Query: + field: purchase; q:user's-purchase-history + field: view; q:user's view-history + field: tags; q:user's-tags-associated-with-purchases + +The query will result in an ordered list of items recommended for purchase but skewed towards items with similar tags to +the ones the user has already purchased. + +This is only an example and not necessarily the optimal way to create recs. It illustrates how business decisions can be +translated into recommendations. This technique can be used to skew recommendations towards intrinsic indicators also. +For instance you may want to put personalized popular item recs in a special place in the UI. Create a popularity indicator +by tagging items with some category of popularity (hot, warm, cold for instance) then +index that as a new indicator field and include the corresponding value in a query +on the popularity field. If we use the ecom example but use the query to get "hot" recommendations it might look like this: + + Query: + field: purchase; q:user's-purchase-history + field: view; q:user's view-history + field: popularity; q:"hot" + +This will return recommendations favoring ones that have the intrinsic indicator "hot". + +## Notes +1. Use as much user action history as you can gather. 
Choose a primary action that is closest to what you want to recommend and the others will be used to create cross-cooccurrence indicators. Using more data in this fashion will almost always produce better recommendations.
+2. Content can be used where there is no recorded user behavior or when items change too quickly to get much interaction history. They can be used alone or mixed with other indicators.
+3. Most search engines support "boost" factors so you can favor one or more indicators. In the example query, if you want tags to only have a small effect you could boost the CF indicators.
+4. In the examples we have used space-delimited strings for lists of IDs in indicators and in queries. It may be better to use arrays of strings if your storage system and search engine support them. For instance Solr allows multi-valued fields, which correspond to arrays.
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/reccomenders/d-als.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/reccomenders/d-als.md b/website-old/docs/algorithms/reccomenders/d-als.md
new file mode 100644
index 0000000..b5fe1b0
--- /dev/null
+++ b/website-old/docs/algorithms/reccomenders/d-als.md
@@ -0,0 +1,58 @@
+---
+layout: algorithm
+title: Mahout Samsara Distributed ALS
+theme:
+  name: mahout2
+---
+Seems like someone has jacked up this page?
+TODO: Find the ALS Page
+
+## Intro
+
+Mahout has a distributed implementation of QR decomposition for tall, thin matrices [1].
+
+## Algorithm
+
+For the classic QR decomposition of the form `\(\mathbf{A}=\mathbf{QR},\mathbf{A}\in\mathbb{R}^{m\times n}\)` a distributed version is fairly easily achieved if `\(\mathbf{A}\)` is tall and thin such that `\(\mathbf{A}^{\top}\mathbf{A}\)` fits in memory, i.e. *m* is large but *n* < ~5000. Under such circumstances, only `\(\mathbf{A}\)` and `\(\mathbf{Q}\)` are distributed matrices and `\(\mathbf{A^{\top}A}\)` and `\(\mathbf{R}\)` are in-core products. We just compute the in-core version of the Cholesky decomposition in the form of `\(\mathbf{LL}^{\top}= \mathbf{A}^{\top}\mathbf{A}\)`. After that we take `\(\mathbf{R}= \mathbf{L}^{\top}\)` and `\(\mathbf{Q}=\mathbf{A}\left(\mathbf{L}^{\top}\right)^{-1}\)`. The latter is easily achieved by multiplying each vertical block of `\(\mathbf{A}\)` by `\(\left(\mathbf{L}^{\top}\right)^{-1}\)`. (There is no actual matrix inversion happening).
+
+
+## Implementation
+
+Mahout `dqrThin(...)` is implemented in the Mahout `math-scala` algebraic optimizer, which translates Mahout's R-like linear algebra operators into a physical plan for both the Spark and H2O distributed engines.
+
+    def dqrThin[K: ClassTag](drmA: DrmLike[K], checkRankDeficiency: Boolean = true): (DrmLike[K], Matrix) = {
+      if (drmA.ncol > 5000)
+        log.warn("A is too fat. A'A must fit in memory and easily broadcasted.")
+      implicit val ctx = drmA.context
+      val AtA = (drmA.t %*% drmA).checkpoint()
+      val inCoreAtA = AtA.collect
+      val ch = chol(inCoreAtA)
+      val inCoreR = (ch.getL cloned) t
+      if (checkRankDeficiency && !ch.isPositiveDefinite)
+        throw new IllegalArgumentException("R is rank-deficient.")
+      val bcastAtA = drmBroadcast(inCoreAtA)
+      val Q = drmA.mapBlock() {
+        case (keys, block) => keys -> chol(bcastAtA).solveRight(block)
+      }
+      Q -> inCoreR
+    }
+
+
+## Usage
+
+The Scala `dqrThin(...)` method can easily be called in any Spark or H2O application built with the `math-scala` library and the corresponding `Spark` or `H2O` engine module as follows:
+
+    import org.apache.mahout.math._
+    import decompositions._
+    import drm._
+
+    val (drmQ, inCoreR) = dqrThin(drmA)
+
+
+## References
+
+[1]: [Mahout Scala and Mahout Spark Bindings for Linear Algebra Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf)
+
+[2]: [Mahout Spark and Scala Bindings](http://mahout.apache.org/users/sparkbindings/home.html)
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/reccomenders/index.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/reccomenders/index.md b/website-old/docs/algorithms/reccomenders/index.md
new file mode 100644
index 0000000..00d8ec4
--- /dev/null
+++ b/website-old/docs/algorithms/reccomenders/index.md
@@ -0,0 +1,33 @@
+---
+layout: algorithm
+title: Recommender Quickstart
+theme:
+  name: retro-mahout
+---
+
+
+# Recommender Overview
+
+Recommenders have changed over the years. Mahout contains a long list of them, which you can still use. But to get the best out of our more modern approach we'll need to think of the Recommender as a "model creation" component, supplied by Mahout's new spark-itemsimilarity job, and a "serving" component, supplied by a modern scalable search engine, like Solr.
+
+
+
+To integrate with your application you will collect user interactions, storing them in a DB and also in a form usable by Mahout. The simplest way to do this is to log user interactions to csv files (user-id, item-id). The DB should be set up to contain the last n user interactions, which will form part of the query for recommendations.
+
+Mahout's spark-itemsimilarity will create a table of (item-id, list-of-similar-items) in csv form. Think of this as an item collection with one field containing the item-ids of similar items. Index this with your search engine.
+
+When your application needs recommendations for a specific person, get the latest user history of interactions from the DB and query the indicator collection with this history. You will get back an ordered list of item-ids. These are your recommendations. You may wish to filter out any that the user has already seen, but that will depend on your use case.
+
+All ids for users and items are preserved as string tokens and so work as an external key in DBs or as doc ids for search engines; they also work as tokens for search queries.
+
+##References
+
+1. A free ebook, which talks about the general idea: [Practical Machine Learning](https://www.mapr.com/practical-machine-learning)
+2. A slide deck, which talks about mixing actions or other indicators: [Creating a Multimodal Recommender with Mahout and a Search Engine](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
+3.
Two blog posts: [What's New in Recommenders: part #1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/)
+and [What's New in Recommenders: part #2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/)
+4. A post describing the log-likelihood ratio: [Surprise and Coincidence](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html) LLR is used to reduce noise in the data while keeping the calculations at O(n) complexity.
+
+##Mahout Model Creation
+
+See the page describing [*spark-itemsimilarity*](http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html) for more details.
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/regression/fittness-tests.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/regression/fittness-tests.md b/website-old/docs/algorithms/regression/fittness-tests.md
new file mode 100644
index 0000000..1bd8984
--- /dev/null
+++ b/website-old/docs/algorithms/regression/fittness-tests.md
@@ -0,0 +1,17 @@
+---
+layout: algorithm
+title: Regression Fitness Tests
+theme:
+  name: mahout2
+---
+
+TODO: Fill this out!
+Stub
+
+### About
+
+### Parameters
+
+### Example
+
+
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/regression/index.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/regression/index.md b/website-old/docs/algorithms/regression/index.md
new file mode 100644
index 0000000..3639c4f
--- /dev/null
+++ b/website-old/docs/algorithms/regression/index.md
@@ -0,0 +1,21 @@
+---
+layout: algorithm
+
+title: Regression Algorithms
+theme:
+  name: retro-mahout
+---
+
+Apache Mahout implements the following regression algorithms "off the shelf".
+
+### Closed Form Solutions
+
+These methods use closed-form solutions (not stochastic) to solve regression problems:
+
+[Ordinary Least Squares](ols.html)
+
+### Autocorrelation Regression
+
+Serial correlation of the error terms can lead to biased estimates of regression parameters; the following remedial procedures are provided:
+
+[Cochrane Orcutt Procedure](cochrane-orcutt.html)
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/regression/ols.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/regression/ols.md b/website-old/docs/algorithms/regression/ols.md
new file mode 100644
index 0000000..58c74c5
--- /dev/null
+++ b/website-old/docs/algorithms/regression/ols.md
@@ -0,0 +1,66 @@
+---
+layout: algorithm
+title: Ordinary Least Squares Regression
+theme:
+  name: mahout2
+---
+
+### About
+
+The `OrdinaryLeastSquares` regressor in Mahout implements a _closed-form_ solution to [Ordinary Least Squares](https://en.wikipedia.org/wiki/Ordinary_least_squares).
+This is in stark contrast to many "big data machine learning" frameworks which implement a _stochastic_ approach. From the user's perspective this difference can be reduced to:
+
+- **_Stochastic_** - A series of guesses at a line of best fit.
+- **_Closed Form_** - A mathematical approach that has been explored in depth: the properties of the parameter estimates are well understood, and the problems which arise (and their remedial measures) are known. This is usually the preferred choice of mathematicians/statisticians, but computational limitations have forced many implementations to resort to SGD. (The normal equations behind the closed-form approach are sketched below.)
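+
+As a sketch of what the closed-form approach computes (standard least-squares algebra rather than an excerpt from the Mahout code), with an intercept column appended to \(\mathbf{X}\) the estimates are obtained directly from the normal equations
+
+<center>\(\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}\)</center>
+
+rather than being approached iteratively as in SGD.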
+
+### Parameters
+
+<div class="table-striped">
+  <table class="table">
+    <tr>
+        <th>Parameter</th>
+        <th>Description</th>
+        <th>Default Value</th>
+    </tr>
+    <tr>
+        <td><code>'calcCommonStatistics</code></td>
+        <td>Calculate common statistics such as the Coefficient of Determination and the Mean Square Error</td>
+        <td><code>true</code></td>
+    </tr>
+    <tr>
+        <td><code>'calcStandardErrors</code></td>
+        <td>Calculate the standard errors (and subsequent "t-scores" and "p-values") of the \(\boldsymbol{\beta}\) estimates</td>
+        <td><code>true</code></td>
+    </tr>
+    <tr>
+        <td><code>'addIntercept</code></td>
+        <td>Add an intercept to \(\mathbf{X}\)</td>
+        <td><code>true</code></td>
+    </tr>
+  </table>
+</div>
+
+### Example
+
+In this example we disable the "calculate common statistics" parameter, so our summary will NOT contain the coefficient of determination (R-squared) or the Mean Square Error.
+```scala
+import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares
+
+val drmData = drmParallelize(dense(
+  (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
+  (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
+  (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
+  (2, 1, 11,   13, 32.207582),  // Froot Loops
+  (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
+  (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
+  (6, 2, 17,   1,  50.764999),  // Cheerios
+  (3, 2, 13,   7,  40.400208),  // Clusters
+  (3, 3, 13,   4,  45.811716)), numPartitions = 2)
+
+
+val drmX = drmData(::, 0 until 4)
+val drmY = drmData(::, 4 until 5)
+
+val model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY, 'calcCommonStatistics -> false)
+println(model.summary)
+```
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/regression/serial-correlation/cochrane-orcutt.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/regression/serial-correlation/cochrane-orcutt.md b/website-old/docs/algorithms/regression/serial-correlation/cochrane-orcutt.md
new file mode 100644
index 0000000..b155ca8
--- /dev/null
+++ b/website-old/docs/algorithms/regression/serial-correlation/cochrane-orcutt.md
@@ -0,0 +1,146 @@
+---
+layout: algorithm
+title: Cochrane-Orcutt Procedure
+theme:
+  name: mahout2
+---
+
+### About
+
+The [Cochrane Orcutt](https://en.wikipedia.org/wiki/Cochrane%E2%80%93Orcutt_estimation) procedure is used in economics to
+adjust a linear model for serial correlation in the error term.
+
+The corresponding method in R is [`cochrane.orcutt`](https://cran.r-project.org/web/packages/orcutt/orcutt.pdf);
+however, the implementation differs slightly.
+
+#### R Prototype:
+    library(orcutt)
+
+    df = data.frame(t(data.frame(
+        c(20.96, 127.3),
+        c(21.40, 130.0),
+        c(21.96, 132.7),
+        c(21.52, 129.4),
+        c(22.39, 135.0),
+        c(22.76, 137.1),
+        c(23.48, 141.2),
+        c(23.66, 142.8),
+        c(24.10, 145.5),
+        c(24.01, 145.3),
+        c(24.54, 148.3),
+        c(24.30, 146.4),
+        c(25.00, 150.2),
+        c(25.64, 153.1),
+        c(26.36, 157.3),
+        c(26.98, 160.7),
+        c(27.52, 164.2),
+        c(27.78, 165.6),
+        c(28.24, 168.7),
+        c(28.78, 171.7))))
+
+    rownames(df) <- NULL
+    colnames(df) <- c("y", "x")
+    my_lm = lm(y ~ x, data=df)
+    coch = cochrane.orcutt(my_lm)
+
+
+The R implementation is kind of...silly.
+
+The above works- it converges at 318 iterations- the transformed DW is 1.72, yet the rho is
+.95882. After 318 iterations, this will also report a rho of .95882 (which suggests SEVERE
+autocorrelation, nothing close to 1.72).
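+Plugging that rho into the approximate relation \(D \approx 2(1-\rho)\) quoted in the notes below makes the point concrete (a rough sanity check, not output from R):
+
+<center>\(D \approx 2(1 - 0.95882) \approx 0.08\)</center>
+
+which is nowhere near the reported transformed DW of 1.72.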
+ + At any rate, the real prototype for this is the example from [Applied Linear Statistical Models, + 5th Edition, by Kutner, Nachtsheim, Neter, and Li](https://www.amazon.com/Applied-Linear-Statistical-Models-Hardcover/dp/B010EWX85C/ref=sr_1_4?ie=UTF8&qid=1493847480&sr=8-4&keywords=applied+linear+statistical+models+5th+edition). + +Steps (an illustrative sketch of one iteration follows the notes below): +1. Fit a normal regression. +2. Estimate <foo>\(\rho\)</foo> from the residuals. +3. Fit the transformed equation to get new estimates of the betas. +4. Use the betas from (3) to recalculate the model from (1). +5. Repeat steps 2 through 4 until a stopping criterion is met. Some implementations call for convergence; +Kutner et al. recommend 3 iterations, and if you don't achieve the desired results, use an alternative method. + +#### Some additional notes from Applied Linear Statistical Models: + They also provide some interesting notes on p. 494: + + 1. "Cochrane-Orcutt does not always work properly. A major reason is that when the error terms + are positively autocorrelated, the estimate <foo>\(r\)</foo> in (12.22) tends to underestimate the autocorrelation + parameter <foo>\(\rho\)</foo>. When this bias is serious, it can significantly reduce the effectiveness of the + Cochrane-Orcutt approach." + 1. "There exists an approximate relation between the [Durbin Watson test statistic](dw-test.html) <foo>\(\mathbf{D}\)</foo> in (12.14) + and the estimated autocorrelation parameter <foo>\(r\)</foo> in (12.22): + <center>\(D \approx 2(1 - r)\)</center>" + + They also note on p. 492: + "... If the process does not terminate after one or two iterations, a different procedure + should be employed." + This differs from the logic found elsewhere, and from the method presented in R, where, in the simple + example in the prototype, the procedure runs for 318 iterations. This is why the default + maximum number of iterations is 3, and it should be left as such. + + Also, the prototype and 'correct answers' are based on the example presented in Kutner et al. on + pp. 492-494 (including the dataset).
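To make steps 2 through 4 concrete, here is a minimal, illustrative sketch of a single Cochrane-Orcutt iteration written in plain Scala on arrays, deliberately independent of the Mahout API. The function name and the assumption that the step-1 residuals are already available are purely for illustration; for real work use the `CochraneOrcutt` fitter shown in the example below.

```scala
// Illustration only: one Cochrane-Orcutt iteration on plain Scala arrays.
// y and x are the original series; e holds the residuals of the step-1 regression.
def cochraneOrcuttStep(y: Array[Double], x: Array[Double], e: Array[Double])
    : (Double, Array[Double], Array[Double]) = {
  val n = y.length
  // Step 2: estimate rho from the lagged residuals,
  //   r = sum_{t=2..n} e_t * e_{t-1}  /  sum_{t=2..n} e_{t-1}^2
  val rho = (1 until n).map(t => e(t) * e(t - 1)).sum /
            (1 until n).map(t => e(t - 1) * e(t - 1)).sum
  // Steps 3-4: build the transformed series
  //   y'_t = y_t - rho * y_{t-1},   x'_t = x_t - rho * x_{t-1}
  // and refit ordinary least squares on (x', y') to obtain the new betas.
  val yPrime = (1 until n).map(t => y(t) - rho * y(t - 1)).toArray
  val xPrime = (1 until n).map(t => x(t) - rho * x(t - 1)).toArray
  (rho, xPrime, yPrime)
}
```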
+ + +### Parameters + + +<div class="table-striped"> + <table class="table"> + <tr> + <th>Parameter</th> + <th>Description</th> + <th>Default Value</th> + </tr> + <tr> + <td><code>'regressor</code></td> + <td>Any subclass of <code>org.apache.mahout.math.algorithms.regression.LinearRegressorFitter</code></td> + <td><code>OrdinaryLeastSquares()</code></td> + </tr> + <tr> + <td><code>'iterations</code></td> + <td>The maximum number of iterations. Unlike the R implementation, we stick to the 3-iteration guidance of Kutner et al.</td> + <td>3</td> + </tr> + <tr> + <td><code>'cacheHint</code></td> + <td>The DRM cache hint to use when holding the data in memory between iterations</td> + <td><code>CacheHint.MEMORY_ONLY</code></td> + </tr> + </table> +</div> + +### Example + +
    import org.apache.mahout.math.algorithms.regression.CochraneOrcutt

    val alsmBlaisdellCo = drmParallelize( dense(
      (20.96, 127.3),
      (21.40, 130.0),
      (21.96, 132.7),
      (21.52, 129.4),
      (22.39, 135.0),
      (22.76, 137.1),
      (23.48, 141.2),
      (23.66, 142.8),
      (24.10, 145.5),
      (24.01, 145.3),
      (24.54, 148.3),
      (24.30, 146.4),
      (25.00, 150.2),
      (25.64, 153.1),
      (26.36, 157.3),
      (26.98, 160.7),
      (27.52, 164.2),
      (27.78, 165.6),
      (28.24, 168.7),
      (28.78, 171.7) ))

    val drmY = alsmBlaisdellCo(::, 0 until 1)
    val drmX = alsmBlaisdellCo(::, 1 until 2)

    var coModel = new CochraneOrcutt[Int]().fit(drmX, drmY, ('iterations -> 2))

    println(coModel.rhos)
    println(coModel.summary)
 http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/regression/serial-correlation/dw-test.md ---------------------------------------------------------------------- diff --git a/website-old/docs/algorithms/regression/serial-correlation/dw-test.md b/website-old/docs/algorithms/regression/serial-correlation/dw-test.md new file mode 100644 index 0000000..7bdd896 --- /dev/null +++ b/website-old/docs/algorithms/regression/serial-correlation/dw-test.md @@ -0,0 +1,43 @@ +--- +layout: algorithm +title: Durbin-Watson Test +theme: + name: mahout2 +--- + +### About + +The [Durbin Watson Test](https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic) is a test for serial correlation +in the error terms. The Durbin Watson test statistic <foo>\(d\)</foo> can take values between 0 and 4, and in general + +- <foo>\(d \lt 1.5 \)</foo> implies positive autocorrelation +- <foo>\(d \gt 2.5 \)</foo> implies negative autocorrelation +- <foo>\(1.5 \lt d \lt 2.5 \)</foo> implies no autocorrelation. + +The implementation is based on the `durbinWatsonTest` function in the [`car`](https://cran.r-project.org/web/packages/car/index.html) package in R. + +### Parameters + +### Example + +#### R Prototype + + library(car) + residuals <- seq(0, 4.9, 0.1) + ## perform Durbin-Watson test + durbinWatsonTest(residuals) + +#### In Apache Mahout + + + // A Durbin-Watson test must be performed on a model; which model does not matter.
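    // For reference, the statistic is computed from the residuals e_t of that model as
    //   d = sum_{t=2..n} (e_t - e_{t-1})^2  /  sum_{t=1..n} e_t^2
    // (this is the general definition; values of d near 2 indicate no serial correlation).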
    import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares
    import org.apache.mahout.math.algorithms.regression.tests.AutocorrelationTests

    val err1 = drmParallelize( dense((0.0 until 5.0 by 0.1).toArray) ).t
    val drmX = drmParallelize( dense((0 until 50).toArray.map( t => Math.pow(-1.0, t)) ) ).t
    val drmY = drmX + err1 + 1
    var model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY)
    // end arbitrary model

    val syntheticResiduals = err1
    model = AutocorrelationTests.DurbinWatson(model, syntheticResiduals)
    val myAnswer: Double = model.testResults.getOrElse('durbinWatsonTestStatistic, -1.0).asInstanceOf[Double]
 http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/template.md ---------------------------------------------------------------------- diff --git a/website-old/docs/algorithms/template.md b/website-old/docs/algorithms/template.md new file mode 100644 index 0000000..4a48829 --- /dev/null +++ b/website-old/docs/algorithms/template.md @@ -0,0 +1,20 @@ +--- +layout: algorithm +title: AsFactor +theme: + name: mahout2 +--- + +TODO: Fill this out! +Stub + +### About + +### Parameters + +### Example + + + + + http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/assets ---------------------------------------------------------------------- diff --git a/website-old/docs/assets b/website-old/docs/assets new file mode 120000 index 0000000..ec2e4be --- /dev/null +++ b/website-old/docs/assets @@ -0,0 +1 @@ +../assets \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/changelog.md ---------------------------------------------------------------------- diff --git a/website-old/docs/changelog.md b/website-old/docs/changelog.md new file mode 100755 index 0000000..7965e9d --- /dev/null +++ b/website-old/docs/changelog.md @@ -0,0 +1,70 @@ +## Changelog + +Public releases are all root nodes. +Incremental version bumps that were not released publicly are nested where appropriate. + +P.S. If there is a standard (popular) changelog format, please let me know. + +- **0.3.0 : 2013.02.24** + - **Features** + - Update twitter bootstrap to 2.2.2. Add responsiveness and update design a bit. + - @techotaku fixes custom tagline support (finally made it in!) + - @opie4624 adds ability to set tags from the command-line. + - @lax adds support for RSS feed. Adds rss and atom html links for discovery. + - Small typo fixes. + + - **Bug Fixes** + - @xuhdev fixes theme:install bug which does not overwrite theme even if saying 'yes'. + +- **0.2.13 : 2012.03.24** + - **Features** + - 0.2.13 : @mjpieters Updates pages_list helper to only show pages having a title. + - 0.2.12 : @sway recommends showing page tagline only if tagline is set. + - 0.2.11 : @LukasKnuth adds 'description' meta-data field to post/page scaffold. + + - **Bug Fixes** + - 0.2.10 : @koriroys fixes typo in atom feed + +- **0.2.9 : 2012.03.01** + - **Bug Fixes** + - 0.2.9 : @alishutc Fixes the error on post creation if date was not specified. + +- **0.2.8 : 2012.03.01** + - **Features** + - 0.2.8 : @metalelf0 Added option to specify a custom date when creating post. + - 0.2.7 : @daz Updates twitter theme framework to use 2.x while still maintaining core layout. #50 + @philips and @treggats add support for page.tagline metadata. #31 & #48 + - 0.2.6 : @koomar Adds Mixpanel analytics provider. #49 + - 0.2.5 : @nolith Adds ability to load custom rake scripts. #33 + - 0.2.4 : @tommyblue Updated disqus comments provider to be compatible with posts imported from Wordpress.
#47 + + - **Bug Fixes** + - 0.2.3 : @3martini Adds Windows MSYS Support and error checks for git system calls. #40 + - 0.2.2 : @sstar Resolved an issue preventing disabling comments for individual pages #44 + - 0.2.1 : Resolve incorrect HOME\_PATH/BASE\_PATH settings + +- **0.2.0 : 2012.02.01** + Features + - Add Theme Packages v 0.1.0 + All themes should be tracked and maintained outside of JB core. + Themes get "installed" via the Theme Installer. + Theme Packages versioning is done separately from JB core with + the main intent being to make sure theme versions are compatible with the given installer. + + - 0.1.2 : @jamesFleeting adds facebook comments support + - 0.1.1 : @SegFaultAX adds tagline as site-wide configuration + +- **0.1.0 : 2012.01.24** + First major versioned release. + Features + - Standardize Public API + - Use name-spacing and modulation where possible. + - Ability to override public methods with custom code. + - Publish the theme API. + - Ship with comments, analytics integration. + +- **0.0.1 : 2011.12.30** + First public release, lots of updates =p + Thank you everybody for dealing with the fast changes and helping + me work out the API to a manageable state. + http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/distributed/flink-bindings.md ---------------------------------------------------------------------- diff --git a/website-old/docs/distributed/flink-bindings.md b/website-old/docs/distributed/flink-bindings.md new file mode 100644 index 0000000..4a39e56 --- /dev/null +++ b/website-old/docs/distributed/flink-bindings.md @@ -0,0 +1,49 @@ +--- +layout: page +title: Mahout Samsara Flink Bindings +theme: + name: mahout2 +--- +# Introduction + +This document provides an overview of how the Mahout Samsara environment is implemented over the Apache Flink backend engine, and of the code layout for that engine; the source code can be found under the /flink directory in the Mahout codebase. + +Apache Flink is a distributed big data streaming engine that supports both Streaming and Batch interfaces. Batch processing is an extension of Flink's Stream processing engine. + +The Mahout Flink integration presently supports Flink's batch processing capabilities, leveraging the DataSet API. + +The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large matrix of numbers in-memory in a cluster by distributing logical rows among servers. Mahout's scala DSL provides an abstract API on DRMs for backend engines to provide implementations of this API. An example is the Spark backend engine. Each engine has its own design of mapping the abstract API onto its data model and provides implementations for algebraic operators over that mapping. + +# Flink Overview + +Apache Flink is an open source, distributed Stream and Batch Processing Framework. At its core, Flink is a Stream Processing engine, and Batch processing is an extension of Stream Processing.
+ +Flink includes several APIs for building applications with the Flink Engine: + + <ol> +<li><b>DataSet API</b> for Batch data in Java, Scala and Python</li> +<li><b>DataStream API</b> for Stream Processing in Java and Scala</li> +<li><b>Table API</b> with a SQL-like expression language in Java and Scala</li> +<li><b>Gelly</b> Graph Processing API in Java and Scala</li> +<li><b>CEP API</b>, a complex event processing library</li> +<li><b>FlinkML</b>, a Machine Learning library</li> +</ol> + +# Flink Environment Engine + +The Flink backend implements the abstract DRM as a Flink DataSet. A Flink job runs in the context of an ExecutionEnvironment (from the Flink Batch processing API). + +# Source Layout + +Within mahout.git, the top-level directory flink/ holds all the source code for the Flink backend engine. Sections of code that interface with the rest of the Mahout components are in Scala, and sections of the code that interface with the Flink DataSet API and implement algebraic operators are in Java. Here is a brief overview of what functionality can be found within the flink/ folder. + +flink/ - top level directory containing all Flink related code + +flink/src/main/scala/org/apache/mahout/flinkbindings/blas/*.scala - Physical operator code for the Samsara DSL algebra + +flink/src/main/scala/org/apache/mahout/flinkbindings/drm/*.scala - Flink DataSet DRM and broadcast implementation + +flink/src/main/scala/org/apache/mahout/flinkbindings/io/*.scala - Read / Write between DRMDataSet and files on HDFS + +flink/src/main/scala/org/apache/mahout/flinkbindings/FlinkEngine.scala - DSL operator graph evaluator and various abstract API implementations for a distributed engine. + + http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/distributed/h2o-internals.md ---------------------------------------------------------------------- diff --git a/website-old/docs/distributed/h2o-internals.md b/website-old/docs/distributed/h2o-internals.md new file mode 100644 index 0000000..2388354 --- /dev/null +++ b/website-old/docs/distributed/h2o-internals.md @@ -0,0 +1,50 @@ +--- +layout: page +title: Mahout Samsara H2O Bindings +theme: + name: mahout2 +--- +# Introduction + +This document provides an overview of how the Mahout Samsara environment is implemented over the H2O backend engine. The document is aimed at Mahout developers, to give a high level description of the design so that one can explore the code inside `h2o/` with some context. + +## H2O Overview + +H2O is a distributed, scalable machine learning system. The internal architecture of H2O has a distributed math engine (h2o-core) and a separate layer on top for algorithms and UI. The Mahout integration requires only the math engine (h2o-core). + +## H2O Data Model + +The data model of the H2O math engine is a distributed columnar store (of primarily numbers, but also strings). A column of numbers is called a Vector, which is broken into Chunks (of a few thousand elements). Chunks are distributed across the cluster based on a deterministic hash. Therefore, any member of the cluster knows where a particular Chunk of a Vector is homed. Each Chunk is separately compressed in memory and elements are individually decompressed on the fly upon access with purely register operations (thereby achieving high memory throughput). An ordered set of similarly partitioned Vectors is composed into a Frame. A Frame is therefore a large two dimensional table of numbers.
All elements of a logical row in the Frame are guaranteed to be homed in the same server of the cluster. Generally speaking, H2O works well on "tall skinny" data, i.e., lots of rows (100s of millions) and a modest number of columns (10s of thousands). + + +## Mahout DRM + +The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large matrix of numbers in-memory in a cluster by distributing logical rows among servers. Mahout's scala DSL provides an abstract API on DRMs for backend engines to provide implementations of this API. Examples are the Spark and H2O backend engines. Each engine has its own design of mapping the abstract API onto its data model and provides implementations for algebraic operators over that mapping. + + +## H2O Environment Engine + +The H2O backend implements the abstract DRM as an H2O Frame. Each logical column in the DRM is an H2O Vector. All elements of a logical DRM row are guaranteed to be homed on the same server. A set of rows stored on a server is presented as a read-only virtual in-core Matrix (i.e., a BlockMatrix) to the closure method in the `mapBlock(...)` API. + +H2O provides a flexible execution framework called `MRTask`. The `MRTask` framework typically executes over a Frame (or even a Vector), supports various types of map() methods, can optionally modify the Frame or Vector (though this never happens in the Mahout integration), and can optionally create a new Vector or set of Vectors (to combine them into a new Frame, and consequently a new DRM). + + +## Source Layout + +Within mahout.git, the top-level directory `h2o/` holds all the source code related to the H2O backend engine. Part of the code (that interfaces with the rest of the Mahout components) is in Scala, and part of the code (that interfaces with h2o-core and implements algebraic operators) is in Java. Here is a brief overview of what functionality can be found where within `h2o/`. + + h2o/ - top level directory containing all H2O related code + + h2o/src/main/java/org/apache/mahout/h2obindings/ops/*.java - Physical operator code for the various DSL algebra + + h2o/src/main/java/org/apache/mahout/h2obindings/drm/*.java - DRM backing (onto Frame) and Broadcast implementation + + h2o/src/main/java/org/apache/mahout/h2obindings/H2OHdfs.java - Read / Write between DRM (Frame) and files on HDFS + + h2o/src/main/java/org/apache/mahout/h2obindings/H2OBlockMatrix.java - A vertical block matrix of the DRM presented as a virtual copy-on-write in-core Matrix. Used in the mapBlock() API + + h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java - A collection of various functionality and helpers, e.g., converting between in-core Matrix and DRM, and various summary statistics on DRM/Frame.
+ + h2o/src/main/scala/org/apache/mahout/h2obindings/H2OEngine.scala - DSL operator graph evaluator and various abstract API implementations for a distributed engine + + h2o/src/main/scala/org/apache/mahout/h2obindings/* - Various abstract API implementations ("glue work") \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx ---------------------------------------------------------------------- diff --git a/website-old/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx b/website-old/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx new file mode 100644 index 0000000..ec1de04 Binary files /dev/null and b/website-old/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx differ
