http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/canopy-commandline.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/clustering/canopy-commandline.md b/website/docs/tutorials/map-reduce/clustering/canopy-commandline.md new file mode 100644 index 0000000..cbaa81d --- /dev/null +++ b/website/docs/tutorials/map-reduce/clustering/canopy-commandline.md @@ -0,0 +1,70 @@ +--- +layout: mr_tutorial +title: canopy-commandline +theme: + name: retro-mahout +--- + +<a name="canopy-commandline-RunningCanopyClusteringfromtheCommandLine"></a> +# Running Canopy Clustering from the Command Line +Mahout's Canopy clustering can be launched from the same command line +invocation whether you are running on a single machine in stand-alone mode +or on a larger Hadoop cluster. The difference is determined by the +$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to +an operating Hadoop cluster on the target machine then the invocation will +run Canopy on that cluster. If either of the environment variables is +missing then the stand-alone Hadoop configuration will be invoked instead. + + + ./bin/mahout canopy <OPTIONS> + + +* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job +will be generated in $MAHOUT_HOME/core/target/ and its name will contain +the Mahout version number. 
For example, when using the Mahout 0.3 release, the +job will be mahout-core-0.3.job + + +<a name="canopy-commandline-Testingitononesinglemachinew/ocluster"></a> +## Testing it on one single machine w/o cluster + +* Put the data: cp <PATH TO DATA> testdata +* Run the Job: + + ./bin/mahout canopy -i testdata -o output -dm +org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2 + + +<a name="canopy-commandline-Runningitonthecluster"></a> +## Running it on the cluster + +* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh +* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata +* Run the Job: + + export HADOOP_HOME=<Hadoop Home Directory> + export HADOOP_CONF_DIR=$HADOOP_HOME/conf + ./bin/mahout canopy -i testdata -o output -dm +org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2 + +* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output +to view all outputs. + +<a name="canopy-commandline-Commandlineoptions"></a> +# Command line options + + --input (-i) input Path to job input directory. Must + be a SequenceFile of + VectorWritable + --output (-o) output The directory pathname for output. + --overwrite (-ow) If present, overwrite the output + directory before running job + --distanceMeasure (-dm) distanceMeasure The classname of the + DistanceMeasure. Default is + SquaredEuclidean + --t1 (-t1) t1 T1 threshold value + --t2 (-t2) t2 T2 threshold value + --clustering (-cl) If present, run clustering after + the iterations have taken place + --help (-h) Print out help +
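The rule stated at the top of this page (cluster mode only when both $HADOOP_HOME and $HADOOP_CONF_DIR are set, stand-alone otherwise) can be sketched in shell as follows. This is only an illustration: the function name `mahout_mode` is hypothetical and not part of Mahout, and the real `bin/mahout` script may apply further checks.

```shell
# Hypothetical helper (not part of Mahout): report which mode the
# invocation would be expected to use, following the rule described above.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    # Both variables are set: the job is expected to run on the cluster.
    echo "cluster"
  else
    # At least one variable is missing: stand-alone configuration.
    echo "stand-alone"
  fi
}

mahout_mode
```

Running `mahout_mode` before `./bin/mahout canopy ...` is a quick way to check which environment the current shell is set up for.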
http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md b/website/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md new file mode 100644 index 0000000..abd2d3e --- /dev/null +++ b/website/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md @@ -0,0 +1,53 @@ +--- +layout: mr_tutorial +title: Clustering of synthetic control data +theme: + name: retro-mahout +--- + +# Clustering synthetic control data + +## Introduction + +This example will demonstrate clustering of time series data, specifically control charts. [Control charts](http://en.wikipedia.org/wiki/Control_chart) are tools used to determine whether a manufacturing or business process is in a state of statistical control. Such control charts are generated / simulated repeatedly at equal time intervals. A [simulated dataset](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html) is available in the UCI Machine Learning Repository. + +The control chart time series need to be clustered into closely knit groups. The data set we use is synthetic and is meant to resemble real-world information in an anonymized format. It contains six different classes: Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift. In this example we will use Mahout to cluster the data into corresponding class buckets. + +*For the sake of simplicity, we won't use a cluster in this example, but instead show you the commands to run the clustering examples locally with Hadoop*. + +## Setup + +We need to do some initial setup before we are able to run the example. + + + 1. 
Start out by downloading the dataset to be clustered from the UCI Machine Learning Repository: [http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data). + + 2. Download the [latest release of Mahout](/general/downloads.html). + + 3. Unpack the release binary and switch to the *mahout-distribution-0.x* folder + + 4. Make sure that the *JAVA_HOME* environment variable points to your local java installation + + 5. Create a folder called *testdata* in the current directory and copy the dataset into this folder. + + +## Clustering Examples + +Depending on the clustering algorithm you want to run, the following commands can be used: + + + * [Canopy Clustering](/users/clustering/canopy-clustering.html) + + bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job + + * [k-Means Clustering](/users/clustering/k-means-clustering.html) + + bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job + + + * [Fuzzy k-Means Clustering](/users/clustering/fuzzy-k-means.html) + + bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job + +The clustering output will be produced in the *output* directory. The output data points are in vector format. In order to read/analyze the output, you can use the [clusterdump](/users/clustering/cluster-dumper.html) utility provided by Mahout. 
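Step 5 of the setup can be scripted. The following is a minimal sketch, assuming the dataset from step 1 was saved as `synthetic_control.data`; `stage_testdata` is a hypothetical helper and is not shipped with Mahout.

```shell
# Hypothetical helper (not part of Mahout): copy a downloaded dataset
# into ./testdata, creating the folder if needed (setup step 5).
stage_testdata() {
  dataset="$1"
  if [ ! -f "$dataset" ]; then
    echo "dataset not found: $dataset" >&2
    return 1
  fi
  mkdir -p testdata && cp "$dataset" testdata/
}

# Usage, run from the unpacked Mahout distribution folder:
#   stage_testdata synthetic_control.data
#   bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
```

The helper fails early when the dataset is missing, so a forgotten download is reported before any job is started.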
+ http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md b/website/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md new file mode 100644 index 0000000..40ed55f --- /dev/null +++ b/website/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md @@ -0,0 +1,11 @@ +--- +layout: mr_tutorial +title: Clustering Seinfeld Episodes +theme: + name: retro-mahout +--- + +Below is a short tutorial on how to cluster Seinfeld episode transcripts with +Mahout. + +http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/ http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/clusteringyourdata.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/clustering/clusteringyourdata.md b/website/docs/tutorials/map-reduce/clustering/clusteringyourdata.md new file mode 100644 index 0000000..782cbb5 --- /dev/null +++ b/website/docs/tutorials/map-reduce/clustering/clusteringyourdata.md @@ -0,0 +1,126 @@ +--- +layout: mr_tutorial +title: ClusteringYourData +theme: + name: retro-mahout +--- + +# Clustering your data + +After you've done the [Quickstart](quickstart.html) and are familiar with the basics of Mahout, it is time to cluster your own +data. See also [Wikipedia on cluster analysis](http://en.wikipedia.org/wiki/Cluster_analysis) for more background. + +The following pieces *may* be useful in getting started: + +<a name="ClusteringYourData-Input"></a> +# Input + +For starters, you will need your data in an appropriate Vector format; see [Creating Vectors](../basics/creating-vectors.html). 
+In particular for text preparation check out [Creating Vectors from Text](../basics/creating-vectors-from-text.html). + + +<a name="ClusteringYourData-RunningtheProcess"></a> +# Running the Process + +* [Canopy background](canopy-clustering.html) and [canopy-commandline](canopy-commandline.html). + +* [K-Means background](k-means-clustering.html), [k-means-commandline](k-means-commandline.html), and +[fuzzy-k-means-commandline](fuzzy-k-means-commandline.html). + +* [Dirichlet background](dirichlet-process-clustering.html) and [dirichlet-commandline](dirichlet-commandline.html). + +* [Meanshift background](mean-shift-clustering.html) and [mean-shift-commandline](mean-shift-commandline.html). + +* [LDA (Latent Dirichlet Allocation) background](-latent-dirichlet-allocation.html) and [lda-commandline](lda-commandline.html). + +* TODO: kmeans++/ streaming kMeans documentation + + +<a name="ClusteringYourData-RetrievingtheOutput"></a> +# Retrieving the Output + +Mahout has a cluster dumper utility that can be used to retrieve and evaluate your clustering data. + + ./bin/mahout clusterdump <OPTIONS> + + +<a name="ClusteringYourData-Theclusterdumperoptionsare:"></a> +## The cluster dumper options are: + + --help (-h) Print out help + + --input (-i) input The directory containing Sequence + Files for the Clusters + + --output (-o) output The output file. If not specified, + dumps to the console. + + --outputFormat (-of) outputFormat The optional output format to write + the results as. Options: TEXT, CSV, or GRAPH_ML + + --substring (-b) substring The number of chars of the + asFormatString() to print + + --pointsDir (-p) pointsDir The directory containing points + sequence files mapping input vectors to their cluster. If specified, + then the program will output the + points associated with a cluster + + --dictionary (-d) dictionary The dictionary file. 
+ + --dictionaryType (-dt) dictionaryType The dictionary file type + (text|sequencefile) + + --distanceMeasure (-dm) distanceMeasure The classname of the DistanceMeasure. + Default is SquaredEuclidean. + + --numWords (-n) numWords The number of top terms to print + + --tempDir tempDir Intermediate output directory + + --startPhase startPhase First phase to run + + --endPhase endPhase Last phase to run + + --evaluate (-e) Run ClusterEvaluator and CDbwEvaluator over the + input. The output will be appended to the rest of + the output at the end. + + +More information on using clusterdump utility can be found [here](cluster-dumper.html) + +<a name="ClusteringYourData-ValidatingtheOutput"></a> +# Validating the Output + +{quote} +Ted Dunning: A principled approach to cluster evaluation is to measure how well the +cluster membership captures the structure of unseen data. A natural +measure for this is to measure how much of the entropy of the data is +captured by cluster membership. For k-means and its natural L_2 metric, +the natural cluster quality metric is the squared distance from the nearest +centroid adjusted by the log_2 of the number of clusters. This can be +compared to the squared magnitude of the original data or the squared +deviation from the centroid for all of the data. The idea is that you are +changing the representation of the data by allocating some of the bits in +your original representation to represent which cluster each point is in. +If those bits aren't made up by the residue being small then your +clustering is making a bad trade-off. + +In the past, I have used other more heuristic measures as well. One of the +key characteristics that I would like to see out of a clustering is a +degree of stability. Thus, I look at the fractions of points that are +assigned to each cluster or the distribution of distances from the cluster +centroid. These values should be relatively stable when applied to held-out +data. 
+ +For text, you can actually compute perplexity which measures how well +cluster membership predicts what words are used. This is nice because you +don't have to worry about the entropy of real valued numbers. + +Manual inspection and the so-called laugh test is also important. The idea +is that the results should not be so ludicrous as to make you laugh. +Unfortunately, it is pretty easy to kid yourself into thinking your system +is working using this kind of inspection. The problem is that we are too +good at seeing (making up) patterns. +{quote} + http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md b/website/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md new file mode 100644 index 0000000..96d9b23 --- /dev/null +++ b/website/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md @@ -0,0 +1,97 @@ +--- +layout: mr_tutorial +title: fuzzy-k-means-commandline +theme: + name: retro-mahout +--- + +<a name="fuzzy-k-means-commandline-RunningFuzzyk-MeansClusteringfromtheCommandLine"></a> +# Running Fuzzy k-Means Clustering from the Command Line +Mahout's Fuzzy k-Means clustering can be launched from the same command +line invocation whether you are running on a single machine in stand-alone +mode or on a larger Hadoop cluster. The difference is determined by the +$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to +an operating Hadoop cluster on the target machine then the invocation will +run FuzzyK on that cluster. If either of the environment variables are +missing then the stand-alone Hadoop configuration will be invoked instead. 
+ + + ./bin/mahout fkmeans <OPTIONS> + + +* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job +will be generated in $MAHOUT_HOME/core/target/ and its name will contain +the Mahout version number. For example, when using the Mahout 0.3 release, the +job will be mahout-core-0.3.job + + +<a name="fuzzy-k-means-commandline-Testingitononesinglemachinew/ocluster"></a> +## Testing it on one single machine w/o cluster + +* Put the data: cp <PATH TO DATA> testdata +* Run the Job: + + ./bin/mahout fkmeans -i testdata <OPTIONS> + + +<a name="fuzzy-k-means-commandline-Runningitonthecluster"></a> +## Running it on the cluster + +* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh +* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata +* Run the Job: + + export HADOOP_HOME=<Hadoop Home Directory> + export HADOOP_CONF_DIR=$HADOOP_HOME/conf + ./bin/mahout fkmeans -i testdata <OPTIONS> + +* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output +to view all outputs. + +<a name="fuzzy-k-means-commandline-Commandlineoptions"></a> +# Command line options + + --input (-i) input Path to job input directory. + Must be a SequenceFile of + VectorWritable + --clusters (-c) clusters The input centroids, as Vectors. + Must be a SequenceFile of + Writable, Cluster/Canopy. If k + is also specified, then a random + set of vectors will be selected + and written out to this path + first + --output (-o) output The directory pathname for + output. + --distanceMeasure (-dm) distanceMeasure The classname of the + DistanceMeasure. Default is + SquaredEuclidean + --convergenceDelta (-cd) convergenceDelta The convergence delta value. + Default is 0.5 + --maxIter (-x) maxIter The maximum number of + iterations. + --k (-k) k The k in k-Means. If specified, + then a random selection of k + Vectors will be chosen as the + Centroid and written to the + clusters input path. 
+ --m (-m) m coefficient normalization + factor, must be greater than 1 + --overwrite (-ow) If present, overwrite the output + directory before running job + --help (-h) Print out help + --numMap (-u) numMap The number of map tasks. + Defaults to 10 + --maxRed (-r) maxRed The number of reduce tasks. + Defaults to 2 + --emitMostLikely (-e) emitMostLikely True if clustering should emit + the most likely point only, + false for threshold clustering. + Default is true + --threshold (-t) threshold The pdf threshold used for + cluster determination. Default + is 0 + --clustering (-cl) If present, run clustering after + the iterations have taken place + + http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/k-means-commandline.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/clustering/k-means-commandline.md b/website/docs/tutorials/map-reduce/clustering/k-means-commandline.md new file mode 100644 index 0000000..9dff963 --- /dev/null +++ b/website/docs/tutorials/map-reduce/clustering/k-means-commandline.md @@ -0,0 +1,94 @@ +--- +layout: mr_tutorial +title: k-means-commandline +theme: + name: retro-mahout +--- + +<a name="k-means-commandline-Introduction"></a> +# kMeans commandline introduction + +This quick start page describes how to run the kMeans clustering algorithm +on a Hadoop cluster. + +<a name="k-means-commandline-Steps"></a> +# Steps + +Mahout's k-Means clustering can be launched from the same command line +invocation whether you are running on a single machine in stand-alone mode +or on a larger Hadoop cluster. The difference is determined by the +$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to +an operating Hadoop cluster on the target machine then the invocation will +run k-Means on that cluster. 
If either of the environment variables is +missing then the stand-alone Hadoop configuration will be invoked instead. + + + ./bin/mahout kmeans <OPTIONS> + + +In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job +will be generated in $MAHOUT_HOME/core/target/ and its name will contain +the Mahout version number. For example, when using the Mahout 0.3 release, the +job will be mahout-core-0.3.job + + +<a name="k-means-commandline-Testingitononesinglemachinew/ocluster"></a> +## Testing it on one single machine w/o cluster + +* Put the data: cp <PATH TO DATA> testdata +* Run the Job: + + ./bin/mahout kmeans -i testdata -o output -c clusters -dm +org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k +25 + + +<a name="k-means-commandline-Runningitonthecluster"></a> +## Running it on the cluster + +* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh +* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata +* Run the Job: + + export HADOOP_HOME=<Hadoop Home Directory> + export HADOOP_CONF_DIR=$HADOOP_HOME/conf + ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25 + +* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output +to view all outputs. + +<a name="k-means-commandline-Commandlineoptions"></a> +# Command line options + + --input (-i) input Path to job input directory. + Must be a SequenceFile of + VectorWritable + --clusters (-c) clusters The input centroids, as Vectors. + Must be a SequenceFile of + Writable, Cluster/Canopy. If k + is also specified, then a random + set of vectors will be selected + and written out to this path + first + --output (-o) output The directory pathname for + output. + --distanceMeasure (-dm) distanceMeasure The classname of the + DistanceMeasure. Default is + SquaredEuclidean + --convergenceDelta (-cd) convergenceDelta The convergence delta value. 
+ Default is 0.5 + --maxIter (-x) maxIter The maximum number of + iterations. + --maxRed (-r) maxRed The number of reduce tasks. + Defaults to 2 + --k (-k) k The k in k-Means. If specified, + then a random selection of k + Vectors will be chosen as the + Centroid and written to the + clusters input path. + --overwrite (-ow) If present, overwrite the output + directory before running job + --help (-h) Print out help + --clustering (-cl) If present, run clustering after + the iterations have taken place + http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/lda-commandline.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/clustering/lda-commandline.md b/website/docs/tutorials/map-reduce/clustering/lda-commandline.md new file mode 100644 index 0000000..e4d7014 --- /dev/null +++ b/website/docs/tutorials/map-reduce/clustering/lda-commandline.md @@ -0,0 +1,83 @@ +--- +layout: mr_tutorial +title: lda-commandline +theme: + name: retro-mahout +--- + +<a name="lda-commandline-RunningLatentDirichletAllocation(algorithm)fromtheCommandLine"></a> +# Running Latent Dirichlet Allocation (algorithm) from the Command Line +[Since Mahout v0.6](https://issues.apache.org/jira/browse/MAHOUT-897) + LDA has been implemented as Collapsed Variational Bayes (cvb). + +Mahout's LDA can be launched from the same command line invocation whether +you are running on a single machine in stand-alone mode or on a larger +Hadoop cluster. The difference is determined by the $HADOOP_HOME and +$HADOOP_CONF_DIR environment variables. If both are set to an operating +Hadoop cluster on the target machine then the invocation will run the LDA +algorithm on that cluster. If either of the environment variables is +missing then the stand-alone Hadoop configuration will be invoked instead. 
+ + + + ./bin/mahout cvb <OPTIONS> + + +* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job +will be generated in $MAHOUT_HOME/core/target/ and its name will contain +the Mahout version number. For example, when using the Mahout 0.3 release, the +job will be mahout-core-0.3.job + + +<a name="lda-commandline-Testingitononesinglemachinew/ocluster"></a> +## Testing it on one single machine w/o cluster + +* Put the data: cp <PATH TO DATA> testdata +* Run the Job: + + ./bin/mahout cvb -i testdata <OTHER OPTIONS> + + +<a name="lda-commandline-Runningitonthecluster"></a> +## Running it on the cluster + +* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh +* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata +* Run the Job: + + export HADOOP_HOME=<Hadoop Home Directory> + export HADOOP_CONF_DIR=$HADOOP_HOME/conf + ./bin/mahout cvb -i testdata <OTHER OPTIONS> + +* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output +to view all outputs. + +<a name="lda-commandline-CommandlineoptionsfromMahoutcvbversion0.8"></a> +# Command line options from Mahout cvb version 0.8 + + mahout cvb -h + --input (-i) input Path to job input directory. + --output (-o) output The directory pathname for output. + --maxIter (-x) maxIter The maximum number of iterations. 
+ --convergenceDelta (-cd) convergenceDelta The convergence delta value + --overwrite (-ow) If present, overwrite the output directory before running job + --num_topics (-k) num_topics Number of topics to learn + --num_terms (-nt) num_terms Vocabulary size + --doc_topic_smoothing (-a) doc_topic_smoothing Smoothing for document/topic distribution + --term_topic_smoothing (-e) term_topic_smoothing Smoothing for topic/term distribution + --dictionary (-dict) dictionary Path to term-dictionary file(s) (glob expression supported) + --doc_topic_output (-dt) doc_topic_output Output path for the training doc/topic distribution + --topic_model_temp_dir (-mt) topic_model_temp_dir Path to intermediate model path (useful for restarting) + --iteration_block_size (-block) iteration_block_size Number of iterations per perplexity check + --random_seed (-seed) random_seed Random seed + --test_set_fraction (-tf) test_set_fraction Fraction of data to hold out for testing + --num_train_threads (-ntt) num_train_threads number of threads per mapper to train with + --num_update_threads (-nut) num_update_threads number of threads per mapper to update the model with + --max_doc_topic_iters (-mipd) max_doc_topic_iters max number of iterations per doc for p(topic|doc) learning + --num_reduce_tasks num_reduce_tasks number of reducers to use during model estimation + --backfill_perplexity enable backfilling of missing perplexity values + --help (-h) Print out help + --tempDir tempDir Intermediate output directory + --startPhase startPhase First phase to run + --endPhase endPhase Last phase to run + http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/viewing-result.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/clustering/viewing-result.md b/website/docs/tutorials/map-reduce/clustering/viewing-result.md new file mode 100644 index 0000000..e7c6107 --- /dev/null +++ 
b/website/docs/tutorials/map-reduce/clustering/viewing-result.md @@ -0,0 +1,15 @@ +--- +layout: mr_tutorial +title: Viewing Result +theme: + name: retro-mahout +--- +* [Algorithm Viewing pages](#ViewingResult-AlgorithmViewingpages) + +There are various technologies available to view the output of Mahout +algorithms. +* Clusters + +<a name="ViewingResult-AlgorithmViewingpages"></a> +# Algorithm Viewing pages +{pagetree:root=@self|excerpt=true|expandCollapseAll=true} http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/viewing-results.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/clustering/viewing-results.md b/website/docs/tutorials/map-reduce/clustering/viewing-results.md new file mode 100644 index 0000000..abbe7ef --- /dev/null +++ b/website/docs/tutorials/map-reduce/clustering/viewing-results.md @@ -0,0 +1,49 @@ +--- +layout: mr_tutorial +title: Viewing Results +theme: + name: retro-mahout +--- +<a name="ViewingResults-Intro"></a> +# Intro + +Many of the Mahout libraries run as batch jobs, dumping results into Hadoop +sequence files or other data structures. This page is intended to +demonstrate the various ways one might inspect the outcome of various jobs. + The page is organized by algorithms. 
+ +<a name="ViewingResults-GeneralUtilities"></a> +# General Utilities + +<a name="ViewingResults-SequenceFileDumper"></a> +## Sequence File Dumper + + +<a name="ViewingResults-Clustering"></a> +# Clustering + +<a name="ViewingResults-ClusterDumper"></a> +## Cluster Dumper + +Run the following to print out all options: + + java -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --help + + + +<a name="ViewingResults-Example"></a> +### Example + + java -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --seqFileDir +./solr-clust-n2/out/clusters-2 + --dictionary ./solr-clust-n2/dictionary.txt + --substring 100 --pointsDir ./solr-clust-n2/out/points/ + + + + +<a name="ViewingResults-ClusterLabels(MAHOUT-163)"></a> +## Cluster Labels (MAHOUT-163) + +<a name="ViewingResults-Classification"></a> +# Classification http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md b/website/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md new file mode 100644 index 0000000..9d6c827 --- /dev/null +++ b/website/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md @@ -0,0 +1,50 @@ +--- +layout: mr_tutorial +title: Visualizing Sample Clusters +theme: + name: retro-mahout +--- + +<a name="VisualizingSampleClusters-Introduction"></a> +# Introduction + +Mahout provides examples to visualize sample clusters that are created by +our clustering algorithms. Note that the visualization is done by Swing programs. You have to be in a window system on the same +machine you run these, or logged in via a remote desktop. + +For visualizing the clusters, you have to execute the Java +classes under the *org.apache.mahout.clustering.display* package in +the mahout-examples module. 
The easiest way to achieve this is to [set up Mahout](users/basics/quickstart.html) in your IDE. + +<a name="VisualizingSampleClusters-Visualizingclusters"></a> +# Visualizing clusters + +The following classes in *org.apache.mahout.clustering.display* can be run +without parameters to generate a sample data set and run the reference +clustering implementations over it: + +1. **DisplayClustering** - generates 1000 samples from three symmetric +distributions. This is the same data set that is used by the following +clustering programs. It displays the points on a screen and superimposes +the model parameters that were used to generate the points. You can edit +the *generateSamples()* method to change the sample points used by these +programs. +1. **DisplayClustering** - displays initial areas of generated points +1. **DisplayCanopy** - uses Canopy clustering +1. **DisplayKMeans** - uses k-Means clustering +1. **DisplayFuzzyKMeans** - uses Fuzzy k-Means clustering +1. **DisplaySpectralKMeans** - uses Spectral KMeans via map-reduce algorithm + +If you are using Eclipse, just right-click on each of the classes mentioned above and choose "Run As > Java Application". To run these directly from the command line: + + cd $MAHOUT_HOME/examples + mvn -q exec:java -Dexec.mainClass=org.apache.mahout.clustering.display.DisplayClustering + +You can substitute other names above for *DisplayClustering*. + + +Note that some of these programs display the sample points and then superimpose all of the clusters from each iteration. The last iteration's clusters are in +bold red and the previous several are colored (orange, yellow, green, blue, +magenta) in order, after which all earlier clusters are in light grey. This +helps to visualize how the clusters converge upon a solution over multiple +iterations. 
\ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/misc/mr---map-reduce.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/misc/mr---map-reduce.md b/website/docs/tutorials/map-reduce/misc/mr---map-reduce.md new file mode 100644 index 0000000..b03d6ad --- /dev/null +++ b/website/docs/tutorials/map-reduce/misc/mr---map-reduce.md @@ -0,0 +1,19 @@ +--- +layout: default +title: MR - Map Reduce +theme: + name: retro-mahout +--- + +{excerpt}MapReduce is a framework for processing huge datasets on certain +kinds of distributable problems using a large number of computers (nodes), +collectively referred to as a cluster.{excerpt} Computational processing +can occur on data stored either in a filesystem (unstructured) or within a +database (structured). + + Also written M/R + + + See Also +* [http://wiki.apache.org/hadoop/HadoopMapReduce](http://wiki.apache.org/hadoop/HadoopMapReduce) +* [http://en.wikipedia.org/wiki/MapReduce](http://en.wikipedia.org/wiki/MapReduce) http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md b/website/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md new file mode 100644 index 0000000..e2978a4 --- /dev/null +++ b/website/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md @@ -0,0 +1,185 @@ +--- +layout: default +title: Parallel Frequent Pattern Mining +theme: + name: retro-mahout +--- +Mahout has a Top K Parallel FPGrowth Implementation. It is based on the paper [http://infolab.stanford.edu/~echang/recsys08-69.pdf](http://infolab.stanford.edu/~echang/recsys08-69.pdf) + with some optimisations in mining the data. 
+ +Given a huge transaction list, the algorithm finds all unique features (sets +of field values) and eliminates those features whose frequency in the whole +dataset is less than minSupport. Using these remaining features N, we find +the top K closed patterns for each of them, generating a total of NxK +patterns. The FPGrowth algorithm is a generic implementation; we can use any +Object type to denote a feature. The current implementation requires you to use +a String as the object type. You may implement a version for any object by +creating Iterators, Convertors and TopKPatternWritable for that particular +object. For more information please refer to the package +org.apache.mahout.fpm.pfpgrowth.convertors.string + + e.g: + FPGrowth<String> fp = new FPGrowth<String>(); + Set<String> features = new HashSet<String>(); + fp.generateTopKStringFrequentPatterns( + new StringRecordIterator(new FileLineIterable(new File(input), +encoding, false), pattern), + fp.generateFList( + new StringRecordIterator(new FileLineIterable(new File(input), +encoding, false), pattern), minSupport), + minSupport, + maxHeapSize, + features, + new StringOutputConvertor(new SequenceFileOutputCollector<Text, +TopKStringPatterns>(writer)) + ); + +* The first argument is the iterator over transactions; in this case it is an +Iterator<List<String>> +* The second argument is the output of the generateFList function, which +returns the frequent items and their frequencies from the given database +transaction iterator +* The third argument is the minimum support of the pattern to be generated +* The fourth argument is the maximum number of patterns to be mined for +each feature +* The fifth argument is the set of features for which the frequent patterns +have to be mined +* The last argument is an output collector which takes \[key, value\] pairs + of Feature and TopK Patterns of the format \[String, +List<Pair<List<String>, Long>>\] and writes them to the appropriate writer +class which takes care of storing the 
object, in this case in a Sequence +File Output format + +<a name="ParallelFrequentPatternMining-RunningFrequentPatternGrowthviacommandline"></a> +## Running Frequent Pattern Growth via command line + +The command line launcher for string transaction data +org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver has other features including +specifying the regex pattern for splitting a string line of a transaction +into the constituent features. + +Input files have to be in the following format. + +<optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE.... + +Instead of tab you could use , or \| as the default tokenization is done using the Java regex pattern `[ ,\t]*[,|\t][ ,\t]*` +You can override this parameter to parse your log files or transaction +files (each line is a transaction.) The FPGrowth algorithm mines the top K +frequently occurring sets of items and their counts from the given input +data + +$MAHOUT_HOME/core/src/test/resources/retail.dat is a sample dataset in this +format. +Other sample files, such as accidents.dat.gz, are available from [http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/). +As a quick test, try this: + + + bin/mahout fpg \ + -i core/src/test/resources/retail.dat \ + -o patterns \ + -k 50 \ + -method sequential \ + -regex '[\ ] +' \ + -s 2 + + +The minimumSupport parameter \-s is the minimum number of times a pattern +or a feature needs to occur in the dataset so that it is included in the +patterns generated. You can speed up the process by using a larger value of +s. There are cases where you will have fewer than k patterns for a +particular feature as the rest don't meet the minimum support +criterion. + +Note that the input to the algorithm can be an uncompressed file, a compressed +gz file or even a directory containing any number of such files. +We modified the regex to split tokens on spaces. Note that the input +regex string is escaped. 
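To make the roles of the \-regex and \-s options more concrete, here is a minimal Python sketch. It is not Mahout code: the function name `frequent_items`, the sample transactions, and the choice to count an item once per transaction are all assumptions made for illustration. It splits each transaction line on spaces and keeps only the items whose count reaches the minimum support.

```python
import re
from collections import Counter


def frequent_items(lines, min_support, splitter=r"[ ]+"):
    """Count items over all transactions and keep those seen at least min_support times.

    Hypothetical helper for illustration only; Mahout's own feature counting
    may differ in detail.
    """
    pattern = re.compile(splitter)
    counts = Counter()
    for line in lines:
        # Each line is one transaction; an item counts once per transaction.
        items = {token for token in pattern.split(line.strip()) if token}
        counts.update(items)
    return {item: n for item, n in counts.items() if n >= min_support}


# Made-up sample data: three transactions with space-separated items.
transactions = ["milk bread", "milk eggs", "bread milk butter"]
print(frequent_items(transactions, min_support=2))
```

With a minimum support of 2, only `milk` and `bread` survive; raising the threshold shrinks the feature list, which is why a larger s tends to run faster.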
+ +<a name="ParallelFrequentPatternMining-RunningParallelFPGrowth"></a> +## Running Parallel FPGrowth + +Running parallel FPGrowth is as easy as changing the flag to \-method +mapreduce and adding the number of groups parameter e.g. \-g 20 for 20 +groups. First, let's run the above sample test in map-reduce mode: + + bin/mahout fpg \ + -i core/src/test/resources/retail.dat \ + -o patterns \ + -k 50 \ + -method mapreduce \ + -regex '[\ ] +' \ + -s 2 + +The above test took 102 seconds on a dual-core laptop, vs. 609 seconds in +the sequential mode (with 5 GB of RAM allocated). In a separate test, +the first 1000 lines of retail.dat took 20 seconds in map/reduce vs. 30 +seconds in sequential mode. + +Here is another dataset which, while several times larger, requires much +less time to find frequent patterns, as there are very few. Get +accidents.dat.gz from [http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/) + and place it on your hdfs in a folder named accidents. Then, run the +hadoop version of the FPGrowth job: + + bin/mahout fpg \ + -i accidents \ + -o patterns \ + -k 50 \ + -method mapreduce \ + -regex '[\ ] +' \ + -s 2 + + +Or, to run a dataset of this size in sequential mode on a single machine, +let's give Mahout a lot more memory and only keep features with more than +300 members: + + export MAHOUT_HEAPSIZE=-Xmx5000m + bin/mahout fpg \ + -i accidents \ + -o patterns \ + -k 50 \ + -method sequential \ + -regex '[\ ] +' \ + -s 2 + + + +The numGroups parameter \-g in FPGrowthJob specifies the number of groups +into which transactions have to be decomposed. The default of 1000 works +very well on a single-machine cluster; this may be very different on large +clusters. + +Note that accidents.dat has 340 unique features. So we chose \-g 10 to +split the transactions across 10 shards where 34 patterns are mined from +each shard. (Note: g doesn't need to be exactly divisible.) The algorithm +takes care of calculating the split. 
For better performance in large +datasets and clusters, try not to mine for more than 20-25 features per +shard. Stick to the defaults on a small machine. + +The numTreeCacheEntries parameter \-tc specifies the number of generated +conditional FP-Trees to be kept in memory so that subsequent operations do +not need to regenerate them. Increasing this number increases the memory +consumption but might improve speed up to a certain point. This depends +entirely on the dataset in question. A value of 5-10 is recommended for +mining up to top 100 patterns for each feature. + +<a name="ParallelFrequentPatternMining-Viewingtheresults"></a> +## Viewing the results +The output will be dumped to a SequenceFile in the frequentpatterns +directory in Text=>TopKStringPatterns format. Run this command to see a few +of the Frequent Patterns: + + bin/mahout seqdumper \ + -i patterns/frequentpatterns/part-?-00000 \ + -n 4 + +or replace -n 4 with -c for the count of patterns. + +Open questions: how does one experiment with and monitor these various +parameters? http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md b/website/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md new file mode 100644 index 0000000..308040c --- /dev/null +++ b/website/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md @@ -0,0 +1,41 @@ +--- +layout: default +title: Perceptron and Winnow +theme: + name: retro-mahout +--- +<a name="PerceptronandWinnow-ClassificationwithPerceptronorWinnow"></a> +# Classification with Perceptron or Winnow + +Both algorithms are comparably simple linear classifiers. Given training +data in some n-dimensional vector space that is annotated with binary +labels the algorithms are guaranteed to find a linear separating hyperplane +if one exists. 
In contrast to the Perceptron, Winnow works only for binary +feature vectors. + +For more information on the Perceptron see for instance: +http://en.wikipedia.org/wiki/Perceptron + +Concise course notes on both algorithms: +http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture24.pdf + +Although the algorithms are comparably simple they still work pretty well +for text classification and are fast to train even for huge example sets. +In contrast to Naive Bayes they are not based on the assumption that all +features (in the domain of text classification: all terms in a document) +are independent. + +<a name="PerceptronandWinnow-Strategyforparallelisation"></a> +## Strategy for parallelisation + +Currently the strategy for parallelisation is simple: Given there is enough +training data, split the training data. Train the classifier on each split. +The resulting hyperplanes are then averaged. + +<a name="PerceptronandWinnow-Roadmap"></a> +## Roadmap + +Currently the patch only contains the code for the classifier itself. It is +planned to provide unit tests and at least one example based on the WebKB +dataset by the end of November for the serial version. After that the +parallelisation will be added. http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/misc/testing.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/misc/testing.md b/website/docs/tutorials/map-reduce/misc/testing.md new file mode 100644 index 0000000..dc3fd43 --- /dev/null +++ b/website/docs/tutorials/map-reduce/misc/testing.md @@ -0,0 +1,46 @@ +--- +layout: default +title: Testing +theme: + name: retro-mahout +--- +<a name="Testing-Intro"></a> +# Intro + +As Mahout matures, solid testing procedures are needed. This page and its +children capture test plans along with ideas for improving our testing. 
+ +<a name="Testing-TestPlans"></a> +# Test Plans + +* [0.6](0.6.html) + - Test Plans for the 0.6 release +There are no special plans except for unit tests, and user testing of the +Hadoop jobs. + +<a name="Testing-TestIdeas"></a> +# Test Ideas + +<a name="Testing-Regressions/Benchmarks/Integrations"></a> +## Regressions/Benchmarks/Integrations +* Algorithmic quality and speed are not tested, except in a few instances. +Such tests often require much longer run times (minutes to hours), a +running Hadoop cluster, and downloads of large datasets (in the megabytes). +* Standardized speed tests are difficult on different hardware. +* Unit tests of external integrations require access to externals: HDFS, +S3, JDBC, Cassandra, etc. + +Apache Jenkins is not able to support these environments. Commercial +donations would help. + +<a name="Testing-UnitTests"></a> +## Unit Tests +Mahout's current tests are almost entirely unit tests. Algorithm tests +generally supply a few numbers to code paths and verify that expected +numbers come out. 'mvn test' runs these tests. There is "positive" coverage +of a great many utilities and algorithms. A much smaller percent include +"negative" coverage (bogus setups, inputs, combinations). 
+ +<a name="Testing-Other"></a> +## Other + http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md b/website/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md new file mode 100644 index 0000000..57378ba --- /dev/null +++ b/website/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md @@ -0,0 +1,222 @@ +--- +layout: default +title: Using Mahout with Python via JPype +theme: + name: retro-mahout +--- + +<a name="UsingMahoutwithPythonviaJPype-overview"></a> +# Mahout over Jython - some examples +This tutorial provides some sample code illustrating how we can read and +write sequence files containing Mahout vectors from Python using JPype. +This tutorial is intended for people who want to use Python for analyzing +and plotting Mahout data. Using Mahout from Python turns out to be quite +easy. + +This tutorial concerns the use of CPython as opposed to Jython. +Jython wasn't an option for me, because (to the best of my knowledge) +Jython doesn't work with the Python extensions numpy, matplotlib, or h5py +which I rely on heavily. + +The instructions below explain how to set up a Python script to read and +write the output of Mahout clustering. + +You will first need to download and install the JPype package for Python. + +The first step to setting up JPype is determining the path to the dynamic +library for the JVM; on Linux this will be a .so file and on Windows it +will be a .dll. + +In your Python script, create a global variable with the path to this library + + + +Next we need to figure out how we need to set the classpath for Mahout. The +easiest way to do this is to edit the script in "bin/mahout" to print out +the classpath. 
Add the line "echo $CLASSPATH" to the script somewhere after +the comment "run it" (this is line 195 or so). Execute the script to print +out the classpath. Copy this output and paste it into a variable in your +Python script. The result for me looks like the following + + + + +Now we can create a function to start the JVM in Python using JPype + + jvm=None + def start_jpype(): + global jvm + if (jvm is None): + cpopt="-Djava.class.path={cp}".format(cp=classpath) + startJVM(jvmlib,"-ea",cpopt) + jvm="started" + + + +<a name="UsingMahoutwithPythonviaJPype-WritingNamedVectorstoSequenceFilesfromPython"></a> +# Writing Named Vectors to Sequence Files from Python +We can now use JPype to create sequence files which will contain vectors to +be used by Mahout for kmeans. The example below is a function which creates +vectors from two Gaussian distributions with unit variance. + + + def create_inputs(ifile,*args,**param): + """Create a sequence file containing some normally distributed vectors + ifile - path to the sequence file to create + """ + + #matrix of the cluster means + cmeans=np.array([[1,1] ,[-1,-1]],np.int) + + nperc=30 #number of points per cluster + + vecs=[] + + vnames=[] + for cind in range(cmeans.shape[0]): + pts=np.random.randn(nperc,2) + pts=pts+cmeans[cind,:].reshape([1,cmeans.shape[1]]) + vecs.append(pts) + + #names for the vectors + #names are just the points with an index + #we do this so we can validate by cross-referencing the name with the vector + vn=np.empty(nperc,dtype=(np.str,30)) + for row in range(nperc): + vn[row]="c"+str(cind)+"_"+pts[row,0].astype((np.str,4))+"_"+pts[row,1].astype((np.str,4)) + vnames.append(vn) + + vecs=np.vstack(vecs) + vnames=np.hstack(vnames) + + + #start the jvm + start_jpype() + + #create the sequence file that we will write to + io=JPackage("org").apache.hadoop.io + FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem + + PathCls=JPackage("org").apache.hadoop.fs.Path + path=PathCls(ifile) + + 
ConfCls=JPackage("org").apache.hadoop.conf.Configuration + conf=ConfCls() + + fs=FileSystemCls.get(conf) + + #vector classes + VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable + DenseVectorCls=JPackage("org").apache.mahout.math.DenseVector + NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector + writer=io.SequenceFile.createWriter(fs, conf, path,io.Text,VectorWritableCls) + + + vecwritable=VectorWritableCls() + for row in range(vecs.shape[0]): + nvector=NamedVectorCls(DenseVectorCls(JArray(JDouble,1)(vecs[row,:])),vnames[row]) + #need to wrap key and value because of overloading + wrapkey=JObject(io.Text("key "+str(row)),io.Writable) + wrapval=JObject(vecwritable,io.Writable) + + vecwritable.set(nvector) + writer.append(wrapkey,wrapval) + + writer.close() + + +<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansClusteredPointsfromPython"></a> +# Reading the KMeans Clustered Points from Python +Similarly we can use JPype to easily read the clustered points outputted by +mahout. 
+ + def read_clustered_pts(ifile,*args,**param): + """Read the clustered points + ifile - path to the sequence file containing the clustered points + """ + + #start the jvm + start_jpype() + + #create the sequence file that we will write to + io=JPackage("org").apache.hadoop.io + FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem + + PathCls=JPackage("org").apache.hadoop.fs.Path + path=PathCls(ifile) + + ConfCls=JPackage("org").apache.hadoop.conf.Configuration + conf=ConfCls() + + fs=FileSystemCls.get(conf) + + #vector classes + VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable + NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector + + + ReaderCls=io.__getattribute__("SequenceFile$Reader") + reader=ReaderCls(fs, path,conf) + + + key=reader.getKeyClass()() + + + valcls=reader.getValueClass() + vecwritable=valcls() + while (reader.next(key,vecwritable)): + weight=vecwritable.getWeight() + nvec=vecwritable.getVector() + + cname=nvec.__class__.__name__ + if (cname.rsplit('.',1)[1]=="NamedVector"): + print "cluster={key} Name={name} x={x}y={y}".format(key=key.toString(),name=nvec.getName(),x=nvec.get(0),y=nvec.get(1)) + else: + raise NotImplementedError("Vector isn't a NamedVector. 
Need to modify/test the code to handle this case.") + + +<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansCentroids"></a> +# Reading the KMeans Centroids +Finally we can create a function to print out the actual cluster centers +found by Mahout, + + def getClusters(ifile,*args,**param): + """Read the centroids from the clusters outputted by kmeans + ifile - Path to the sequence file containing the centroids + """ + + #start the jvm + start_jpype() + + #open the sequence file that we will read from + io=JPackage("org").apache.hadoop.io + FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem + + PathCls=JPackage("org").apache.hadoop.fs.Path + path=PathCls(ifile) + + ConfCls=JPackage("org").apache.hadoop.conf.Configuration + conf=ConfCls() + + fs=FileSystemCls.get(conf) + + #vector classes + VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable + NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector + ReaderCls=io.__getattribute__("SequenceFile$Reader") + reader=ReaderCls(fs, path,conf) + + + key=io.Text() + + + valcls=reader.getValueClass() + + vecwritable=valcls() + + while (reader.next(key,vecwritable)): + center=vecwritable.getCenter() + + print "id={cid}center={center}".format(cid=vecwritable.getId(),center=center.values) + pass + http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md b/website/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md new file mode 100644 index 0000000..2acacd0 --- /dev/null +++ b/website/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md @@ -0,0 +1,98 @@ +--- +layout: default +title: Perceptron and Winnow +theme: + name: retro-mahout +--- + +# Introduction to ALS Recommendations with Hadoop + +##Overview + +Mahout's ALS recommender is a matrix factorization algorithm that 
uses Alternating Least Squares with Weighted-Lambda-Regularization (ALS-WR). It factors the user to item matrix *A* into the user-to-feature matrix *U* and the item-to-feature matrix *M*. It runs the ALS algorithm in a parallel fashion. The algorithm details can be found in the following papers: + +* [Large-scale Parallel Collaborative Filtering for +the Netflix Prize](http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08%28submitted%29.pdf) +* [Collaborative Filtering for Implicit Feedback Datasets](http://research.yahoo.com/pub/2433) + +This recommendation algorithm can be used on an eCommerce platform to recommend products to customers. Unlike the user or item based recommenders that compute the similarity of users or items to make recommendations, the ALS algorithm uncovers the latent factors that explain the observed user to item ratings and tries to find optimal factor weights to minimize the least squares error between predicted and actual ratings. + +Mahout's ALS recommendation algorithm takes as input user preferences by item and generates recommended items for a user. The input customer preference could either be explicit user ratings or implicit feedback such as a user's click on a web page. + +One of the strengths of the ALS based recommender, compared to the user or item based recommender, is its ability to handle large sparse data sets and its better prediction performance. It can also give an intuitive rationale of the factors that influence recommendations. + +##Implementation +At present Mahout has a map-reduce implementation of ALS, which is composed of 2 jobs: a parallel matrix factorization job and a recommendation job. +The matrix factorization job computes the user-to-feature matrix and item-to-feature matrix given the user to item ratings. 
Its input includes: +<pre> + --input: directory containing files of explicit user to item rating or implicit feedback; + --output: output path of the user-feature matrix and feature-item matrix; + --lambda: regularization parameter to avoid overfitting; + --alpha: confidence parameter only used on implicit feedback + --implicitFeedback: boolean flag to indicate whether the input dataset contains implicit feedback; + --numFeatures: dimensions of feature space; + --numThreadsPerSolver: number of threads per solver mapper for concurrent execution; + --numIterations: number of iterations + --usesLongIDs: boolean flag to indicate whether the input contains long IDs that need to be translated +</pre> +and it outputs the matrices in sequence file format. + +The recommendation job uses the user feature matrix and item feature matrix calculated from the factorization job to compute the top-N recommendations per user. Its input includes: +<pre> + --input: directory containing files of user ids; + --output: output path of the recommended items for each input user id; + --userFeatures: path to the user feature matrix; + --itemFeatures: path to the item feature matrix; + --numRecommendations: maximum number of recommendations per user, default is 10; + --maxRating: maximum rating available; + --numThreads: number of threads per mapper; + --usesLongIDs: boolean flag to indicate whether the input contains long IDs that need to be translated; + --userIDIndex: index for user long IDs (necessary if usesLongIDs is true); + --itemIDIndex: index for item long IDs (necessary if usesLongIDs is true) +</pre> +and it outputs a list of recommended item ids for each user. The predicted rating between user and item is a dot product of the user's feature vector and the item's feature vector. + +##Example + +Let's look at a simple example of how we could use Mahout's ALS recommender to recommend items for users. 
First, you'll need to get Mahout up and running, the instructions for which can be found [here](https://mahout.apache.org/users/basics/quickstart.html). After you've ensured Mahout is properly installed, we're ready to run the example. + +**Step 1: Prepare test data** + +Similar to Mahout's item based recommender, the ALS recommender relies on the user to item preference data: *userID*, *itemID* and *preference*. The preference could be an explicit numeric rating or counts of actions such as a click (implicit feedback). Each line of the test data file is a tab-delimited string: the 1st field is the user id, which must be numeric, the 2nd field is the item id, which must be numeric, and the 3rd field is the preference, which should also be a number. + +**Note:** You must create IDs that are ordinal positive integers for all user and item IDs. Often this will require you to keep a dictionary +to map into and out of Mahout IDs. For instance if the first user has ID "xyz" in your application, this would get a Mahout ID of the integer 1 and so on. The same +goes for item IDs. Then after recommendations are calculated you will have to translate the Mahout user and item IDs back into your application IDs. + +To quickly start, you could specify a text file like the following as the input: +<pre> +1 100 1 +1 200 5 +1 400 1 +2 200 2 +2 300 1 +</pre> + +**Step 2: Determine parameters** + +In addition, users need to determine the dimension of the feature space and the number of iterations to run the alternating least squares algorithm. Using 10 features and 15 iterations is a reasonable default to try first. Optionally a confidence parameter can be set if the input preference is implicit user feedback. + +**Step 3: Run ALS** + +Assuming your *JAVA_HOME* is appropriately set and Mahout was installed properly we're ready to configure our syntax. 
Enter the following command: + + $ mahout parallelALS --input $als_input --output $als_output --lambda 0.1 --implicitFeedback true --alpha 0.8 --numFeatures 2 --numIterations 5 --numThreadsPerSolver 1 --tempDir tmp + +Running the command will execute a series of jobs the final product of which will be an output file deposited to the output directory specified in the command syntax. The output directory contains 3 sub-directories: *M* stores the item to feature matrix, *U* stores the user to feature matrix and userRatings stores the user's ratings on the items. The *tempDir* parameter specifies the directory to store the intermediate output of the job, such as the matrix output in each iteration and each item's average rating. Using the *tempDir* will help with debugging. + +**Step 4: Make Recommendations** + +Based on the output feature matrices from step 3, we could make recommendations for users. Enter the following command: + + $ mahout recommendfactorized --input $als_recommender_input --userFeatures $als_output/U/ --itemFeatures $als_output/M/ --numRecommendations 1 --output recommendations --maxRating 1 + +The input user file is a sequence file, the sequence record key is user id and value is the user's rated item ids which will be removed from recommendation. The output file generated in our simple example will be a text file giving the recommended item ids for each user. +Remember to translate the Mahout ids back into your application specific ids. + +There exist a variety of parameters for Mahout's ALS recommender to accommodate custom business requirements; exploring and testing various configurations to suit your needs will doubtless lead to additional questions. Feel free to ask such questions on the [mailing list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html). 
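As noted in the Implementation section, the predicted rating for a user and an item is the dot product of their feature vectors. The following Python sketch illustrates that scoring step and a top-N selection over it. It is an illustration under stated assumptions, not the code of the recommendation job: the helper names `predict_rating` and `top_n` and all feature values are invented.

```python
def predict_rating(user_features, item_features):
    """Predicted rating = dot product of the two feature vectors."""
    return sum(u * m for u, m in zip(user_features, item_features))


def top_n(user_features, items, n):
    """Score every item for one user and return the n best (item_id, score) pairs.

    Hypothetical helper; `items` maps item id -> feature vector.
    """
    scored = [(item_id, predict_rating(user_features, features))
              for item_id, features in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Made-up 2-dimensional feature vectors (as if numFeatures were 2).
user = [0.5, 1.0]
items = {100: [1.0, 0.0], 200: [0.0, 2.0], 300: [1.0, 1.0]}
print(top_n(user, items, 2))
```

In this toy data item 200 scores highest for the user, followed by item 300. A real job would additionally skip items the user has already rated, as described above.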
+ http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/recommender/intro-cooccurrence-spark.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/map-reduce/recommender/intro-cooccurrence-spark.md b/website/docs/tutorials/map-reduce/recommender/intro-cooccurrence-spark.md new file mode 100644 index 0000000..578f3c4 --- /dev/null +++ b/website/docs/tutorials/map-reduce/recommender/intro-cooccurrence-spark.md @@ -0,0 +1,437 @@ +--- +layout: default +title: Perceptron and Winnow +theme: + name: retro-mahout +--- + +#Intro to Cooccurrence Recommenders with Spark + +Mahout provides several important building blocks for creating recommendations using Spark. *spark-itemsimilarity* can +be used to create "other people also liked these things" type recommendations and paired with a search engine can +personalize recommendations for individual users. *spark-rowsimilarity* can provide non-personalized content based +recommendations and when paired with a search engine can be used to personalize content based recommendations. + +##References + +1. A free ebook, which talks about the general idea: [Practical Machine Learning](https://www.mapr.com/practical-machine-learning) +2. A slide deck, which talks about mixing actions or other indicators: [Creating a Unified Recommender](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/) +3. Two blog posts: [What's New in Recommenders: part #1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/) +and [What's New in Recommenders: part #2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/) +4. A post describing the log-likelihood ratio: [Surprise and Coincidence](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html). LLR is used to reduce noise in the data while keeping the calculations O(n) complexity. 
+ +Below are the command line jobs but the drivers and associated code can also be customized and accessed from the Scala APIs. + +##1. spark-itemsimilarity +*spark-itemsimilarity* is the Spark counterpart of the Mahout mapreduce job called *itemsimilarity*. It takes in elements of interactions, which have userID, itemID, and optionally a value. It will produce one or more indicator matrices created by comparing every user's interactions with every other user. The indicator matrix is an item x item matrix where the values are log-likelihood ratio strengths. For the legacy mapreduce version, there were several possible similarity measures but these are being deprecated in favor of LLR because in practice it performs the best. + +Mahout's mapreduce version of itemsimilarity takes a text file that is expected to have user and item IDs that conform to +Mahout's ID requirements--they are non-negative integers that can be viewed as row and column numbers in a matrix. + +*spark-itemsimilarity* also extends the notion of cooccurrence to cross-cooccurrence, in other words the Spark version will +account for multi-modal interactions and create indicator matrices allowing the use of much more data in +creating recommendations or similar item lists. People try to do this by mixing different actions and giving them weights. +For instance they might say an item-view is 0.2 of an item purchase. In practice this is often not helpful. Spark-itemsimilarity's +cross-cooccurrence is a more principled way to handle this case. In effect it scrubs secondary actions with the action you want +to recommend. 
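The log-likelihood ratio strengths mentioned above are computed from a 2x2 contingency table of cooccurrence counts. The following Python sketch shows the standard entropy-based formulation of the LLR test. It is offered as an illustration only and is not the code that *spark-itemsimilarity* runs; the function names are chosen here for readability, and the reading of the four counts given in the comments is an assumption about how such a table is usually laid out.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR score for a 2x2 table of counts.

    Assumed layout: k11 = users with both items, k12 = item A only,
    k21 = item B only, k22 = neither item.
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by floating point round-off.
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))


print(log_likelihood_ratio(10, 10, 10, 10))  # independent counts, score near zero
print(log_likelihood_ratio(10, 0, 0, 10))    # strongly associated counts
```

A score near zero means the two items cooccur about as often as chance would predict; larger scores indicate a more surprising association, which is why LLR helps to filter noise from raw cooccurrence counts.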
+ + spark-itemsimilarity Mahout 1.0 + Usage: spark-itemsimilarity [options] + + Input, output options + -i <value> | --input <value> + Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required) + -i2 <value> | --input2 <value> + Secondary input path for cross-similarity calculation, same restrictions as "--input" (optional). Default: empty. + -o <value> | --output <value> + Path for output, any local or HDFS supported URI (required) + + Algorithm control options: + -mppu <value> | --maxPrefs <value> + Max number of preferences to consider per user (optional). Default: 500 + -m <value> | --maxSimilaritiesPerItem <value> + Limit the number of similarities per item to this number (optional). Default: 100 + + Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure. + + Input text file schema options: + -id <value> | --inDelim <value> + Input delimiter character (optional). Default: "[,\t]" + -f1 <value> | --filter1 <value> + String (or regex) whose presence indicates a datum for the primary item set (optional). Default: no filter, all data is used + -f2 <value> | --filter2 <value> + String (or regex) whose presence indicates a datum for the secondary item set (optional). If not present no secondary dataset is collected + -rc <value> | --rowIDColumn <value> + Column number (0 based Int) containing the row ID string (optional). Default: 0 + -ic <value> | --itemIDColumn <value> + Column number (0 based Int) containing the item ID string (optional). Default: 1 + -fc <value> | --filterColumn <value> + Column number (0 based Int) containing the filter string (optional). Default: -1 for no filter + + Using all defaults the input is expected of the form: "userID<tab>itemId" or "userID<tab>itemID<tab>any-text..." 
and all rows will be used + + File discovery options: + -r | --recursive + Searched the -i path recursively for files that match --filenamePattern (optional), Default: false + -fp <value> | --filenamePattern <value> + Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory + + Output text file schema options: + -rd <value> | --rowKeyDelim <value> + Separates the rowID key from the vector values list (optional). Default: "\t" + -cd <value> | --columnIdStrengthDelim <value> + Separates column IDs from their values in the vector values list (optional). Default: ":" + -td <value> | --elementDelim <value> + Separates vector element values in the values list (optional). Default: " " + -os | --omitStrength + Do not write the strength to the output files (optional), Default: false. + This option is used to output indexable data for creating a search engine recommender. + + Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..." + + Spark config options: + -ma <value> | --master <value> + Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]" + -sem <value> | --sparkExecutorMem <value> + Max Java heap available as "executor memory" on each node (optional). Default: 4g + -rs <value> | --randomSeed <value> + + -h | --help + prints this usage text + +This looks daunting, but the defaults are simple and fairly sane: with no options set, the job takes exactly the same input as the legacy code. It is also quite flexible. It allows the user to point to a single text file, a directory full of files, or a tree of directories to be traversed recursively. The files included can be specified with either a regex-style pattern or a filename. The schema for the file is defined by column numbers, which map to the important bits of data including IDs and values.
The files can even contain filters, which allow unneeded rows to be discarded or used for cross-cooccurrence calculations. + +See ItemSimilarityDriver.scala in Mahout's spark module if you want to customize the code. + +###Defaults in the _**spark-itemsimilarity**_ CLI + +If all defaults are used the input can be as simple as: + + userID1,itemID1 + userID2,itemID2 + ... + +With the command line: + + + bash$ mahout spark-itemsimilarity --input in-file --output out-dir + + +This will use the "local" Spark context and will output the standard text version of a DRM: + + itemID1<tab>itemID2:value2<space>itemID10:value10... + +###<a name="multiple-actions">How To Use Multiple User Actions</a> + +Often we record various actions the user takes for later analytics. These can now be used to make recommendations. +The idea of a recommender is to recommend the action you want the user to take. For an ecom app this might be +a purchase action. It is usually not a good idea to just treat other actions the same as the action you want to recommend. +For instance, a view of an item does not indicate the same intent as a purchase, and if you just mixed the two together you +might even make worse recommendations. It is tempting, though, since there are so many more views than purchases. With *spark-itemsimilarity* +we can now use both actions. Mahout will use cross-action cooccurrence analysis to limit the views to ones that do predict purchases. +We do this by treating the primary action (purchase) as data for the indicator matrix and using the secondary action (view) +to calculate the cross-cooccurrence indicator matrix. + +*spark-itemsimilarity* can read separate actions from separate files or from a mixed action log by filtering certain lines.
For a mixed +action log of the form: + + u1,purchase,iphone + u1,purchase,ipad + u2,purchase,nexus + u2,purchase,galaxy + u3,purchase,surface + u4,purchase,iphone + u4,purchase,galaxy + u1,view,iphone + u1,view,ipad + u1,view,nexus + u1,view,galaxy + u2,view,iphone + u2,view,ipad + u2,view,nexus + u2,view,galaxy + u3,view,surface + u3,view,nexus + u4,view,iphone + u4,view,ipad + u4,view,galaxy + +###Command Line + + +Use the following options: + + # --input: where to look for data + # --output: root dir for output + # --master: URL of the Spark master server + # --filter1: word that flags input for the primary action + # --filter2: word that flags input for the secondary action + # --itemIDColumn: column that has the item ID + # --rowIDColumn: column that has the user ID + # --filterColumn: column that has the filter word + bash$ mahout spark-itemsimilarity \ + --input in-file \ + --output out-path \ + --master masterUrl \ + --filter1 purchase \ + --filter2 view \ + --itemIDColumn 2 \ + --rowIDColumn 0 \ + --filterColumn 1 + + + +###Output + +The output of the job will be the standard text version of two Mahout DRMs.
This is a case where we are calculating +cross-cooccurrence, so a primary indicator matrix and a cross-cooccurrence indicator matrix will be created: + + out-path + |-- similarity-matrix - TDF part files + \-- cross-similarity-matrix - TDF part files + +The indicator matrix will contain the lines: + + galaxy\tnexus:1.7260924347106847 + ipad\tiphone:1.7260924347106847 + nexus\tgalaxy:1.7260924347106847 + iphone\tipad:1.7260924347106847 + surface + +The cross-cooccurrence indicator matrix will contain: + + iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847 + ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897 + nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897 + galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847 + surface\tsurface:4.498681156950466 nexus:0.6795961471815897 + +**Note:** You can run this multiple times to use more than two actions or you can use the underlying +SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any number of cross-cooccurrence indicators. + +###Log File Input + +A common method of storing data is in log files. If they are written using some delimiter they can be consumed directly by spark-itemsimilarity.
For instance input of the form: + + 2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone + 2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad + 2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus + 2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy + 2014-06-23 14:46:53.115\tu3\tpurchase\trandom text\tsurface + 2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tiphone + 2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tgalaxy + 2014-06-23 14:46:53.115\tu1\tview\trandom text\tiphone + 2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad + 2014-06-23 14:46:53.115\tu1\tview\trandom text\tnexus + 2014-06-23 14:46:53.115\tu1\tview\trandom text\tgalaxy + 2014-06-23 14:46:53.115\tu2\tview\trandom text\tiphone + 2014-06-23 14:46:53.115\tu2\tview\trandom text\tipad + 2014-06-23 14:46:53.115\tu2\tview\trandom text\tnexus + 2014-06-23 14:46:53.115\tu2\tview\trandom text\tgalaxy + 2014-06-23 14:46:53.115\tu3\tview\trandom text\tsurface + 2014-06-23 14:46:53.115\tu3\tview\trandom text\tnexus + 2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone + 2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad + 2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy + +can be parsed with the following CLI and run on the cluster, producing the same output as the above example. + + bash$ mahout spark-itemsimilarity \ + --input in-file \ + --output out-path \ + --master spark://sparkmaster:4044 \ + --filter1 purchase \ + --filter2 view \ + --inDelim "\t" \ + --itemIDColumn 4 \ + --rowIDColumn 1 \ + --filterColumn 2 + +##2. spark-rowsimilarity + +*spark-rowsimilarity* is the companion to *spark-itemsimilarity*. The primary difference is that it takes a text file version of +a matrix of sparse vectors with optional application specific IDs and it finds similar rows rather than items (columns). Its use is +not limited to collaborative filtering. The input is in text-delimited form where there are three delimiters used.
By +default it reads (rowID<tab>columnID1:strength1<space>columnID2:strength2...). Since this job only supports LLR similarity, + which does not use the input strengths, they may be omitted in the input. It writes +(rowID<tab>rowID1:strength1<space>rowID2:strength2...). +The output is sorted by strength descending. The output can be interpreted as a row ID from the primary input followed +by a list of the most similar rows. + +The command line interface is: + + spark-rowsimilarity Mahout 1.0 + Usage: spark-rowsimilarity [options] + + Input, output options + -i <value> | --input <value> + Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required) + -o <value> | --output <value> + Path for output, any local or HDFS supported URI (required) + + Algorithm control options: + -mo <value> | --maxObservations <value> + Max number of observations to consider per row (optional). Default: 500 + -m <value> | --maxSimilaritiesPerRow <value> + Limit the number of similarities per item to this number (optional). Default: 100 + + Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure. + + Output text file schema options: + -rd <value> | --rowKeyDelim <value> + Separates the rowID key from the vector values list (optional). Default: "\t" + -cd <value> | --columnIdStrengthDelim <value> + Separates column IDs from their values in the vector values list (optional). Default: ":" + -td <value> | --elementDelim <value> + Separates vector element values in the values list (optional). Default: " " + -os | --omitStrength + Do not write the strength to the output files (optional), Default: false. + This option is used to output indexable data for creating a search engine recommender. + + Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..." 
+ + File discovery options: + -r | --recursive + Searched the -i path recursively for files that match --filenamePattern (optional), Default: false + -fp <value> | --filenamePattern <value> + Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory + + Spark config options: + -ma <value> | --master <value> + Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]" + -sem <value> | --sparkExecutorMem <value> + Max Java heap available as "executor memory" on each node (optional). Default: 4g + -rs <value> | --randomSeed <value> + + -h | --help + prints this usage text + +See RowSimilarityDriver.scala in Mahout's spark module if you want to customize the code. + +#3. Using *spark-rowsimilarity* with Text Data + +Another use case for *spark-rowsimilarity* is in finding similar textual content. For instance, given the tags associated with +a blog post, + we can find which other posts have similar tags. In this case the columns are tags and the rows are posts. Since LLR is +the only similarity method supported, this is not the optimal way to determine general "bag-of-words" document similarity. +LLR is used more as a quality filter than as a similarity measure. However, *spark-rowsimilarity* will produce +lists of similar docs for every doc if the input is docs with lists of terms. The Apache [Lucene](http://lucene.apache.org) project provides several methods of [analyzing and tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description) documents. + +#<a name="unified-recommender">4.
Creating a Unified Recommender</a> + +Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can build a unified cooccurrence and content based + recommender that can be used in either or both modes, depending on the indicators available and the history available at +runtime for a user. + +##Requirements + +1. Mahout 0.10.0 or later +2. Hadoop +3. Spark, the correct version for your version of Mahout and Hadoop +4. A search engine like Solr or Elasticsearch + +##Indicators + +Indicators come in 3 types: + +1. **Cooccurrence**: calculated with *spark-itemsimilarity* from user actions +2. **Content**: calculated from item metadata or content using *spark-rowsimilarity* +3. **Intrinsic**: assigned to items as metadata. Can be anything that describes the item. + +The query for recommendations will be a mix of values, each meant to match one of your indicators. The query can be constructed +from user history and values derived from context (category being viewed for instance) or special precalculated data +(popularity rank for instance). This blending of indicators allows for creating many flavors of recommendations to fit +a very wide variety of circumstances. + +With the right mix of indicators developers can construct a single query that works for completely new items and new users +while working well for items with lots of interactions and users with many recorded actions. In other words, by adding in content and intrinsic +indicators developers can create a solution for the "cold-start" problem that gracefully improves with more user history +and as items have more interactions. It is also possible to create a completely content-based recommender that personalizes +recommendations. + +##Example with 3 Indicators + +You will need to decide how you store user action data so they can be processed by the item and row similarity jobs, and +this is most easily done by using text files as described above.
The data that is processed by these jobs is considered the +training data. You will need some amount of user history in your recs query. It is typical to use the most recent user history, +but it need not be exactly what is in the training set, which may include a greater volume of historical data. Keeping the user +history for query purposes could be done with a database by storing it in a users table. In the example above the two +collaborative filtering actions are "purchase" and "view", but let's also add tags (taken from catalog categories or other +descriptive metadata). + +We will need to create 1 cooccurrence indicator from the primary action (purchase), 1 cross-action cooccurrence indicator +from the secondary action (view), +and 1 content indicator (tags). We'll have to run *spark-itemsimilarity* once and *spark-rowsimilarity* once. + +We have described how to create the collaborative filtering indicator and cross-cooccurrence indicator for purchase and view (the [How to use Multiple User +Actions](#multiple-actions) section) but tags will be a slightly different process. We want to use the fact that +certain items have tags similar to the ones associated with a user's purchases. This is not a collaborative filtering indicator +but rather a "content" or "metadata" type indicator since you are not using other users' history, only that of the +individual you are making recs for. This means that this method will make recommendations for items that have +no collaborative filtering data, as happens with new items in a catalog. New items may have tags assigned but no one + has purchased or viewed them yet. In the final query we will mix all 3 indicators. + +##Content Indicator + +To create a content-indicator we'll make use of the fact that the user has purchased items with certain tags. We want to find +items with the most similar tags. Notice that other users' behavior is not considered--only other items' tags. This defines a +content or metadata indicator.
Content indicators are used when you want to find items that are similar to other items by using their +content or metadata, not by which users interacted with them. + +For this we need input of the form: + + itemID<tab>list-of-tags + ... + +The full collection will look like the tags column from a catalog DB. For our ecom example it might be: + + 3459860b<tab>men long-sleeve chambray clothing casual + 9446577d<tab>women tops chambray clothing casual + ... + +We'll use *spark-rowsimilarity* because we are looking for similar rows, which encode items in this case. As with the +collaborative filtering indicator and cross-cooccurrence indicator we use the --omitStrength option. The strengths created are +probabilistic log-likelihood ratios and so are used to filter unimportant similarities. Once the filtering or downsampling +is finished we no longer need the strengths. We will get an indicator matrix of the form: + + itemID<tab>list-of-item IDs + ... + +This is a content indicator since it has found other items with similar content or metadata. + + 3459860b<tab>3459860b 3459860b 6749860c 5959860a 3434860a 3477860a + 9446577d<tab>9446577d 9496577d 0943577d 8346577d 9442277d 9446577e + ... + +We now have three indicators, two collaborative filtering type and one content type. + +##Unified Recommender Query + +The actual form of the query for recommendations will vary depending on your search engine but the intent is the same. +For a given user, map their history of an action or content to the correct indicator field and perform an OR'd query. + +We have 3 indicators. These are indexed by the search engine into 3 fields, which we'll call "purchase", "view", and "tags".
+We take the user's history that corresponds to each indicator and create a query of the form: + + Query: + field: purchase; q:user's-purchase-history + field: view; q:user's-view-history + field: tags; q:user's-tags-associated-with-purchases + +The query will result in an ordered list of items recommended for purchase but skewed towards items with similar tags to +the ones the user has already purchased. + +This is only an example and not necessarily the optimal way to create recs. It illustrates how business decisions can be +translated into recommendations. This technique can also be used to skew recommendations towards intrinsic indicators. +For instance you may want to put personalized popular item recs in a special place in the UI. Create a popularity indicator +by tagging items with some category of popularity (hot, warm, cold for instance) then +index that as a new indicator field and include the corresponding value in a query +on the popularity field. If we use the ecom example but use the query to get "hot" recommendations it might look like this: + + Query: + field: purchase; q:user's-purchase-history + field: view; q:user's-view-history + field: popularity; q:"hot" + +This will return recommendations favoring ones that have the intrinsic indicator "hot". + +##Notes +1. Use as much user action history as you can gather. Choose a primary action that is closest to what you want to recommend and the others will be used to create cross-cooccurrence indicators. Using more data in this fashion will almost always produce better recommendations. +2. Content indicators can be used where there is no recorded user behavior or when items change too quickly to get much interaction history. They can be used alone or mixed with other indicators. +3. Most search engines support "boost" factors so you can favor one or more indicators. In the example query, if you want tags to only have a small effect you could boost the CF indicators. +4.
In the examples we have used space delimited strings for lists of IDs in indicators and in queries. It may be better to use arrays of strings if your storage system and search engine support them. For instance Solr allows multi-valued fields, which correspond to arrays.
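The OR'd query and the boost factors described above can be sketched as follows. This is only an illustration: it assumes a search engine that accepts an Elasticsearch-style `bool`/`should` query with `match` clauses, and the function name `build_recs_query`, the boost values, and the sample history are hypothetical. Solr and other engines use a different syntax for the same idea.

```python
import json


def build_recs_query(history_by_field, boosts=None, size=10):
    """Build an OR'd (bool/should) query, one match clause per indicator field.

    history_by_field: dict mapping an indicator field name to a list of IDs
                      or tags taken from the user's history.
    boosts:           optional dict mapping field name to a boost factor.
    """
    boosts = boosts or {}
    should = []
    for field, values in history_by_field.items():
        if not values:
            continue  # no history for this indicator, so skip the clause
        should.append({
            "match": {
                field: {
                    # Space-delimited IDs, as in the examples in this document.
                    "query": " ".join(values),
                    "boost": boosts.get(field, 1.0),
                }
            }
        })
    return {"size": size, "query": {"bool": {"should": should}}}


# Hypothetical user history; the field names follow the example above.
query = build_recs_query(
    {
        "purchase": ["iphone", "ipad"],
        "view": ["iphone", "ipad", "nexus", "galaxy"],
        "tags": [],
    },
    boosts={"purchase": 2.0},
)
print(json.dumps(query, indent=2))
```

Skipping fields with no history is what lets the same query builder serve new users: a user with no purchases simply gets a query made of whatever indicators do have values.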

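The indicator fields that such a query searches have to be built from the job output first. The sketch below reads lines in the default text format described earlier (row key, tab, then space-separated `id:strength` elements) and merges several indicators into one document per item. The function names and the document layout are assumptions made for illustration, and the sketch assumes item IDs do not contain the delimiter characters.

```python
def parse_indicator_line(line, row_delim="\t", elem_delim=" ", strength_delim=":"):
    """Split one line of indicator output into (item_id, [similar_item_ids])."""
    item_id, _, rest = line.rstrip("\n").partition(row_delim)
    similar = []
    for element in rest.split(elem_delim):
        if not element:
            continue  # rows such as "surface" have no similar items
        # Drop the strength if present; with --omitStrength there is none.
        similar.append(element.split(strength_delim)[0])
    return item_id, similar


def merge_indicators(indicators):
    """Combine several indicator outputs into one document per item.

    indicators: dict mapping a field name to an iterable of text lines.
    Returns a dict: item_id -> {"id": item_id, field_name: [ids...], ...}.
    """
    docs = {}
    for field, lines in indicators.items():
        for line in lines:
            if not line.strip():
                continue  # ignore blank lines
            item_id, similar = parse_indicator_line(line)
            docs.setdefault(item_id, {"id": item_id})[field] = similar
    return docs


# Sample lines in the default output format shown earlier in this document.
purchase_lines = ["galaxy\tnexus:1.7260924347106847", "surface"]
view_lines = ["galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847"]
docs = merge_indicators({"purchase": purchase_lines, "view": view_lines})
print(docs["galaxy"])
```

Each resulting document holds one list-valued field per indicator, which matches the suggestion in the notes above to use arrays of strings where the search engine supports multi-valued fields.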