[
https://issues.apache.org/jira/browse/MAHOUT-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123350#comment-13123350
]
Dan Brickley edited comment on MAHOUT-831 at 10/8/11 2:25 AM:
--------------------------------------------------------------
Nice to have in the Java for sure, but where exactly? And how does it relate to
other metadata about 'algorithms' (and tools/utils/jobs)?
The two closest things I've seen to an overview of the algorithms are:
* config for bin/mahout utility:
http://svn.apache.org/repos/asf/mahout/trunk/src/conf/driver.classes.props
* main Algorithms Wiki page:
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
The Wiki list is fairly free-form, but it commonly links to background info in
the Wiki and flags where something is in progress or 'open' for contributions.
It often gives the associated JIRA IDs too.
From the Wiki:
Classification:
* Logistic Regression (SGD)
* Bayesian
* Support Vector Machines (SVM) (open: MAHOUT-14, MAHOUT-232 and MAHOUT-334)
* Perceptron and Winnow (open: MAHOUT-85)
* Neural Network (open, but MAHOUT-228 might help)
* Random Forests (integrated - MAHOUT-122, MAHOUT-140, MAHOUT-145)
* Restricted Boltzmann Machines (open, MAHOUT-375, GSOC2010)
* Online Passive Aggressive (awaiting patch commit, MAHOUT-702)
Clustering:
* Canopy Clustering (MAHOUT-3 - integrated)
* K-Means Clustering (MAHOUT-5 - integrated)
* Fuzzy K-Means (MAHOUT-74 - integrated)
* Expectation Maximization (EM) (MAHOUT-28)
* Mean Shift Clustering (MAHOUT-15 - integrated)
* Hierarchical Clustering (MAHOUT-19)
* Dirichlet Process Clustering (MAHOUT-30 - integrated)
* Latent Dirichlet Allocation (MAHOUT-123 - integrated)
* Spectral Clustering (MAHOUT-363 - integrated)
* Minhash Clustering (MAHOUT-344 - integrated)
Other topics (I'm too lazy to transcribe every heading):
Parallel FP Growth Algorithm (Also known as Frequent Itemset mining)
Locally Weighted Linear Regression (open)
Singular Value Decomposition and other Dimension Reduction Techniques
(available since 0.3)
Principal Components Analysis (PCA) (open)
Independent Component Analysis (open)
Gaussian Discriminative Analysis (GDA) (open)
Evolutionary Algorithms 'see also: MAHOUT-56 (integrated)'
Non-distributed recommenders ("Taste") (integrated)
Distributed recommenders (item-based) (integrated)
RowSimilarityJob – Builds an inverted index and then computes distances between
items that have co-occurrences. This is a fully distributed calculation.
VectorDistanceJob – Does a map side join between a set of "seed" vectors and
all of the input vectors.
Collocations
Non-MapReduce algorithms - 'Hidden Markov Models (HMM) (open)'
Compare with the bin/mahout config, which I'll copy below for completeness.
This associates a class name with a short command name and a one-line
description.
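As a rough illustration of that class name / short name / description shape,
here is a minimal Java sketch that reads entries with java.util.Properties and
splits each value at the first colon. The DriverEntry class and its field names
are made up for this example, and it makes no claim about how bin/mahout itself
reads the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical holder for one driver.classes.props entry (not a Mahout class). */
final class DriverEntry {
  final String className;
  final String shortName;
  final String description;

  DriverEntry(String className, String shortName, String description) {
    this.className = className;
    this.shortName = shortName;
    this.description = description;
  }

  /**
   * Splits a value of the shape "shortName : description" at the first colon.
   * Assumption: a value without a colon is treated as a bare short name.
   */
  static DriverEntry parse(String className, String value) {
    int colon = value.indexOf(':');
    if (colon < 0) {
      return new DriverEntry(className, value.trim(), "");
    }
    return new DriverEntry(
        className,
        value.substring(0, colon).trim(),
        value.substring(colon + 1).trim());
  }

  public static void main(String[] args) throws IOException {
    // Two entries copied from the listing in this comment.
    String sample =
        "org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering\n"
            + "org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles = seq2sparse: "
            + "Sparse Vector generation from Text sequence files\n";

    Properties props = new Properties();
    props.load(new StringReader(sample));

    // Properties does not preserve file order, so the print order may vary.
    for (String className : props.stringPropertyNames()) {
      DriverEntry entry = parse(className, props.getProperty(className));
      System.out.println(
          entry.shortName + " -> " + entry.className + " (" + entry.description + ")");
    }
  }
}
```

Splitting at the first colon copes with both the "name : text" and "name: text"
spellings that appear in the listing.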
Somehow this all should join up better, but it's not clear to me where the
status info should canonically live. The Java annotation thing is nice, but it
won't help users who come via the Wiki algorithms list, or the commandline
utility.
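For concreteness, a class-level marker annotation of the kind discussed in this
issue might look roughly like the sketch below. This is not the content of
MAHOUT-831.patch; the nested Experimental type, the two driver classes and the
status() helper are all invented here. The one design point it illustrates is
that with runtime retention a tool could read the flag reflectively, which is
one possible way to surface it outside the Javadoc.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Sketch only: every name in this class is hypothetical. */
final class ExperimentalSketch {

  /** Marker for implementations not intended for production use. */
  @Documented
  @Retention(RetentionPolicy.RUNTIME) // kept at runtime so tools can query it
  @Target(ElementType.TYPE)
  @interface Experimental {}

  /** Stand-in for a driver class that would be flagged. */
  @Experimental
  static final class HypotheticalDriver {}

  /** Stand-in for a driver class that would not be flagged. */
  static final class StableDriver {}

  /** Returns a status label based only on the presence of the annotation. */
  static String status(Class<?> driver) {
    return driver.isAnnotationPresent(Experimental.class) ? "experimental" : "stable";
  }

  public static void main(String[] args) {
    System.out.println(HypotheticalDriver.class.getSimpleName() + ": "
        + status(HypotheticalDriver.class));
    System.out.println(StableDriver.class.getSimpleName() + ": "
        + status(StableDriver.class));
  }
}
```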
How about:
1. the Java annotations are targeted to match the items listed in
driver.classes.props
2. either the short or long name from driver.classes.props is used as the key
in an extra config file that adds metadata about the components: maybe their
JIRA, their status, and their canonical wiki URL for more reading, plus any
other category info that helps distinguish algorithms ('svd', 'kmeans', 'lda',
...?) from tools/utilities ('vectordump', 'clusterdump', 'seqdumper',
'splitDataset', the matrix stuff?) and examples ('prepare20newsgroups',
'wikipediaXMLSplitter' ...). Or just expand driver.classes.props, at the risk
of breaking anything that reads the current format.
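To make option 2 a bit more tangible, here is a minimal sketch of such an extra
config file and a lookup over it. The file layout ("shortName.field = value"),
the field names (category, status, jira, wiki), the ComponentMetadata class and
the example.org URL are all assumptions made up for illustration; only the
short names, the categories and the K-Means JIRA ID and status come from the
text above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical lookup over an extra metadata file keyed by bin/mahout short names. */
final class ComponentMetadata {
  private final Properties props = new Properties();

  ComponentMetadata(String fileContents) {
    try {
      props.load(new StringReader(fileContents));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Returns the field for a short name, or "unknown" when nothing is recorded. */
  String get(String shortName, String field) {
    return props.getProperty(shortName + "." + field, "unknown");
  }

  public static void main(String[] args) {
    // Invented file format; the wiki URL is a placeholder, not a real page.
    String sample =
        "kmeans.category = algorithm\n"
            + "kmeans.status = integrated\n"
            + "kmeans.jira = MAHOUT-5\n"
            + "kmeans.wiki = https://example.org/placeholder/kmeans\n"
            + "vectordump.category = tool\n"
            + "prepare20newsgroups.category = example\n";

    ComponentMetadata metadata = new ComponentMetadata(sample);
    for (String name : new String[] {"kmeans", "vectordump", "prepare20newsgroups"}) {
      System.out.println(name
          + " category=" + metadata.get(name, "category")
          + " status=" + metadata.get(name, "status")
          + " jira=" + metadata.get(name, "jira"));
    }
  }
}
```

Keeping this in a separate file leaves the existing driver.classes.props format
untouched, which is the trade-off raised in option 2.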
BTW what about stability of the short command names understood by bin/mahout?
We've got 'seqdumper' but 'vectordump' here, and that kind of inconsistency
makes the names hard to remember. Also the names 'svd' and 'cleansvd' don't
make clear the association with Lanczos, rather than the other SVD variants in
there.
With a little more metadata it ought to be possible for (a) the guts of the
main Algorithms wiki page, at least for the commandline-accessible algorithms,
to be machine-generated periodically; (b) each main algorithm or commandline
option to have a wiki page (again with auto-generated basic documentation of
the commandline options); (c) there to be a clear path from commandline --help
to an agreed Wiki URL, so that notes and so on can be found more easily. For
Java coders, a canonical and fresh Javadoc would also be a great thing to be
able to link to; is there a preferred site for that? (The links in MAHOUT-547
no longer work...). Excuse the wishlistery...
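As a sketch of point (a), the snippet below groups entries by category and
prints them in the same "Heading:" plus "* item" shape as the Wiki excerpt
quoted earlier. The WikiListGenerator class, the row layout and the category
labels are assumptions for illustration; a real generator would need to emit
whatever markup the wiki actually expects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical generator for the body of an algorithms overview page. */
final class WikiListGenerator {

  /** Each row is {category, shortName, description}; categories are sorted by name. */
  static String render(List<String[]> rows) {
    Map<String, List<String>> byCategory = new TreeMap<String, List<String>>();
    for (String[] row : rows) {
      List<String> items = byCategory.get(row[0]);
      if (items == null) {
        items = new ArrayList<String>();
        byCategory.put(row[0], items);
      }
      items.add("* " + row[1] + " - " + row[2]);
    }

    StringBuilder out = new StringBuilder();
    for (Map.Entry<String, List<String>> entry : byCategory.entrySet()) {
      out.append(entry.getKey()).append(":\n");
      for (String item : entry.getValue()) {
        out.append(item).append('\n');
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Short names and descriptions are taken from the bin/mahout config;
    // the category labels are invented for this example.
    List<String[]> rows = new ArrayList<String[]>();
    rows.add(new String[] {"algorithm", "kmeans", "K-means clustering"});
    rows.add(new String[] {"tool", "vectordump", "Dump vectors from a sequence file to text"});
    rows.add(new String[] {"example", "prepare20newsgroups", "Reformat 20 newsgroups data"});
    rows.add(new String[] {"algorithm", "ssvd", "Stochastic SVD"});
    System.out.print(render(rows));
  }
}
```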
org.apache.mahout.utils.vectors.VectorDumper = vectordump : Dump vectors from a sequence file to text
org.apache.mahout.utils.clustering.ClusterDumper = clusterdump : Dump cluster output to text
org.apache.mahout.utils.SequenceFileDumper = seqdumper : Generic Sequence File dumper
org.apache.mahout.cf.taste.hadoop.als.eval.DatasetSplitter = splitDataset : split a rating dataset into training and probe parts
org.apache.mahout.cf.taste.hadoop.als.eval.InMemoryFactorizationEvaluator = evaluateFactorization : compute RMSE of a rating matrix factorization against probes in memory
org.apache.mahout.cf.taste.hadoop.als.eval.ParallelFactorizationEvaluator = evaluateFactorizationParallel : compute RMSE of a rating matrix factorization against probes
org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering
org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver = fkmeans : Fuzzy K-means clustering
org.apache.mahout.clustering.lda.LDADriver = lda : Latent Dirchlet Allocation
org.apache.mahout.clustering.lda.LDAPrintTopics = ldatopics : LDA Print Topics
org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver = fpg : Frequent Pattern Growth
org.apache.mahout.clustering.dirichlet.DirichletDriver = dirichlet : Dirichlet Clustering
org.apache.mahout.clustering.meanshift.MeanShiftCanopyDriver = meanshift : Mean Shift clustering
org.apache.mahout.clustering.canopy.CanopyDriver = canopy : Canopy clustering
org.apache.mahout.math.hadoop.TransposeJob = transpose : Take the transpose of a matrix
org.apache.mahout.math.hadoop.MatrixMultiplicationJob = matrixmult : Take the product of two matrices
org.apache.mahout.utils.vectors.lucene.Driver = lucene.vector : Generate Vectors from a Lucene index
org.apache.mahout.utils.vectors.arff.Driver = arff.vector : Generate Vectors from an ARFF file or directory
org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : Generate sequence files (of Text) from a directory
org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles = seq2sparse: Sparse Vector generation from Text sequence files
org.apache.mahout.utils.vectors.RowIdJob = rowid : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
org.apache.mahout.text.WikipediaToSequenceFile = seqwiki : Wikipedia xml dump to sequence file
org.apache.mahout.classifier.bayes.TestClassifier = testclassifier : Test the text based Bayes Classifier
org.apache.mahout.classifier.bayes.TrainClassifier = trainclassifier : Train the text based Bayes Classifier
org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups = prepare20newsgroups : Reformat 20 newsgroups data
org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver = svd : Lanczos Singular Value Decomposition
org.apache.mahout.math.hadoop.decomposer.EigenVerificationJob = cleansvd : Cleanup and verification of SVD output
org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob = rowsimilarity : Compute the pairwise similarities of the rows of a matrix
org.apache.mahout.math.hadoop.similarity.VectorDistanceSimilarityJob = vecdist : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob = itemsimilarity : Compute the item-item-similarities for item-based collaborative filtering
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob = recommenditembased : Compute recommendations using item-based collaborative filtering
org.apache.mahout.classifier.sgd.TrainLogistic = trainlogistic : Train a logistic regression using stochastic gradient descent
org.apache.mahout.classifier.sgd.RunLogistic = runlogistic : Run a logistic regression model against CSV data
org.apache.mahout.classifier.sgd.PrintResourceOrFile = cat : Print a file or resource as the logistic regression models would see it
org.apache.mahout.classifier.sgd.TrainAdaptiveLogistic = trainAdaptiveLogistic : Train an AdaptivelogisticRegression model
org.apache.mahout.classifier.sgd.ValidateAdaptiveLogistic = validateAdaptiveLogistic : Validate an AdaptivelogisticRegression model against hold-out data set
org.apache.mahout.classifier.sgd.RunAdaptiveLogistic = runAdaptiveLogistic : Score new production data using a probably trained and validated AdaptivelogisticRegression model
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter = wikipediaXMLSplitter : Reads wikipedia data and creates ch
org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver = wikipediaDataSetCreator : Splits data set of wikipedia wrt feature like country
org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli = ssvd : Stochastic SVD
org.apache.mahout.clustering.spectral.eigencuts.EigencutsDriver = eigencuts : Eigencuts spectral clustering
org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver = spectralkmeans : Spectral k-means clustering
org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob = parallelALS : ALS-WR factorization of a rating matrix
org.apache.mahout.cf.taste.hadoop.als.PredictionJob = predictFromFactorization : predict preferences from a factorization of a rating matrix
org.apache.mahout.classifier.sequencelearning.hmm.BaumWelchTrainer = baumwelch : Baum-Welch algorithm for unsupervised HMM training
org.apache.mahout.classifier.sequencelearning.hmm.ViterbiEvaluator = viterbi : Viterbi decoding of hidden states from given output states sequence
org.apache.mahout.classifier.sequencelearning.hmm.RandomSequenceGenerator = hmmpredict : Generate random sequence of observations by given HMM
org.apache.mahout.utils.SplitInput = split : Split Input data into test and train sets
org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob = trainnb : Train the Vector-based Bayes classifier
org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver = testnb : Test the Vector-based Bayes classifier
> @Experimental annotation to indicate which implementations are not intended
> for production use
> ----------------------------------------------------------------------------------------------
>
> Key: MAHOUT-831
> URL: https://issues.apache.org/jira/browse/MAHOUT-831
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.6
> Reporter: Sebastian Schelter
> Assignee: Sebastian Schelter
> Attachments: MAHOUT-831.patch
>
>