[jira] [Issue Comment Edited] (MAHOUT-831) @Experimental annotation to indicate which implementations are not intended for production use

Dan Brickley (Issue Comment Edited) (JIRA) Fri, 07 Oct 2011 19:27:55 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123350#comment-13123350
 ]


Dan Brickley edited comment on MAHOUT-831 at 10/8/11 2:25 AM:
--------------------------------------------------------------

Nice to have in the Java for sure, but where exactly? And how does it relate to 
other metadata about 'algorithms' (and tools/utils/jobs)?

The two closest things I've seen to an overview of the algorithms are:

* config for bin/mahout utility: 
http://svn.apache.org/repos/asf/mahout/trunk/src/conf/driver.classes.props
* main Algorithms Wiki page: 
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

The Wiki list is fairly free-form, but it commonly also links to background 
info in the Wiki and flags where something is in progress or 'open' for 
contributions. It often lists the associated JIRA IDs as well.

From the Wiki:
 
Classification:
* Logistic Regression (SGD)
* Bayesian
* Support Vector Machines (SVM) (open: MAHOUT-14, MAHOUT-232 and MAHOUT-334)
* Perceptron and Winnow (open: MAHOUT-85)
* Neural Network (open, but MAHOUT-228 might help)
* Random Forests (integrated - MAHOUT-122, MAHOUT-140, MAHOUT-145)
* Restricted Boltzmann Machines (open, MAHOUT-375, GSOC2010)
* Online Passive Aggressive (awaiting patch commit, MAHOUT-702)

Clustering:
* Canopy Clustering (MAHOUT-3 - integrated)
* K-Means Clustering (MAHOUT-5 - integrated)
* Fuzzy K-Means (MAHOUT-74 - integrated)
* Expectation Maximization (EM) (MAHOUT-28)
* Mean Shift Clustering (MAHOUT-15 - integrated)
* Hierarchical Clustering (MAHOUT-19)
* Dirichlet Process Clustering (MAHOUT-30 - integrated)
* Latent Dirichlet Allocation (MAHOUT-123 - integrated)
* Spectral Clustering (MAHOUT-363 - integrated)
* Minhash Clustering (MAHOUT-344 - integrated)

Other topics (I'm too lazy to transcribe every heading):

* Parallel FP Growth Algorithm (Also known as Frequent Itemset mining)
* Locally Weighted Linear Regression (open)
* Singular Value Decomposition and other Dimension Reduction Techniques 
  (available since 0.3)
* Principal Components Analysis (PCA) (open)
* Independent Component Analysis (open)
* Gaussian Discriminative Analysis (GDA) (open)
* Evolutionary Algorithms 'see also: MAHOUT-56 (integrated)'
* Non-distributed recommenders ("Taste") (integrated)
* Distributed recommenders (item-based) (integrated)
* RowSimilarityJob – Builds an inverted index and then computes distances 
  between items that have co-occurrences. This is a fully distributed 
  calculation.
* VectorDistanceJob – Does a map side join between a set of "seed" vectors and 
  all of the input vectors.
* Collocations
* Non-MapReduce algorithms - 'Hidden Markov Models (HMM) (open)'



Compare with the bin/mahout config, which I'll copy below for completeness. 
It associates a class name with a short command name and a one-line 
description.
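
The "class name = short name : description" layout can be read with plain 
java.util.Properties. The sketch below is only an illustration: DriverEntry and 
its methods are made-up names, and this is not claimed to be how bin/mahout 
itself parses the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** One entry of a driver.classes.props-style file (hypothetical helper, not Mahout code). */
final class DriverEntry {
    final String className;
    final String shortName;
    final String description;

    DriverEntry(String className, String shortName, String description) {
        this.className = className;
        this.shortName = shortName;
        this.description = description;
    }

    /** Splits a property value such as "kmeans : K-means clustering" at the first colon. */
    static DriverEntry fromProperty(String className, String value) {
        int colon = value.indexOf(':');
        if (colon < 0) {
            // No description given: treat the whole value as the short name.
            return new DriverEntry(className, value.trim(), "");
        }
        return new DriverEntry(className,
                value.substring(0, colon).trim(),
                value.substring(colon + 1).trim());
    }

    /** Parses a single "class = shortName : description" line. */
    static DriverEntry parseLine(String line) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(line));
        } catch (IOException e) {
            // A StringReader should not fail, but load() declares IOException.
            throw new UncheckedIOException(e);
        }
        if (props.size() != 1) {
            throw new IllegalArgumentException("expected exactly one entry: " + line);
        }
        String className = props.stringPropertyNames().iterator().next();
        return fromProperty(className, props.getProperty(className));
    }
}
```

Splitting at the first colon only means a description may itself contain 
colons, and an entry written without a space before the colon (as in the 
'seq2sparse:' line) still yields a clean short name.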

Somehow this should all join up better, but it's not clear to me where the 
status info should canonically live. The Java annotation is nice, but it 
won't help users who arrive via the Wiki algorithms list or the commandline 
utility.
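
For concreteness, a marker annotation of this kind might look like the sketch 
below. This is an assumption for illustration only, not the annotation from the 
attached MAHOUT-831.patch; the value() element, the two demo classes and 
ExperimentalCheck are all hypothetical names. With RUNTIME retention, a 
launcher could at least detect the marker reflectively, which a source-only 
annotation would not allow.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Hypothetical marker; the annotation in the actual patch may differ. */
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Experimental {
    /** Optional free-text note, for example a pointer to an issue. */
    String value() default "";
}

/** Stand-in for an experimental driver class (not a real Mahout class). */
@Experimental("example note")
final class DemoExperimentalDriver { }

/** Stand-in for a driver class without the marker (not a real Mahout class). */
final class DemoStableDriver { }

final class ExperimentalCheck {
    /** Returns true if the class carries the runtime-visible marker. */
    static boolean isExperimental(Class<?> driverClass) {
        return driverClass.isAnnotationPresent(Experimental.class);
    }

    /** Builds a warning line a launcher could print, or an empty string. */
    static String warningFor(Class<?> driverClass) {
        if (!isExperimental(driverClass)) {
            return "";
        }
        return "WARNING: " + driverClass.getSimpleName()
                + " is marked experimental and is not intended for production use";
    }
}
```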



How about:

1. The Java annotations should be targeted to match the items listed in 
driver.classes.props.
2. Either the short or the long name from driver.classes.props should be used 
as the key in an extra config file that adds metadata about the components: 
maybe their JIRA, their status, and their canonical wiki URL for further 
reading, plus any other category info that helps distinguish 
algorithms ('svd', 'kmeans', 'lda', ...?) from tools/utilities ('vectordump', 
'clusterdump', 'seqdumper', 'splitDataset', the matrix stuff?) and examples 
('prepare20newsgroups', 'wikipediaXMLSplitter' ...). Alternatively, just expand 
driver.classes.props, at the risk of breaking anything that reads the current 
format.
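
One possible shape for such an extra config file is sketched below, keyed by 
the short name with a dotted suffix per field. The file layout, the field names 
(category, status, jira) and the ComponentMetadata class are all assumptions 
made for illustration; only the K-Means values are taken from the Wiki list 
quoted above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical per-command metadata, keyed by the short name from driver.classes.props. */
final class ComponentMetadata {
    /** Example content of the assumed extra config file. */
    static final String SAMPLE =
            "kmeans.category = algorithm\n"
          + "kmeans.status = integrated\n"
          + "kmeans.jira = MAHOUT-5\n"
          + "vectordump.category = tool\n"
          + "prepare20newsgroups.category = example\n";

    final String category;
    final String status;
    final String jira;

    ComponentMetadata(String category, String status, String jira) {
        this.category = category;
        this.status = status;
        this.jira = jira;
    }

    /** Loads the metadata text into a Properties object. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Looks up the fields for one short name; missing fields get a default. */
    static ComponentMetadata lookup(Properties metadata, String shortName) {
        return new ComponentMetadata(
                metadata.getProperty(shortName + ".category", "unknown"),
                metadata.getProperty(shortName + ".status", "unknown"),
                metadata.getProperty(shortName + ".jira", ""));
    }
}
```

Keeping this in a separate file leaves the existing driver.classes.props format 
untouched, which avoids the compatibility risk of expanding it in place.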

BTW, what about the stability of the short command names understood by 
bin/mahout? We've got 'seqdumper' yet 'vectordump' here, which makes them hard 
to remember. Also, the names 'svd' and 'cleansvd' don't make clear that they 
refer to Lanczos rather than the other SVD variants in there.

With a little more metadata it ought to be possible for (a) the guts of the 
main commandline-accessible Algorithms wiki page to be machine-generated 
periodically, (b) each main algorithm or commandline option to have a wiki page 
(again with auto-generated basic documentation of the commandline options), and 
(c) there to be a clear path from commandline --help to an agreed Wiki URL, so 
that notes and so on can be found more easily. For Java coders, a canonical and 
fresh Javadoc would also be a great thing to be able to link to; is there a 
preferred site for that? (The links in MAHOUT-547 no longer work...) Excuse the 
wishlistery...
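
As a rough idea of what the machine-generated part in (a) could look like, the 
sketch below renders entries in the same "* Name (JIRA - status)" style the 
Wiki list uses. WikiComponent and WikiListGenerator are hypothetical names, and 
where the inputs would come from (driver.classes.props plus some metadata 
source) is left open.

```java
import java.util.List;

/** Hypothetical record of one component to be listed on the wiki page. */
final class WikiComponent {
    final String shortName;
    final String title;
    final String jira;
    final String status;

    WikiComponent(String shortName, String title, String jira, String status) {
        this.shortName = shortName;
        this.title = title;
        this.jira = jira;
        this.status = status;
    }
}

final class WikiListGenerator {
    /** Formats one bullet, e.g. "* K-Means Clustering (MAHOUT-5 - integrated)". */
    static String bullet(WikiComponent c) {
        StringBuilder sb = new StringBuilder("* ").append(c.title);
        boolean hasJira = !c.jira.isEmpty();
        boolean hasStatus = !c.status.isEmpty();
        if (hasJira || hasStatus) {
            sb.append(" (");
            if (hasJira) {
                sb.append(c.jira);
            }
            if (hasJira && hasStatus) {
                sb.append(" - ");
            }
            if (hasStatus) {
                sb.append(c.status);
            }
            sb.append(')');
        }
        return sb.toString();
    }

    /** Renders a heading followed by one bullet per component. */
    static String render(String heading, List<WikiComponent> components) {
        StringBuilder page = new StringBuilder(heading).append(":\n");
        for (WikiComponent c : components) {
            page.append(bullet(c)).append('\n');
        }
        return page.toString();
    }
}
```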

org.apache.mahout.utils.vectors.VectorDumper = vectordump : Dump vectors from a sequence file to text
org.apache.mahout.utils.clustering.ClusterDumper = clusterdump : Dump cluster output to text
org.apache.mahout.utils.SequenceFileDumper = seqdumper : Generic Sequence File dumper
org.apache.mahout.cf.taste.hadoop.als.eval.DatasetSplitter = splitDataset : split a rating dataset into training and probe parts
org.apache.mahout.cf.taste.hadoop.als.eval.InMemoryFactorizationEvaluator = evaluateFactorization : compute RMSE of a rating matrix factorization against probes in memory
org.apache.mahout.cf.taste.hadoop.als.eval.ParallelFactorizationEvaluator = evaluateFactorizationParallel : compute RMSE of a rating matrix factorization against probes
org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering
org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver = fkmeans : Fuzzy K-means clustering
org.apache.mahout.clustering.lda.LDADriver = lda : Latent Dirchlet Allocation
org.apache.mahout.clustering.lda.LDAPrintTopics = ldatopics : LDA Print Topics
org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver = fpg : Frequent Pattern Growth
org.apache.mahout.clustering.dirichlet.DirichletDriver = dirichlet : Dirichlet Clustering
org.apache.mahout.clustering.meanshift.MeanShiftCanopyDriver = meanshift : Mean Shift clustering
org.apache.mahout.clustering.canopy.CanopyDriver = canopy : Canopy clustering
org.apache.mahout.math.hadoop.TransposeJob = transpose : Take the transpose of a matrix
org.apache.mahout.math.hadoop.MatrixMultiplicationJob = matrixmult : Take the product of two matrices
org.apache.mahout.utils.vectors.lucene.Driver = lucene.vector : Generate Vectors from a Lucene index
org.apache.mahout.utils.vectors.arff.Driver = arff.vector : Generate Vectors from an ARFF file or directory
org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : Generate sequence files (of Text) from a directory
org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles = seq2sparse: Sparse Vector generation from Text sequence files
org.apache.mahout.utils.vectors.RowIdJob = rowid : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
org.apache.mahout.text.WikipediaToSequenceFile = seqwiki : Wikipedia xml dump to sequence file
org.apache.mahout.classifier.bayes.TestClassifier = testclassifier : Test the text based Bayes Classifier
org.apache.mahout.classifier.bayes.TrainClassifier = trainclassifier : Train the text based Bayes Classifier
org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups = prepare20newsgroups : Reformat 20 newsgroups data
org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver = svd : Lanczos Singular Value Decomposition
org.apache.mahout.math.hadoop.decomposer.EigenVerificationJob = cleansvd : Cleanup and verification of SVD output
org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob = rowsimilarity : Compute the pairwise similarities of the rows of a matrix
org.apache.mahout.math.hadoop.similarity.VectorDistanceSimilarityJob = vecdist : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob = itemsimilarity : Compute the item-item-similarities for item-based collaborative filtering
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob = recommenditembased : Compute recommendations using item-based collaborative filtering
org.apache.mahout.classifier.sgd.TrainLogistic = trainlogistic : Train a logistic regression using stochastic gradient descent
org.apache.mahout.classifier.sgd.RunLogistic = runlogistic : Run a logistic regression model against CSV data
org.apache.mahout.classifier.sgd.PrintResourceOrFile = cat : Print a file or resource as the logistic regression models would see it
org.apache.mahout.classifier.sgd.TrainAdaptiveLogistic = trainAdaptiveLogistic : Train an AdaptivelogisticRegression model
org.apache.mahout.classifier.sgd.ValidateAdaptiveLogistic = validateAdaptiveLogistic : Validate an AdaptivelogisticRegression model against hold-out data set
org.apache.mahout.classifier.sgd.RunAdaptiveLogistic = runAdaptiveLogistic : Score new production data using a probably trained and validated AdaptivelogisticRegression model
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter = wikipediaXMLSplitter : Reads wikipedia data and creates ch
org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver = wikipediaDataSetCreator : Splits data set of wikipedia wrt feature like country
org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli = ssvd : Stochastic SVD
org.apache.mahout.clustering.spectral.eigencuts.EigencutsDriver = eigencuts : Eigencuts spectral clustering
org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver = spectralkmeans : Spectral k-means clustering
org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob = parallelALS : ALS-WR factorization of a rating matrix
org.apache.mahout.cf.taste.hadoop.als.PredictionJob = predictFromFactorization : predict preferences from a factorization of a rating matrix
org.apache.mahout.classifier.sequencelearning.hmm.BaumWelchTrainer = baumwelch : Baum-Welch algorithm for unsupervised HMM training
org.apache.mahout.classifier.sequencelearning.hmm.ViterbiEvaluator = viterbi : Viterbi decoding of hidden states from given output states sequence
org.apache.mahout.classifier.sequencelearning.hmm.RandomSequenceGenerator = hmmpredict : Generate random sequence of observations by given HMM
org.apache.mahout.utils.SplitInput = split : Split Input data into test and train sets
org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob = trainnb : Train the Vector-based Bayes classifier
org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver = testnb : Test the Vector-based Bayes classifier


                


                  
> @Experimental annotation to indicate which implementations are not intended 
> for production use
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-831
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-831
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.6
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>         Attachments: MAHOUT-831.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
