[CONF] Apache Mahout > Algorithms

confluence Mon, 31 Oct 2011 06:30:31 -0700

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Algorithms (https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms)



Edited by Grant Ingersoll:
---------------------------------------------------------------------
h2. Algorithms

This section contains links to information, examples, use cases, etc. for the 
various algorithms we intend to implement.  Click the individual links to learn 
more. The initial algorithm descriptions have been copied here from the 
original project proposal. The algorithms are grouped by the application 
setting they can be used for. Where an algorithm has multiple applications, the 
version presented in the paper was chosen; versions as implemented in our 
project will be added as soon as we start working on them.

Original Paper: [Map Reduce for Machine Learning on 
Multicore|http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf]

Papers related to Map Reduce:
* [Evaluating MapReduce for Multi-core and Multiprocessor 
Systems|http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf]
* [Map Reduce: Distributed Computing for Machine 
Learning|http://www.icsi.berkeley.edu/~arlo/publications/gillick_cs262a_proj.pdf]

For papers, videos and books related to machine learning in general, see 
[Machine Learning Resources].

Each algorithm is marked with its status. Algorithms marked _integrated_ have 
an implementation in the development version of Mahout. Algorithms that are 
currently being developed are annotated with a link to the JIRA issue that 
tracks the specific implementation. These issues usually already contain 
patches, which are more or less complete depending on how much work has been 
spent on the issue so far. Algorithms that have not been touched so far are 
marked as _open_.

[What, When, Where, Why (but not How or Who)] \- Community tips, tricks, etc. 
on which algorithm to use in which situation and which errors to watch out 
for.  That is, practical advice on using Mahout for your problems.

h3. Classification

A general introduction to the most common text classification algorithms can be 
found at Google Answers: 
[http://answers.google.com/answers/main?cmd=threadview&id=225316]. For 
information on the algorithms implemented in Mahout (or scheduled for 
implementation), please visit the following pages.

[Logistic Regression] (SGD)

[Bayesian]

[Support Vector Machines] (SVM) (open: 
[MAHOUT-14|http://issues.apache.org/jira/browse/MAHOUT-14], 
[MAHOUT-232|http://issues.apache.org/jira/browse/MAHOUT-232] and 
[MAHOUT-334|https://issues.apache.org/jira/browse/MAHOUT-334]) 

[Perceptron and Winnow] (open: 
[MAHOUT-85|http://issues.apache.org/jira/browse/MAHOUT-85])

[Neural Network] (open, but 
[MAHOUT-228|http://issues.apache.org/jira/browse/MAHOUT-228] might help)

[Random Forests] (integrated - 
[MAHOUT-122|http://issues.apache.org/jira/browse/MAHOUT-122], 
[MAHOUT-140|http://issues.apache.org/jira/browse/MAHOUT-140], 
[MAHOUT-145|http://issues.apache.org/jira/browse/MAHOUT-145])

[Restricted Boltzmann Machines] (open, 
[MAHOUT-375|http://issues.apache.org/jira/browse/MAHOUT-375], GSOC2010)

[Online Passive Aggressive] (awaiting patch commit, 
[MAHOUT-702|http://issues.apache.org/jira/browse/MAHOUT-702])

h3. Clustering

[Reference Reading]

[MAHOUT:Canopy Clustering] 
([MAHOUT-3|https://issues.apache.org/jira/browse/MAHOUT-3] - integrated)

[K-Means Clustering] ([MAHOUT-5|https://issues.apache.org/jira/browse/MAHOUT-5] 
- integrated)

[Fuzzy K-Means] ([MAHOUT-74|https://issues.apache.org/jira/browse/MAHOUT-74] - 
integrated)

[Expectation Maximization] (EM) 
([MAHOUT-28|http://issues.apache.org/jira/browse/MAHOUT-28])

[Mean Shift Clustering] 
([MAHOUT-15|https://issues.apache.org/jira/browse/MAHOUT-15] - integrated)

[Hierarchical Clustering] 
([MAHOUT-19|http://issues.apache.org/jira/browse/MAHOUT-19])

[Dirichlet Process Clustering] 
([MAHOUT-30|http://issues.apache.org/jira/browse/MAHOUT-30] - integrated)

[Latent Dirichlet Allocation] 
([MAHOUT-123|http://issues.apache.org/jira/browse/MAHOUT-123] - integrated)

[Spectral Clustering] 
([MAHOUT-363|https://issues.apache.org/jira/browse/MAHOUT-363] - integrated)

[Minhash Clustering] 
([MAHOUT-344|https://issues.apache.org/jira/browse/MAHOUT-344] - integrated)

h3. Pattern Mining

[Parallel FP Growth Algorithm|Parallel Frequent Pattern Mining] (also known as 
frequent itemset mining)

h3. Regression

[Locally Weighted Linear Regression] (open)


h3. Dimension reduction

[Singular Value Decomposition and other Dimension Reduction 
Techniques|Dimensional Reduction] (available since 0.3)

[Principal Components Analysis] (PCA) (open)

[Independent Component Analysis] (open)

[Gaussian Discriminative Analysis] (GDA) (open)

h3. Evolutionary Algorithms

see also: [MAHOUT-56 
(integrated)|http://issues.apache.org/jira/browse/MAHOUT-56]

Here you will find information, examples, use cases, etc. related to 
Evolutionary Algorithms.

Introductions and Tutorials:
* [Evolutionary Algorithms 
Introduction|http://www.geatbx.com/docu/algindex.html]
* [How to distribute the fitness evaluation using Mahout.GA|Mahout.GA.Tutorial]

Examples:
* [Traveling Salesman]
* [Class Discovery]

h3. Recommenders / Collaborative Filtering

Mahout contains both simple non-distributed recommender implementations and 
distributed Hadoop-based recommenders.

 * [Non-distributed recommenders ("Taste")|Recommender Documentation] 
(integrated)
 * [Distributed recommenders (item-based)|Itembased Collaborative Filtering] 
(integrated)
 * [First-timer FAQ|Recommender First-Timer FAQ]

h3. Vector Similarity

Mahout contains implementations that allow one to compare one or more vectors 
with another set of vectors.  This can be useful if one is, for instance, 
trying to calculate the pairwise similarity between all documents (or a subset 
of docs) in a corpus.

* RowSimilarityJob -- Builds an inverted index and then computes distances 
between items that have co-occurrences.  This is a fully distributed 
calculation.
* VectorDistanceJob -- Does a map-side join between a set of "seed" vectors and 
all of the input vectors.
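
The inverted-index idea behind RowSimilarityJob can be sketched in plain, 
single-machine Java. This is only an illustration under stated assumptions, not 
Mahout's implementation or API: the class {{PairwiseSimilaritySketch}} and its 
methods are hypothetical names, cosine similarity is chosen here as one 
possible measure, and everything runs in memory rather than as a Hadoop job. 
The point it shows is that only rows sharing at least one feature 
(co-occurrences) are ever compared.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
class PairwiseSimilaritySketch {

    /** One entry of the inverted index: which row holds the feature, and with what weight. */
    static final class Posting {
        final int row;
        final double weight;
        Posting(int row, double weight) { this.row = row; this.weight = weight; }
    }

    /** Builds a sparse row (feature index -> weight) from two parallel arrays. */
    static Map<Integer, Double> row(int[] features, double[] weights) {
        Map<Integer, Double> v = new HashMap<>();
        for (int i = 0; i < features.length; i++) {
            v.put(features[i], weights[i]);
        }
        return v;
    }

    static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns cosine similarities keyed by "i,j" (i < j) for every pair of rows
     * that share at least one feature. Pairs without co-occurrences are absent.
     */
    static Map<String, Double> cosineSimilarities(List<Map<Integer, Double>> rows) {
        // Step 1: inverted index, feature -> rows containing that feature.
        Map<Integer, List<Posting>> index = new HashMap<>();
        for (int r = 0; r < rows.size(); r++) {
            for (Map.Entry<Integer, Double> e : rows.get(r).entrySet()) {
                index.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                     .add(new Posting(r, e.getValue()));
            }
        }
        // Step 2: accumulate dot products, touching only co-occurring rows.
        Map<String, Double> dots = new HashMap<>();
        for (List<Posting> postings : index.values()) {
            for (int a = 0; a < postings.size(); a++) {
                for (int b = a + 1; b < postings.size(); b++) {
                    Posting p = postings.get(a);
                    Posting q = postings.get(b);
                    String key = Math.min(p.row, q.row) + "," + Math.max(p.row, q.row);
                    dots.merge(key, p.weight * q.weight, Double::sum);
                }
            }
        }
        // Step 3: normalize each dot product by the two row norms.
        Map<String, Double> sims = new HashMap<>();
        for (Map.Entry<String, Double> e : dots.entrySet()) {
            String[] ids = e.getKey().split(",");
            double denom = norm(rows.get(Integer.parseInt(ids[0])))
                         * norm(rows.get(Integer.parseInt(ids[1])));
            if (denom > 0.0) {  // skip all-zero rows to avoid dividing by zero
                sims.put(e.getKey(), e.getValue() / denom);
            }
        }
        return sims;
    }

    public static void main(String[] args) {
        List<Map<Integer, Double>> rows = new ArrayList<>();
        rows.add(row(new int[] {0, 1}, new double[] {1.0, 1.0}));
        rows.add(row(new int[] {1, 2}, new double[] {1.0, 1.0}));
        rows.add(row(new int[] {3}, new double[] {5.0}));
        System.out.println(cosineSimilarities(rows));
    }
}
```

In the {{main}} method, the third row shares no feature with the other two, so 
it never appears in the result. In a distributed setting the grouping by 
feature would presumably be done by the shuffle phase rather than by an 
in-memory map.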

h3. Other

 * [Collocations]

h3. Non-MapReduce algorithms

Some algorithms and applications that have appeared on the mailing list have 
not been published in MapReduce form so far. As we do not restrict ourselves to 
Hadoop-only versions, these proposals are listed here.

[Hidden Markov Models] (HMM) (MAHOUT-627, MAHOUT-396, MAHOUT-734)


