h1. Dimensional Reduction

Matrix algebra underpins the way many Big Data algorithms and data structures
are composed: full-text search can be viewed as matrix multiplication of the
term-document matrix by the query vector (giving a vector over documents whose
components are the relevance scores); computing co-occurrences in a
collaborative filtering context (people who viewed X also viewed Y, or
ratings-based CF like the Netflix Prize contest) amounts to squaring the
user-item interaction matrix; users who are k degrees separated from each
other in a social network or web graph can be found by looking at the k-fold
product of the graph adjacency matrix; and the list goes on (and these are all
cases where the linear structure of the matrix is preserved!).
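
To make the first example concrete, here is a minimal sketch of retrieval as a
matrix-vector product, in plain Java; the tiny term-document matrix,
vocabulary, and weights are all made up for illustration.

{code:java}
// Minimal sketch: retrieval as matrix-vector multiplication.
// Rows = terms, columns = documents; entries are (made-up) term weights.
public class RetrievalSketch {
  public static void main(String[] args) {
    // 4-term vocabulary x 3-document corpus (hypothetical tf-idf weights)
    double[][] termDoc = {
        {1.2, 0.0, 0.3},   // term 0: "lucene"
        {0.0, 0.8, 0.0},   // term 1: "mahout"
        {0.5, 0.5, 0.0},   // term 2: "matrix"
        {0.0, 0.0, 1.1}    // term 3: "hadoop"
    };
    // Query "mahout matrix" as a 0/1 vector over the vocabulary
    double[] query = {0.0, 1.0, 1.0, 0.0};

    // scores = query^T * termDoc : one relevance score per document
    double[] scores = new double[3];
    for (int doc = 0; doc < 3; doc++) {
      for (int term = 0; term < 4; term++) {
        scores[doc] += query[term] * termDoc[term][doc];
      }
    }
    for (int doc = 0; doc < 3; doc++) {
      System.out.println("doc " + doc + " score = " + scores[doc]);
    }
  }
}
{code}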

Each of these examples deals with matrices which tend to be tremendously large
(often millions, tens of millions, or even hundreds of millions of rows or
more, by sometimes a comparable number of columns), but also rather sparse.
Sparse matrices are nice in some respects: a dense matrix which is 10^7 on a
side would have 100 trillion entries! But the sparsity is often problematic,
because any two given rows (or columns) of the matrix may have zero overlap.
Additionally, any machine-learning work done on the data which comprises the
rows has to deal with what is known as "the curse of dimensionality", and, for
example, there are too many columns to train most regression or classification
models on them directly.
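
As a tiny illustration of the zero-overlap problem (the rows, indices, and
values below are made up): if the non-zero indices of two sparse rows don't
intersect, their dot product, and thus any similarity built on it, is exactly
zero.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Two sparse rows of a (conceptually) 10^7-column matrix, stored as
// index -> value maps. Their non-zero indices don't intersect, so the
// dot product -- and any similarity based on it -- is exactly zero.
public class SparseOverlapSketch {
  static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
    double sum = 0.0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other != null) {
        sum += e.getValue() * other;
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    Map<Integer, Double> row1 = new HashMap<Integer, Double>();
    row1.put(17, 1.0);
    row1.put(4231, 2.5);
    Map<Integer, Double> row2 = new HashMap<Integer, Double>();
    row2.put(99, 0.7);
    row2.put(500000, 1.3);
    System.out.println("overlap = " + dot(row1, row2)); // prints 0.0
  }
}
{code}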

One of the more useful approaches to dealing with such huge sparse data sets
is dimensionality reduction, where a lower-dimensional subspace of the
original column (feature) space of your data is found or constructed, and your
rows are mapped into that subspace (or sub-manifold). In this
reduced-dimensional space, "important" components of the distance between
points are exaggerated and unimportant ones washed away; additionally, the
sparsity of your rows is traded for drastically lower-dimensional, but dense,
"signatures". While this loss of sparsity can lead to its own complications, a
proper dimensionality reduction can help reveal the most important features of
your data, expose correlations among your supposedly independent original
variables, and smooth over the zeroes in your correlation matrix.
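
For a concrete picture of such a mapping, the sketch below projects one sparse
row into a dense k-dimensional signature by multiplying it against a random
Gaussian matrix. This is random projection, used here purely as the simplest
illustration of a linear dimensional reduction; it is not one of Mahout's
decompositions, and the dimensions and indices are made up.

{code:java}
import java.util.Random;

// Illustration only: map a sparse d-dimensional row to a dense
// k-dimensional "signature" via a random Gaussian projection
// (random projection, NOT one of Mahout's decompositions).
public class ProjectionSketch {
  public static void main(String[] args) {
    int k = 50;  // reduced dimensionality (original is ~10^6 columns)

    // Sparse row: parallel arrays of non-zero indices and values
    int[] indices = {3, 101, 999999};
    double[] values = {1.0, 0.5, 2.0};

    // signature = row * R, where R is d x k. Because the row is sparse
    // we only touch the rows of R at the non-zero indices, and we
    // generate those entries on the fly from a seed instead of storing R.
    double[] signature = new double[k];
    for (int nz = 0; nz < indices.length; nz++) {
      Random r = new Random(indices[nz]);  // deterministic per row of R
      for (int j = 0; j < k; j++) {
        signature[j] += values[nz] * r.nextGaussian();
      }
    }
    System.out.println("first signature coordinate: " + signature[0]);
  }
}
{code}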

One of the most straightforward techniques for dimensionality reduction is
matrix decomposition: singular value decomposition, eigendecomposition,
non-negative matrix factorization, etc. In their truncated form, these
decompositions are an excellent first approach toward linearity-preserving
unsupervised feature selection and dimensional reduction. Of course, sparse
matrices which don't fit in RAM need special treatment as far as decomposition
is concerned: parallelizable and/or stream-oriented algorithms are needed.
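
To be concrete about what "truncated" means in the SVD case (this is standard
linear algebra, not anything Mahout-specific): keeping only the k largest
singular values gives the best rank-k approximation to the original matrix.

{code}
A = U \Sigma V^T,   A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T

\| A - A_k \|_F = \min_{rank(B) \le k} \| A - B \|_F = \sqrt{ \sum_{i > k} \sigma_i^2 }
{code}

The second line is the Eckart-Young theorem; the rows of A are then
represented by the dense k-dimensional coordinates U_k \Sigma_k.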

h1. Singular Value Decomposition

As of Mahout 0.3 (the first release with MAHOUT-180 applied), two scalable
implementations of SVD are available: a stream-oriented implementation using
the Asymmetric Generalized Hebbian Algorithm outlined in Genevieve Gorrell &
Brandyn Webb's paper ([Gorrell and Webb 2005|
http://www.dcs.shef.ac.uk/~genevieve/gorrell_webb.pdf]), and a
[Lanczos|http://en.wikipedia.org/wiki/Lanczos_algorithm] implementation. The
latter comes both single-threaded, in the o.a.m.math.decomposer.lanczos
package (math module), and as a (series of) Hadoop map-reduce job(s) in the
o.a.m.math.hadoop.decomposer package (core module). Coming soon: stochastic
decomposition.
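
To give a feel for what these solvers compute, here is a toy power-iteration
sketch that finds the top singular value/vector pair of a small matrix. Power
iteration is a much simpler relative of Lanczos; this is not Mahout's Lanczos
or Hebbian code (see the packages above for those), and the example matrix and
iteration count are arbitrary.

{code:java}
// Toy power iteration on A^T A for the top right singular vector of A.
// Deliberately the simplest possible method -- NOT Mahout's Lanczos or
// Hebbian implementation -- just to show what an SVD solver computes.
public class PowerIterationSketch {
  public static void main(String[] args) {
    double[][] a = {{2.0, 0.0, 1.0},
                    {0.0, 1.0, 0.0},
                    {1.0, 0.0, 2.0}};
    int m = a.length, n = a[0].length;
    double[] v = new double[n];
    v[0] = 1.0;  // arbitrary starting vector

    for (int iter = 0; iter < 100; iter++) {
      // w = A v
      double[] w = new double[m];
      for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
          w[i] += a[i][j] * v[j];
      // v = A^T w, then normalize to unit length
      double[] next = new double[n];
      for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
          next[j] += a[i][j] * w[i];
      double norm = 0.0;
      for (int j = 0; j < n; j++) norm += next[j] * next[j];
      norm = Math.sqrt(norm);
      for (int j = 0; j < n; j++) v[j] = next[j] / norm;
    }
    // at convergence, sigma_1 = ||A v||
    double sigmaSq = 0.0;
    for (int i = 0; i < m; i++) {
      double wi = 0.0;
      for (int j = 0; j < n; j++) wi += a[i][j] * v[j];
      sigmaSq += wi * wi;
    }
    System.out.println("top singular value ~ " + Math.sqrt(sigmaSq));
  }
}
{code}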


