Re: PCA with Mahout

Jake Mannix Tue, 02 Feb 2010 09:37:24 -0800

Hi Danny,

  The hadoopified version of the Lanczos impl currently in Mahout is really
easy, but it has not been ported in from decomposer yet.  There were some other
things I wanted to get ready first, but users' priorities trump long-term
plans!  I'll get that ported over this week.


  I will also be adding stochastic decomposition soon after.

  Regarding the scaling of what is in there now, however, I should say that
the stream-oriented GHA SVD impl isn't Hadoop-based, but it scales to tens of
millions of rows on one box.  It does require a nice stream-oriented matrix
impl (also coming this week).
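
For readers unfamiliar with GHA (the Generalized Hebbian Algorithm, also known
as Sanger's rule), here is a minimal single-machine sketch of the idea in plain
Python. It is not the decomposer or Mahout implementation: the function names
`gha_update` and `gha`, the learning rate, and the random initialization are
all assumptions chosen for illustration. The point it shows is that each row is
consumed once and then discarded, which is why a stream-oriented matrix is
enough.

```python
import random

def gha_update(W, x, lr):
    """Apply one Sanger's-rule update to the k x d weight matrix W for one row x."""
    # Project the row onto each current component estimate.
    y = [sum(wj * xj for wj, xj in zip(w, x)) for w in W]
    residual = list(x)
    new_W = []
    for w, yi in zip(W, y):
        # Remove what this component (and all earlier ones) already explain.
        residual = [r - yi * wj for r, wj in zip(residual, w)]
        new_W.append([wj + lr * yi * rj for wj, rj in zip(w, residual)])
    return new_W

def gha(rows, dim, k, lr=0.01, seed=0):
    """Estimate the top-k principal directions in a single pass over `rows`.

    `rows` is any iterable of dense, length-`dim` lists (hypothetical input
    format); rows are assumed to be roughly mean-centered already.
    """
    rng = random.Random(seed)
    W = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(k)]
    for x in rows:
        W = gha_update(W, x, lr)
    return W
```

Only the k x d matrix `W` is held in memory, so the number of rows affects
running time but not memory use.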

  -jake

On Feb 2, 2010 8:17 AM, "Danny Leshem" <[email protected]> wrote:

Hi,

I'm researching a massive dataset (say, ~100M rows) in which URLs are mapped
to word counts. These counts refer to words from a rather big dictionary,
say in the 5M range (obviously, the dataset is extremely sparse).

My research aims to predict some variable related to these URLs, based on
latent word counts information. I wish to start with basic models, namely
linear regression of some sort (probably with some kind of regularization).
Since fitting such models over 5M predictor variables is rather, hmm,
unpleasant - I wish to reduce the dataset's dimensionality. Selecting
features based on PCA on the dataset seems like a reasonable approach.

I'm wondering if Mahout, at its current stage, is suitable for this kind of
analysis.

Specifically, I considered using Jake Mannix's latest
port <http://issues.apache.org/jira/browse/MAHOUT-180> of the
excellent decomposer library to Mahout, but encountered two problems:
1) There is no org.apache.mahout.Matrix implementation that "lives" in HDFS
(my covariance matrix is way too big to fit on a single machine).
2) The org.apache.mahout.math.decomposer classes currently do not use
MapReduce, rendering them unfit for such large-scale computations.
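
As a side note on problem 1: iterative eigensolvers such as Lanczos or power
iteration only need products of the matrix with a vector, and for the Gram
matrix A^T A that product is a sum of independent per-row contributions, so the
d x d matrix never has to be formed. The sketch below illustrates this in plain
Python under some loud assumptions: rows are dicts of `{column_index: count}`,
the names `gram_times_vector` and `top_direction` are hypothetical, and
mean-centering is ignored (so this works with A^T A rather than the true
covariance matrix). It is not Mahout code.

```python
def gram_times_vector(sparse_rows, v):
    """Return (A^T A) v, where each row of A is a dict {column_index: value}."""
    result = [0.0] * len(v)
    for row in sparse_rows:
        # a_i . v touches only the non-zero entries of the row.
        dot = sum(val * v[j] for j, val in row.items())
        # Add (a_i . v) * a_i; summing these over rows is the "reduce" step.
        for j, val in row.items():
            result[j] += dot * val
    return result

def top_direction(sparse_rows, dim, iterations=50):
    """Power iteration for the leading eigenvector of A^T A.

    `sparse_rows` must be re-iterable (e.g. a list), since every iteration
    makes one full pass over the rows.
    """
    v = [1.0 / dim ** 0.5] * dim
    for _ in range(iterations):
        w = gram_times_vector(sparse_rows, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```

Each row's contribution depends only on that row and the current vector `v`,
so the per-row work could in principle run in mappers with the sum done in a
reducer.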

Am I missing something here?
If not, is there a plan to address these issues in Mahout (can anyone
provide an estimated time frame)?

I'm also wondering if there are plans to implement in Mahout other
algorithms for matrix decomposition, e.g. stochastic approximation
algorithms as reviewed here <http://arxiv.org/abs/0909.4061>.
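
To make the question concrete, the basic building block in that family of
methods is a randomized range finder: multiply the matrix by a small random
Gaussian matrix and orthonormalize the result, giving a basis Q such that A is
approximately Q Q^T A. The following is a rough plain-Python sketch of that one
step only, on small dense lists. The helper names (`matmul`, `transpose`,
`orthonormalize`, `randomized_range`) and the `oversample` default are
assumptions for illustration, not anything that exists in Mahout.

```python
import random

def matmul(A, B):
    """Dense product of two list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def orthonormalize(columns, tol=1e-10):
    """Modified Gram-Schmidt; drops columns that are numerically dependent."""
    basis = []
    for c in columns:
        original = sum(a * a for a in c) ** 0.5
        v = list(c)
        for q in basis:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > tol * original:
            basis.append([a / norm for a in v])
    return basis

def randomized_range(A, k, oversample=2, seed=0):
    """Return orthonormal vectors (each of length len(A)) approximately spanning range(A)."""
    rng = random.Random(seed)
    n = len(A[0])
    omega = [[rng.gauss(0, 1) for _ in range(k + oversample)] for _ in range(n)]
    Y = matmul(A, omega)  # m x (k + oversample) random sample of the range of A
    return orthonormalize(transpose(Y))
```

A full decomposition would continue by projecting A onto this basis and
decomposing the much smaller projected matrix; that part is omitted here.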

Best,
Danny
