Hi Danny,

The Hadoopified version of the Lanczos impl currently in Mahout is really easy, but it has not been ported in from decomposer yet. There were some other things I wanted to get ready first, but users' priorities trump long-term plans! I'll get that ported over this week.
I will also be adding stochastic decomposition soon after.

Regarding the scaling of what is in there now, however, I should say that the stream-oriented GHA SVD impl isn't Hadoop-based, but it scales to tens of millions of rows on one box. It does require a nice stream-oriented matrix impl (also coming this week).

-jake

On Feb 2, 2010 8:17 AM, "Danny Leshem" <[email protected]> wrote:

Hi,

I'm researching a massive dataset (say, ~100M rows) in which URLs are mapped to word counts. These counts refer to words from a rather big dictionary, say in the 5M range (obviously, the dataset is extremely sparse). My research aims to predict some variable related to these URLs, based on latent word-count information.

I wish to start with basic models, namely linear regression of some sort (probably with some kind of regularization). Since fitting such models over 5M predictor variables is rather, hmm, unpleasant, I wish to reduce the dataset's dimensionality. Selecting features based on PCA on the dataset seems like a reasonable approach.

I'm wondering if Mahout, at its current stage, is suitable for this kind of analysis. Specifically, I considered using Jake Mannix's latest port <http://issues.apache.org/jira/browse/MAHOUT-180> of the excellent decomposer library to Mahout, but encountered two problems:

1) There is no org.apache.mahout.Matrix implementation that "lives" in HDFS (my covariance matrix is way too big to fit on a single machine).
2) The org.apache.mahout.math.decomposer classes currently do not use MapReduce, rendering them unfit for such large-scale computations.

Am I missing something here? If not, is there a plan to address these issues in Mahout (can anyone provide an estimated time frame)?

I'm also wondering if there are plans to implement other matrix decomposition algorithms in Mahout, e.g. stochastic approximation algorithms as reviewed here <http://arxiv.org/abs/0909.4061>.

Best,
Danny
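
For readers unfamiliar with the stream-oriented GHA approach mentioned in the reply, here is a minimal sketch of the Generalized Hebbian Algorithm (Sanger's rule) in plain Python. It is not the Mahout or decomposer code: the names `gha_step` and `train_gha`, the dict-based sparse row format, and the fixed learning rate are all assumptions made for illustration. It also assumes the data is already zero-mean, and it touches every column on each update for clarity, which a real implementation for 5M columns would need to avoid.

```python
import math
import random


def gha_step(weights, row, lr):
    """Apply one Sanger's-rule (GHA) update for a single sparse row.

    weights: list of k dense vectors (lists of floats), updated in place.
    row: sparse row as {column_index: value}; data is assumed zero-mean.
    lr: learning rate (hypothetical fixed value, not tuned).
    """
    n_cols = len(weights[0])
    # Project the row onto each current component estimate.
    outputs = [sum(w[j] * v for j, v in row.items()) for w in weights]
    # The residual starts as the row and is deflated component by component.
    residual = [0.0] * n_cols
    for j, v in row.items():
        residual[j] = v
    for w, y in zip(weights, outputs):
        # Remove this component's contribution, using its pre-update value.
        for j in range(n_cols):
            residual[j] -= y * w[j]
        # Hebbian step toward the part of the row not yet explained.
        for j in range(n_cols):
            w[j] += lr * y * residual[j]


def train_gha(rows, n_cols, k, lr=0.005, seed=0):
    """Make a single pass over an iterable of sparse rows; return k components."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_cols)] for _ in range(k)]
    for row in rows:  # rows may be a generator, so the matrix never sits in memory
        gha_step(weights, row, lr)
    return weights
```

Because `train_gha` only iterates over `rows` once, the input can be read from disk row by row, which is roughly why this style of algorithm can handle many rows on one box without a distributed matrix.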
