You're right, I skipped that, sorry.
On Tue, Mar 26, 2013 at 11:30 PM, Suneel Marthi <[email protected]> wrote:

> Gokhan,
>
> Thinking out loud here; I have not tried running LDA, so I could be wrong.
>
> As a precursor to Step 2 below, did you try running the RowIdJob, which
> should create <IntWritable, VectorWritable> pairs?
> It also creates a 'docIndex', which is <IntWritable, Text>, to map the
> ints back to the original Text.
>
> Suneel
>
>
> ________________________________
> From: Gokhan Capan <[email protected]>
> To: [email protected]
> Sent: Tuesday, March 26, 2013 4:35 PM
> Subject: Mahout Suggestions - Refactoring Effort
>
> I am moving here, upon request, the email that I wrote to the Call to
> Action thread.
>
> I'll start with an example of what I experience when I use Mahout, and
> then list my humble suggestions.
>
> When I try to run Latent Dirichlet Allocation for topic discovery, here
> are the steps to follow:
>
> 1- First I use seq2sparse to convert text to vectors. The output is
> <Text, VectorWritable> pairs. (If I have a CSV data file, which is
> understandable, whose lines are <id, text> pairs, I need to develop my
> own tool to convert it to vectors.)
>
> 2- I run LDA on the transformed data, but it doesn't work, because LDA
> needs <IntWritable, VectorWritable> pairs.
>
> 3- I convert the Text keys to IntWritable ones with a custom tool.
>
> 4- Then I run LDA, and to see the results I need to run vectordump with
> the sort flag (it usually throws an OutOfMemoryError). An ldadump tool
> does not exist. What I see is fairly different from clusterdump results,
> so I spend some time understanding what it means. (And I need to know
> that a vectordump tool exists at all in order to see the results.)
>
> 5- After running LDA, when I have a new document that I want to assign
> to a topic, there is no way, or none I am aware of, to use my learned
> LDA model to assign this document to a topic.
>
> I can give further examples, but I believe this makes my point clear.
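For the record, with the rowid step you mention included, the sequence I ended up running looks roughly like this. Paths are placeholders and the flag names are from memory, so please double-check each one with `bin/mahout <command> --help` before relying on it:

```shell
# 1. text SequenceFiles -> <Text, VectorWritable> tf-idf vectors
bin/mahout seq2sparse -i text-seqfiles -o text-vectors

# 2. <Text, VectorWritable> -> <IntWritable, VectorWritable> matrix,
#    plus a docIndex (<IntWritable, Text>) for mapping ints back to keys
bin/mahout rowid -i text-vectors/tfidf-vectors -o text-matrix

# 3. run LDA (the cvb driver) on the int-keyed matrix
bin/mahout cvb -i text-matrix/matrix -o topic-term -dt doc-topics -k 20

# 4. inspect the topic-term distributions
bin/mahout vectordump -i topic-term -d text-vectors/dictionary.file-0 \
  -dt sequencefile --sortVectors
```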
>
>
> Would you consider refactoring Mahout so that the project follows a
> clear, layered structure for all algorithms, and documenting that
> structure?
>
> IMO the knowledge discovery process follows a certain path, and Mahout
> can define rules that would constrain developers and guide users. For
> example:
>
> - All algorithms take Mahout matrices as input and output.
> - All preprocessing tools should be generic enough to produce
> appropriate input for Mahout algorithms.
> - All algorithms should output a model that users can use beyond
> training and testing.
> - Tools that dump results should follow a strictly defined format
> agreed on by the community.
> - All algorithms of a similar kind should use the same evaluation tools.
> - ...
>
> There could be separate layers: a preprocessing layer, an algorithms
> layer, an evaluation layer, and so on.
>
> This way users would be aware of the steps they need to perform, and any
> one step could be replaced by an alternative.
>
> Developers would contribute to the layer they feel comfortable with, and
> would satisfy the expected input and output, preserving the integrity of
> the whole.
>
> Mahout has tools for nearly all of these layers, but personally, when I
> use Mahout (and I've been using it for a long time), I feel lost in the
> steps I should follow.
>
> Moreover, the refactoring could eliminate duplicate data structures and
> stick to Mahout matrices where available. All similarity measures would
> operate on Mahout Vectors, for example.
>
> We, in the lab and in our company, do some of this already. An example:
>
> We implemented an HBase-backed Mahout Matrix, which we use for projects
> where online learning algorithms operate on large input and learn a big
> parameter matrix (one needs this for matrix-factorization-based
> recommenders). The persistent parameter matrix then becomes an input for
> the live system. We then used the same matrix implementation as the
> underlying data store of Recommender DataModels.
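To make the layering idea above concrete, here is a minimal sketch of the kind of contract it implies: every algorithm consumes a matrix and produces a model that remains usable after training. None of these interfaces exist in Mahout; all names are illustrative only.

```java
// Hypothetical sketch of a layered contract; not actual Mahout API.
public class LayeredSketch {

    interface Preprocessor {            // preprocessing layer
        double[][] toMatrix(String rawInput);
    }

    interface TopicModel {              // a trained model, reusable beyond training
        int assignTopic(double[] document);
    }

    interface TopicLearner {            // algorithm layer: matrix in, model out
        TopicModel train(double[][] corpus, int numTopics);
    }

    // A trivial stand-in learner, just to show the shape of the contract:
    // it assigns each document to the index of its largest feature,
    // modulo the number of topics.
    static TopicLearner dummyLearner() {
        return (corpus, numTopics) -> doc -> {
            int argMax = 0;
            for (int i = 1; i < doc.length; i++) {
                if (doc[i] > doc[argMax]) argMax = i;
            }
            return argMax % numTopics;
        };
    }
}
```

With a contract like this, step 5 of my list above stops being a problem by construction: whatever the learner, the returned model can classify a new document.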
> This was advantageous in many ways:
>
> - Everyone knows that any dataset should be in Mahout matrix format, and
> either applies appropriate preprocessing or writes a new preprocessor.
> - We can use different recommenders interchangeably.
> - Any optimization of matrix operations applies everywhere.
> - Different people can work on different parts (evaluation, model
> optimization, recommender algorithms) without bothering the others.
>
> Apart from all this, I should say that I am always eager to contribute
> to Mahout, as some of the committers already know.
>
> Best Regards
>
> Gokhan
>
> --
> Gokhan
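For anyone curious about the store-backed matrix mentioned above, the idea reduces to something like the following sketch. A HashMap stands in for the HBase table, and the class name is made up for illustration; only the getQuick/setQuick naming follows the Mahout Matrix convention.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a matrix whose cells live in an external key-value
// store, so that a learned parameter matrix persists beyond the training
// job and can back a live recommender. A HashMap stands in for HBase here.
public class KeyValueBackedMatrix {
    private final Map<Long, Double> store = new HashMap<>(); // cellKey -> value
    private final int numCols;

    public KeyValueBackedMatrix(int numCols) {
        this.numCols = numCols;
    }

    // Encode (row, col) as a single key, as one would for an HBase row key.
    private long key(int row, int col) {
        return (long) row * numCols + col;
    }

    public void setQuick(int row, int col, double value) {
        store.put(key(row, col), value);   // in real life: an HBase Put
    }

    public double getQuick(int row, int col) {
        // Sparse semantics: unset cells read as 0.0, as in Mahout matrices.
        return store.getOrDefault(key(row, col), 0.0);
    }
}
```

Because both the training job and the recommender DataModel talk to the same matrix abstraction, neither side needs to know where the cells actually live.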
