I totally agree. Additionally, let me suggest a starting point. :) I'm working with Ted on StreamingKMeans. I'll be uploading patches for the MapReduce version and command line program soon. I could really use feedback on the parameters, documentation, comments, package structure, naming...
On Wed, Mar 27, 2013 at 12:14 AM, Gokhan Capan <[email protected]> wrote:
> What I actually believe is that we should define a set of design principles for Mahout.
>
> This may differ per kind of algorithm; maybe one set for unsupervised ones, one for supervised ones, and one for dyadic prediction algorithms (recommenders). I am not sure yet.
>
> I consider myself a 'power user' of Mahout, and a Mahout enthusiast without letting committers know :) I talk to many people who reach out to me about using Mahout for their internal commercial projects, I mentor some undergrads who want to use Mahout for their school projects, and I contribute myself from time to time.
>
> I actually have a lot of pre- and post-processing tools to make Mahout easier to use, and if there is an effort to make the architecture more precise, I will definitely join.
>
> This is what I have collected from people's feedback, and it makes me sad to see that excellent algorithms cannot be used just because the usage path is blurry.
>
>
> On Tue, Mar 26, 2013 at 11:56 PM, Sebastian Schelter <[email protected]> wrote:
>
>> Hi Gokhan,
>>
>> I like the idea, but I'm not sure whether it's completely feasible for all parts of Mahout. A lot of jobs need a little more than a matrix, for example an additional dictionary for text-based stuff.
>>
>> In the collaborative filtering code, we already have a common input format: all recommenders can work with textual files that have a (user,item,rating) triple per line.
>>
>> Internally the Hadoop stuff works on vectors, which are created by the PreparePreferenceMatrixJob, but we found it easier to use the textual format as input for the jobs.
>>
>> So in summary, I think your refactoring is a good idea, but you should choose a particular part of Mahout to start with, maybe by creating an easy-to-use pipeline for LDA.
>>
>> Best,
>> Sebastian
>>
>> On 26.03.2013 21:35, Gokhan Capan wrote:
>> > I am moving my email that I wrote in the Call to Action thread here, upon request.
>> >
>> > I'll start with an example of what I experience when I use Mahout, and then list my humble suggestions.
>> >
>> > When I try to run Latent Dirichlet Allocation for topic discovery, here are the steps to follow:
>> >
>> > 1- First I use seq2sparse to convert text to vectors. The output is (Text, VectorWritable) pairs. (If I have a csv data file, which is understandable, with lines of (id, text) pairs, I need to develop my own tool to convert it to vectors.)
>> >
>> > 2- I run LDA on the transformed data, but it doesn't work, because LDA needs (IntWritable, VectorWritable) pairs.
>> >
>> > 3- I convert the Text keys to IntWritable ones with a custom tool.
>> >
>> > 4- Then I run LDA, and to see the results, I need to run vectordump with the sort flag (which usually throws an OutOfMemoryError). An ldadump tool does not exist. What I see is fairly different from clusterdump results, so I spend some time understanding what it means. (And I need to know that a vectordump tool exists at all in order to see the results.)
>> >
>> > 5- After running LDA, when I have a new document that I want to assign to a topic, there is no way (or I am not aware of one) to use my learned LDA model to assign this document to a topic.
>> >
>> > I can give further examples, but I believe this makes my point clear.
>> >
>> >
>> > Would you consider refactoring Mahout so that the project follows a clear, layered structure for all algorithms, and documenting that structure?
>> >
>> > IMO the knowledge discovery process has a certain path, and Mahout can define rules that would constrain developers and guide users. For example:
>> >
>> > - All algorithms take Mahout matrices as input and output.
>> > - All preprocessing tools should be generic enough that they produce appropriate input for Mahout algorithms.
>> > - All algorithms should output a model that users can use beyond training and testing.
>> > - Tools that dump results should follow a strictly defined format suggested by the community.
>> > - All algorithms of the same kind should use the same evaluation tools.
>> > - ...
>> >
>> > There may be separate layers: a preprocessing layer, an algorithms layer, an evaluation layer, and so on.
>> >
>> > This way users would be aware of the steps they need to perform, and any one step could be replaced by an alternative.
>> >
>> > Developers would contribute to the layer they feel comfortable with, and would satisfy the expected input and output, to preserve the integrity of the whole.
>> >
>> > Mahout has tools for nearly all of these layers, but personally, when I use Mahout (and I've been using it for a long time), I feel lost in the steps I should follow.
>> >
>> > Moreover, the refactoring may eliminate duplicate data structures and standardize on Mahout matrices where available. All similarity measures operate on Mahout Vectors, for example.
>> >
>> > We, in the lab and in our company, do some of this. An example:
>> >
>> > We implemented an HBase-backed Mahout Matrix, which we use for our projects where online learning algorithms operate on large input and learn a big parameter matrix (one needs this for matrix factorization based recommenders). The persistent parameter matrix then becomes an input for the live system. We then used the same matrix implementation as the underlying data store of Recommender DataModels.
>> > This was advantageous in many ways:
>> >
>> > - Everyone knows that any dataset should be in Mahout matrix format, and applies the appropriate preprocessing, or writes it.
>> > - We can use different recommenders interchangeably.
>> > - Any optimization of matrix operations applies everywhere.
>> > - Different people can work on different parts (evaluation, model optimization, recommender algorithms) without bothering the others.
>> >
>> > Apart from all this, I should say that I am always eager to contribute to Mahout, as some of the committers already know.
>> >
>> > Best Regards
>> >
>> > Gokhan
>> >
>>
>
>
> --
> Gokhan
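The Text-to-IntWritable key conversion Gokhan describes in step 3 of his LDA walkthrough amounts to assigning each document key the next free integer id. A minimal sketch, assuming nothing about Mahout's actual tooling: Hadoop's Text/IntWritable are stood in for by plain String/int here, and the class name DocIdDictionary is made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative sketch (not Mahout code) of mapping textual document
 * keys to stable integer ids, as needed to turn the (Text, VectorWritable)
 * output of seq2sparse into the (IntWritable, VectorWritable) input LDA expects.
 */
public class DocIdDictionary {
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    /** Return the id already assigned to this key, or assign the next one. */
    public int idFor(String textKey) {
        // Reading ids.size() inside the mapping function is safe:
        // the map is not modified until the function returns.
        return ids.computeIfAbsent(textKey, k -> ids.size());
    }

    public static void main(String[] args) {
        DocIdDictionary dict = new DocIdDictionary();
        System.out.println(dict.idFor("doc-a")); // 0
        System.out.println(dict.idFor("doc-b")); // 1
        System.out.println(dict.idFor("doc-a")); // 0 again: ids are stable
    }
}
```

In a real Hadoop job this dictionary would have to be built in a single pass (or with a deterministic hash) so that ids stay consistent across mappers; the in-memory map is only meant to show the idea.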
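The common textual input format Sebastian mentions for the collaborative filtering code, one (user,item,rating) triple per line, can be parsed in a few lines. This is a hedged sketch: the Preference class and its field names are illustrative, not the actual Mahout API.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative parser for the "user,item,rating" one-triple-per-line format. */
public class PreferenceLineParser {

    /** Hypothetical value holder; Mahout's own preference types differ. */
    static final class Preference {
        final long userId;
        final long itemId;
        final float rating;

        Preference(long userId, long itemId, float rating) {
            this.userId = userId;
            this.itemId = itemId;
            this.rating = rating;
        }
    }

    /** Parse a single "user,item,rating" line into a Preference. */
    static Preference parse(String line) {
        String[] fields = line.split(",");
        return new Preference(Long.parseLong(fields[0].trim()),
                              Long.parseLong(fields[1].trim()),
                              Float.parseFloat(fields[2].trim()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,101,4.5", "1,102,3.0", "2,101,5.0");
        for (String l : lines) {
            Preference p = parse(l);
            System.out.println("user " + p.userId + " rated item " + p.itemId + ": " + p.rating);
        }
    }
}
```

The appeal of this format, as the thread notes, is that any recommender can consume it directly while the Hadoop jobs build their vector representations internally.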
