You're right, I skipped that, sorry.
On Tue, Mar 26, 2013 at 11:30 PM, Suneel Marthi <[email protected]> wrote:

> Gokhan,
>
> Thinking out loud here; I have not tried running LDA, so I could be wrong.
>
> As a precursor to Step 2 below, did you try running the RowIdJob, which
> should create <IntWritable, VectorWritable> pairs?
> It also creates a 'docIndex', which is <IntWritable, Text>, to map the
> ints back to the original Text.
>
> Suneel
>
>
> ________________________________
> From: Gokhan Capan <[email protected]>
> To: [email protected]
> Sent: Tuesday, March 26, 2013 4:35 PM
> Subject: Mahout Suggestions - Refactoring Effort
>
> I am moving here, upon request, the email that I wrote to the Call to
> Action thread.
>
> I'll start with an example of what I experience when I use Mahout, and
> then list my humble suggestions.
>
> When I try to run Latent Dirichlet Allocation for topic discovery, here
> are the steps to follow:
>
> 1- First I use seq2sparse to convert text to vectors. The output is
> <Text, VectorWritable> pairs. (If I have a CSV data file, which is
> understandable, whose lines are <id, text> pairs, I need to develop my
> own tool to convert it to vectors.)
>
> 2- I run LDA on the transformed data, but it doesn't work, because LDA
> needs <IntWritable, VectorWritable> pairs.
>
> 3- I convert the Text keys to IntWritable ones with a custom tool.
>
> 4- Then I run LDA, and to see the results I need to run vectordump with
> the sort flag (it usually throws an OutOfMemoryError). An ldadump tool
> does not exist. What I see is fairly different from clusterdump results,
> so I spend some time understanding what it means. (And I need to know
> that a vectordump tool exists at all in order to see the results.)
>
> 5- After running LDA, when I have a new document that I want to assign
> to a topic, there is no way, or none I am aware of, to use my learned
> LDA model to assign this document to a topic.
>
> I can give further examples, but I believe this makes my point clear.
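For the record, with the rowid step you mention included, the sequence I ended up running looks roughly like this. Paths are placeholders and the flag names are from memory, so please double-check each one with `bin/mahout <command> --help` before relying on it:

```shell
# 1. text SequenceFiles -> <Text, VectorWritable> tf-idf vectors
bin/mahout seq2sparse -i text-seqfiles -o text-vectors

# 2. <Text, VectorWritable> -> <IntWritable, VectorWritable> matrix,
#    plus a docIndex (<IntWritable, Text>) for mapping ints back to keys
bin/mahout rowid -i text-vectors/tfidf-vectors -o text-matrix

# 3. run LDA (the cvb driver) on the int-keyed matrix
bin/mahout cvb -i text-matrix/matrix -o topic-term -dt doc-topics -k 20

# 4. inspect the topic-term distributions
bin/mahout vectordump -i topic-term -d text-vectors/dictionary.file-0 \
  -dt sequencefile --sortVectors
```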
>
>
> Would you consider refactoring Mahout so that the project follows a
> clear, layered structure for all algorithms, and documenting that
> structure?
>
> IMO the knowledge discovery process follows a certain path, and Mahout
> can define rules that would constrain developers and guide users. For
> example:
>
> - All algorithms take Mahout matrices as input and output.
> - All preprocessing tools should be generic enough to produce
> appropriate input for Mahout algorithms.
> - All algorithms should output a model that users can use beyond
> training and testing.
> - Tools that dump results should follow a strictly defined format
> agreed on by the community.
> - All algorithms of a similar kind should use the same evaluation tools.
> - ...
>
> There could be separate layers: a preprocessing layer, an algorithms
> layer, an evaluation layer, and so on.
>
> This way users would be aware of the steps they need to perform, and any
> one step could be replaced by an alternative.
>
> Developers would contribute to the layer they feel comfortable with, and
> would satisfy the expected input and output, preserving the integrity of
> the whole.
>
> Mahout has tools for nearly all of these layers, but personally, when I
> use Mahout (and I've been using it for a long time), I feel lost in the
> steps I should follow.
>
> Moreover, the refactoring could eliminate duplicate data structures and
> stick to Mahout matrices where available. All similarity measures would
> operate on Mahout Vectors, for example.
>
> We, in the lab and in our company, do some of this already. An example:
>
> We implemented an HBase-backed Mahout Matrix, which we use for projects
> where online learning algorithms operate on large input and learn a big
> parameter matrix (one needs this for matrix-factorization-based
> recommenders). The persistent parameter matrix then becomes an input for
> the live system. We then used the same matrix implementation as the
> underlying data store of Recommender DataModels.
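To make the layering idea above concrete, here is a minimal sketch of the kind of contract it implies: every algorithm consumes a matrix and produces a model that remains usable after training. None of these interfaces exist in Mahout; all names are illustrative only.

```java
// Hypothetical sketch of a layered contract; not actual Mahout API.
public class LayeredSketch {

    interface Preprocessor {            // preprocessing layer
        double[][] toMatrix(String rawInput);
    }

    interface TopicModel {              // a trained model, reusable beyond training
        int assignTopic(double[] document);
    }

    interface TopicLearner {            // algorithm layer: matrix in, model out
        TopicModel train(double[][] corpus, int numTopics);
    }

    // A trivial stand-in learner, just to show the shape of the contract:
    // it assigns each document to the index of its largest feature,
    // modulo the number of topics.
    static TopicLearner dummyLearner() {
        return (corpus, numTopics) -> doc -> {
            int argMax = 0;
            for (int i = 1; i < doc.length; i++) {
                if (doc[i] > doc[argMax]) argMax = i;
            }
            return argMax % numTopics;
        };
    }
}
```

With a contract like this, step 5 of my list above stops being a problem by construction: whatever the learner, the returned model can classify a new document.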
> This was advantageous in many ways:
>
> - Everyone knows that any dataset should be in Mahout matrix format, and
> either applies appropriate preprocessing or writes a new preprocessor.
> - We can use different recommenders interchangeably.
> - Any optimization of matrix operations applies everywhere.
> - Different people can work on different parts (evaluation, model
> optimization, recommender algorithms) without bothering the others.
>
> Apart from all this, I should say that I am always eager to contribute
> to Mahout, as some of the committers already know.
>
> Best Regards
>
> Gokhan
>
> --
> Gokhan
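For anyone curious about the store-backed matrix mentioned above, the idea reduces to something like the following sketch. A HashMap stands in for the HBase table, and the class name is made up for illustration; only the getQuick/setQuick naming follows the Mahout Matrix convention.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a matrix whose cells live in an external key-value
// store, so that a learned parameter matrix persists beyond the training
// job and can back a live recommender. A HashMap stands in for HBase here.
public class KeyValueBackedMatrix {
    private final Map<Long, Double> store = new HashMap<>(); // cellKey -> value
    private final int numCols;

    public KeyValueBackedMatrix(int numCols) {
        this.numCols = numCols;
    }

    // Encode (row, col) as a single key, as one would for an HBase row key.
    private long key(int row, int col) {
        return (long) row * numCols + col;
    }

    public void setQuick(int row, int col, double value) {
        store.put(key(row, col), value);   // in real life: an HBase Put
    }

    public double getQuick(int row, int col) {
        // Sparse semantics: unset cells read as 0.0, as in Mahout matrices.
        return store.getOrDefault(key(row, col), 0.0);
    }
}
```

Because both the training job and the recommender DataModel talk to the same matrix abstraction, neither side needs to know where the cells actually live.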
