I totally agree. Additionally, let me suggest a starting point. :) I'm working with Ted on StreamingKMeans. I'll be uploading patches for the MapReduce version and command line program soon. I could really use feedback on the parameters, documentation, comments, package structure, naming...
On Wed, Mar 27, 2013 at 12:14 AM, Gokhan Capan <[email protected]> wrote:
> What I actually believe is that we should define a set of design principles for Mahout.
>
> This may differ per kind of algorithm; maybe one set for unsupervised ones, one for supervised ones, and one for dyadic prediction algorithms (recommenders). I am not sure yet.
>
> I consider myself a 'power user' of Mahout, and a Mahout enthusiast without letting committers know :) I talk to many people who reach out to me about using Mahout for their internal commercial projects, I mentor some undergrads who want to use Mahout for their school projects, and I contribute myself from time to time.
>
> I actually have a lot of pre- and post-processing tools to make Mahout easier to use, and if there is an effort to make the architecture more precise, I will definitely join.
>
> This is what I have collected from people's feedback, and it makes me sad to see that excellent algorithms cannot be used just because the usage path is blurry.
>
>
> On Tue, Mar 26, 2013 at 11:56 PM, Sebastian Schelter <[email protected]> wrote:
>
>> Hi Gokhan,
>>
>> I like the idea, but I'm not sure whether it's completely feasible for all parts of Mahout. A lot of jobs need a little more than a matrix, for example an additional dictionary for text-based stuff.
>>
>> In the collaborative filtering code, we already have a common input format: all recommenders can work with textual files that have a (user,item,rating) triple per line.
>>
>> Internally the Hadoop stuff works on vectors, which are created by the PreparePreferenceMatrixJob, but we found it easier to use the textual format as input for the jobs.
>>
>> So in summary, I think your refactoring is a good idea, but you should choose a particular part of Mahout to start with, maybe by creating an easy-to-use pipeline for LDA.
>>
>> Best,
>> Sebastian
>>
>> On 26.03.2013 21:35, Gokhan Capan wrote:
>> > I am moving my email that I wrote in the Call to Action thread here, upon request.
>> >
>> > I'll start with an example of what I experience when I use Mahout, and then list my humble suggestions.
>> >
>> > When I try to run Latent Dirichlet Allocation for topic discovery, here are the steps to follow:
>> >
>> > 1- First I use seq2sparse to convert text to vectors. The output is (Text, VectorWritable) pairs. (If I have a csv data file, which is understandable, with lines of (id, text) pairs, I need to develop my own tool to convert it to vectors.)
>> >
>> > 2- I run LDA on the transformed data, but it doesn't work, because LDA needs (IntWritable, VectorWritable) pairs.
>> >
>> > 3- I convert the Text keys to IntWritable ones with a custom tool.
>> >
>> > 4- Then I run LDA, and to see the results, I need to run vectordump with the sort flag (which usually throws an OutOfMemoryError). An ldadump tool does not exist. What I see is fairly different from clusterdump results, so I spend some time understanding what it means. (And I need to know that a vectordump tool exists at all in order to see the results.)
>> >
>> > 5- After running LDA, when I have a new document that I want to assign to a topic, there is no way (or I am not aware of one) to use my learned LDA model to assign this document to a topic.
>> >
>> > I can give further examples, but I believe this makes my point clear.
>> >
>> >
>> > Would you consider refactoring Mahout so that the project follows a clear, layered structure for all algorithms, and documenting that structure?
>> >
>> > IMO the knowledge discovery process has a certain path, and Mahout can define rules that would constrain developers and guide users. For example:
>> >
>> > - All algorithms take Mahout matrices as input and output.
>> > - All preprocessing tools should be generic enough that they produce appropriate input for Mahout algorithms.
>> > - All algorithms should output a model that users can use beyond training and testing.
>> > - Tools that dump results should follow a strictly defined format suggested by the community.
>> > - All algorithms of the same kind should use the same evaluation tools.
>> > - ...
>> >
>> > There may be separate layers: a preprocessing layer, an algorithms layer, an evaluation layer, and so on.
>> >
>> > This way users would be aware of the steps they need to perform, and any one step could be replaced by an alternative.
>> >
>> > Developers would contribute to the layer they feel comfortable with, and would satisfy the expected input and output, to preserve the integrity of the whole.
>> >
>> > Mahout has tools for nearly all of these layers, but personally, when I use Mahout (and I've been using it for a long time), I feel lost in the steps I should follow.
>> >
>> > Moreover, the refactoring may eliminate duplicate data structures and standardize on Mahout matrices where available. All similarity measures operate on Mahout Vectors, for example.
>> >
>> > We, in the lab and in our company, do some of this. An example:
>> >
>> > We implemented an HBase-backed Mahout Matrix, which we use for our projects where online learning algorithms operate on large input and learn a big parameter matrix (one needs this for matrix factorization based recommenders). The persistent parameter matrix then becomes an input for the live system. We then used the same matrix implementation as the underlying data store of Recommender DataModels.
>> > This was advantageous in many ways:
>> >
>> > - Everyone knows that any dataset should be in Mahout matrix format, and applies the appropriate preprocessing, or writes it.
>> > - We can use different recommenders interchangeably.
>> > - Any optimization of matrix operations applies everywhere.
>> > - Different people can work on different parts (evaluation, model optimization, recommender algorithms) without bothering the others.
>> >
>> > Apart from all this, I should say that I am always eager to contribute to Mahout, as some of the committers already know.
>> >
>> > Best Regards
>> >
>> > Gokhan
>> >
>>
>
>
> --
> Gokhan
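The Text-to-IntWritable key conversion Gokhan describes in step 3 of his LDA walkthrough amounts to assigning each document key the next free integer id. A minimal sketch, assuming nothing about Mahout's actual tooling: Hadoop's Text/IntWritable are stood in for by plain String/int here, and the class name DocIdDictionary is made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative sketch (not Mahout code) of mapping textual document
 * keys to stable integer ids, as needed to turn the (Text, VectorWritable)
 * output of seq2sparse into the (IntWritable, VectorWritable) input LDA expects.
 */
public class DocIdDictionary {
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    /** Return the id already assigned to this key, or assign the next one. */
    public int idFor(String textKey) {
        // Reading ids.size() inside the mapping function is safe:
        // the map is not modified until the function returns.
        return ids.computeIfAbsent(textKey, k -> ids.size());
    }

    public static void main(String[] args) {
        DocIdDictionary dict = new DocIdDictionary();
        System.out.println(dict.idFor("doc-a")); // 0
        System.out.println(dict.idFor("doc-b")); // 1
        System.out.println(dict.idFor("doc-a")); // 0 again: ids are stable
    }
}
```

In a real Hadoop job this dictionary would have to be built in a single pass (or with a deterministic hash) so that ids stay consistent across mappers; the in-memory map is only meant to show the idea.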
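The common textual input format Sebastian mentions for the collaborative filtering code, one (user,item,rating) triple per line, can be parsed in a few lines. This is a hedged sketch: the Preference class and its field names are illustrative, not the actual Mahout API.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative parser for the "user,item,rating" one-triple-per-line format. */
public class PreferenceLineParser {

    /** Hypothetical value holder; Mahout's own preference types differ. */
    static final class Preference {
        final long userId;
        final long itemId;
        final float rating;

        Preference(long userId, long itemId, float rating) {
            this.userId = userId;
            this.itemId = itemId;
            this.rating = rating;
        }
    }

    /** Parse a single "user,item,rating" line into a Preference. */
    static Preference parse(String line) {
        String[] fields = line.split(",");
        return new Preference(Long.parseLong(fields[0].trim()),
                              Long.parseLong(fields[1].trim()),
                              Float.parseFloat(fields[2].trim()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,101,4.5", "1,102,3.0", "2,101,5.0");
        for (String l : lines) {
            Preference p = parse(l);
            System.out.println("user " + p.userId + " rated item " + p.itemId + ": " + p.rating);
        }
    }
}
```

The appeal of this format, as the thread notes, is that any recommender can consume it directly while the Hadoop jobs build their vector representations internally.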
