Re: Mahout Suggestions - Refactoring Effort

Marty Kube Tue, 26 Mar 2013 20:15:34 -0700

IMHO usability is really important. I've posted a couple of patches recently around making the RF classifiers easier to use. I found myself working on consistent data format and command line option support. It's not glamorous but it's important.


On 3/26/2013 8:26 PM, Ted Dunning wrote:

Gokhan,


I think that the general drift of your recommendation is an excellent
suggestion and it is something that we have wrestled with a lot over time.
  The recommendations side of the house has more coherence in this matter
than other parts largely because there was a clear flow early on.

Now, however, the flow is becoming more clear for non-recommendation parts
of the system.

- we have 2-3 external kinds of input.  These include text and matrices.
  Text comes in two major forms, those being text in files with unspecified
separators and text in Lucene/Solr indexes.  Matrices come in several forms
including triples, CSV files, binary matrices and sequence files of vectors.

- there are currently only a few ways to convert text and external data to
matrices.  The two most prominent are dictionary based and hashed encoding.
  Hashed encoding is currently not as invertible as it should be.
  Dictionary based has the virtue of being invertible, but hashed encoding
has considerably more generality.  We have almost no support for multiple
fields in dictionary based encoding.
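
To make the invertibility trade-off above concrete, here is a minimal sketch
in plain Python. It is not Mahout's encoder code: the names DictionaryEncoder
and hashed_encode, the use of SHA-256 as the hash, and the "field:token" key
format are all assumptions made for illustration.

```python
import hashlib


class DictionaryEncoder:
    """Gives each new term the next free index and remembers the mapping."""

    def __init__(self):
        self.index_of = {}   # term -> index
        self.term_of = []    # index -> term (this is what makes it invertible)

    def encode(self, tokens):
        vector = {}
        for token in tokens:
            if token not in self.index_of:
                self.index_of[token] = len(self.term_of)
                self.term_of.append(token)
            idx = self.index_of[token]
            vector[idx] = vector.get(idx, 0) + 1
        return vector

    def decode(self, vector):
        # Recover term counts from a sparse vector using the stored dictionary.
        return {self.term_of[idx]: count for idx, count in vector.items()}


def hashed_encode(tokens, num_features, field=""):
    """Maps tokens straight to indices with a hash; no dictionary is kept.

    Any field name can be mixed into the hashed key, which is how this style
    handles multiple fields, but distinct terms may collide and the original
    terms cannot be recovered from the vector alone.
    """
    vector = {}
    for token in tokens:
        digest = hashlib.sha256(f"{field}:{token}".encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "big") % num_features
        vector[idx] = vector.get(idx, 0) + 1
    return vector
```

The dictionary encoder can map a vector back to terms because it stores the
dictionary; the hashed encoder needs no state and accepts arbitrary fields,
which is the generality mentioned above, at the cost of invertibility.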

- good conversion backwards and forwards depends on having schema
information that we don't retain or specify well.

- knowledge discovery pathways need more flexibility than recommendation
pathways regarding input and visualization.

- key knowledge discovery pathways that I know about include (a) input
summarization, (b) vectorization, (c) unsupervised analysis such as LDA,
LLL, clustering, SVD, (d) supervised training such as SGD, Naive Bayes and
random forest, and (e) visualization of results

I see that the major problems in Mahout are what Gokhan said, but with a
few extras:

1) as Gokhan said, the exploratory pathways are inconsistent

2) I think that our visualization pathways are also hideous

3) I think that we need a good document format with a reasonable schema.
  Rather than create such a thing, I would nominate Lucene/Solr indexes as a
first class object in Mahout.

4) our current command lines, with all the (many) different options and
incompatible conventions, are a bit of a shambles

Expressed this way, I think that these usability issues are fixable.
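
As one illustration of what consistent command line conventions (item 4)
could look like, here is a minimal sketch using Python's argparse. The
program name "toolkit", the subcommands, and every option name below are
hypothetical; this is not the Mahout command line.

```python
import argparse


def common_options():
    # Options that every tool shares, declared once so that spelling,
    # short forms and semantics cannot drift apart between tools.
    parent = argparse.ArgumentParser(add_help=False)
    parent.add_argument("--input", "-i", required=True, help="input path")
    parent.add_argument("--output", "-o", required=True, help="output path")
    parent.add_argument("--overwrite", action="store_true",
                        help="replace existing output")
    return parent


def build_parser():
    parser = argparse.ArgumentParser(prog="toolkit")
    commands = parser.add_subparsers(dest="command", required=True)
    shared = common_options()

    # Each subcommand inherits the shared options and adds only its own.
    vectorize = commands.add_parser("vectorize", parents=[shared])
    vectorize.add_argument("--encoding", choices=["dictionary", "hashed"],
                           default="hashed")

    train = commands.add_parser("train", parents=[shared])
    train.add_argument("--algorithm", required=True)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Declaring the shared options in one parent parser means a new tool gets the
same input, output and overwrite flags without restating them.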

What does everybody else think?  Would this leave us with a significantly
better system?



On Tue, Mar 26, 2013 at 9:35 PM, Gokhan Capan <[email protected]> wrote:

Upon request, I am moving the email that I wrote to the Call to Action
thread here.

I'll start with an example that I experience when I use Mahout, and list my
humble suggestions.

When I try to run Latent Dirichlet Allocation for topic discovery, here are
the steps to follow:

1- First I use seq2sparse to convert text to vectors. The output is Text,
VectorWritable pairs. (If I have a CSV data file with lines of id, text
pairs, which is a reasonable thing to have, I need to develop my own tool
to convert it to vectors.)

2- I run LDA on data I transformed, but it doesn’t work, because LDA needs
IntWritable, VectorWritable pairs.

3- I convert Text keys to IntWritable ones with a custom tool.

4- Then I run LDA, and to see the results I need to run vectordump with the
sort flag (it usually throws an OutOfMemoryError). An ldadump tool does not
exist. What I see is fairly different from clusterdump results, so I spend
some time working out what it means. (And I need to know that a vectordump
tool exists in order to see the results at all.)

5- After running LDA, when I have a document that I want to assign to a
topic, there is no way (or none that I am aware of) to use my learned LDA
model to assign this document to a topic.

I can give further examples, but I believe this will make my point clear.


Would you consider refactoring Mahout so that the project follows a clear,
layered structure for all algorithms, and documenting that structure?

IMO the Knowledge Discovery process has a certain path, and Mahout can
define rules that would bind developers and guide users. For example:


    - All algorithms take Mahout matrices as input and output.
    - All preprocessing tools should be generic enough, so that they produce
    appropriate input for mahout algorithms.
    - All algorithms should output a model that users can use beyond
    training and testing
    - Tools that dump results should follow a strictly defined format
    suggested by the community
    - All similar kinds of algorithms should use same evaluation tools
    - ...

There may be separate layers: a preprocessing layer, an algorithms layer,
an evaluation layer, and so on.

This way users would be aware of the steps they need to perform, and one
step can be replaced by an alternative.
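
The "matrix in, matrix out" rule and the idea of replaceable steps can be
sketched as below. This is a minimal illustration in plain Python, with
lists of lists standing in for Mahout matrices; run_pipeline and the three
step functions are hypothetical names, not existing Mahout tools.

```python
def run_pipeline(matrix, steps):
    """Apply each step in order; every step takes and returns a matrix."""
    for step in steps:
        matrix = step(matrix)
    return matrix


def l1_normalize(matrix):
    # Preprocessing step: scale each row so its absolute values sum to 1.
    result = []
    for row in matrix:
        total = sum(abs(x) for x in row)
        result.append([x / total if total else 0.0 for x in row])
    return result


def binarize(matrix):
    # Alternative preprocessing step with the same contract.
    return [[1.0 if x else 0.0 for x in row] for row in matrix]


def row_sums(matrix):
    # Stand-in for an algorithm step; the output is again a matrix (n x 1).
    return [[float(sum(row))] for row in matrix]


if __name__ == "__main__":
    data = [[1, 3], [0, 0]]
    print(run_pipeline(data, [l1_normalize, row_sums]))
    print(run_pipeline(data, [binarize, row_sums]))
```

Because every step honours the same contract, swapping l1_normalize for
binarize changes one list entry and nothing else in the pipeline.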

Developers would contribute to the layer they feel comfortable with, and
would satisfy the expected input and output, to preserve the integrity.

Mahout has tools for nearly all of these layers, but personally when I use
Mahout (and I’ve been using it for a long time), I feel lost in the steps I
should follow.

Moreover, the refactoring may eliminate duplicate data structures, and
stick to Mahout matrices if available. All similarity measures operate on
Mahout Vectors, for example.

We, in the lab and in our company, do some of that. An example:

We implemented an HBase backed Mahout Matrix, which we use for our projects
where online learning algorithms operate on large input and learn a big
parameter matrix (one needs this for matrix factorization based
recommenders). Then the persistent parameter matrix becomes an input for
the live system. Then we used the same matrix implementation as the
underlying data store of Recommender DataModels. This was advantageous in
many ways:

    - Everyone knows that any dataset should be in Mahout matrix format, and
    applies appropriate preprocessing, or writes one
    - We can use different recommenders interchangeably
    - Any optimization on matrix operations applies everywhere
    - Different people can work on different parts (evaluation, model
    optimization, recommender algorithms) without bothering others
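
The arrangement described above, one matrix abstraction underneath both the
learned parameters and the recommender data model, can be sketched like
this. It is an illustration only: DictMatrix is an in-memory stand-in for
the HBase backed implementation, and Matrix, MatrixDataModel and
recommend_popular are hypothetical names, not Mahout classes.

```python
from abc import ABC, abstractmethod


class Matrix(ABC):
    """The only storage contract the rest of the code depends on."""

    @abstractmethod
    def get(self, row, col): ...

    @abstractmethod
    def set(self, row, col, value): ...

    @abstractmethod
    def row_ids(self): ...

    @abstractmethod
    def row(self, row): ...


class DictMatrix(Matrix):
    # In-memory stand-in; a persistent store would implement the same methods.
    def __init__(self):
        self._rows = {}

    def get(self, row, col):
        return self._rows.get(row, {}).get(col, 0.0)

    def set(self, row, col, value):
        self._rows.setdefault(row, {})[col] = value

    def row_ids(self):
        return list(self._rows)

    def row(self, row):
        return dict(self._rows.get(row, {}))


class MatrixDataModel:
    """User x item preferences exposed on top of any Matrix."""

    def __init__(self, matrix):
        self._matrix = matrix

    def users(self):
        return self._matrix.row_ids()

    def preferences(self, user):
        return self._matrix.row(user)


def recommend_popular(model, user, n=1):
    # Deliberately simple recommender: rank items the user has not rated
    # by the summed preferences of all other users.
    seen = set(model.preferences(user))
    totals = {}
    for other in model.users():
        if other == user:
            continue
        for item, value in model.preferences(other).items():
            if item not in seen:
                totals[item] = totals.get(item, 0.0) + value
    return sorted(totals, key=lambda item: (-totals[item], item))[:n]
```

The recommender sees only MatrixDataModel, so the storage behind it, or the
recommender itself, can be replaced without touching the other side.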

Apart from all this, I should say that I am always eager to contribute to
Mahout, as some of the committers already know.

Best Regards

Gokhan
