Finally coming back to this...

On Apr 8, 2011, at 3:35 PM, Frank Scholten wrote:

> On Mon, Apr 4, 2011 at 4:43 PM, Grant Ingersoll <[email protected]> wrote:
>> I've always wondered why Collections doesn't show up in IntelliJ when I 
>> point it at the top level POM.
>> 
>> FWIW, you are active in the community in terms of discussions, etc.   
>> Emeritus isn't solely about code.
>> 
>> Longer term, I'd love us to have a REST based front end for Mahout.  Submit 
>> jobs over REST, run them in the cluster or on EMR, etc.   Ted's Thrift 
>> server in Mahout in Action might be the basis for one.
>> That's maybe more than "plumbing repair"
> 
> REST interface would be really cool indeed. Maybe Restlet could be
> used for that? See www.restlet.org

Yeah, that's what I was thinking, and I've got the start of one thanks to a really 
long plane ride (mostly because I've been meaning to learn Restlet).   The thing 
is, our command line interface is nice (although it's getting a bit cumbersome in 
terms of the number of items in it -- just try running ./mahout --help to see 
what I mean), but most people want programmatic access, and the APIs are not 
exactly clear-cut or well documented.
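To make "programmatic access over REST" concrete, here's about the smallest possible sketch of a front end. It deliberately uses the JDK's built-in com.sun.net.httpserver rather than Restlet (so it has no dependencies), and the /version path and response string are made up purely for illustration -- none of this is Mahout API:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Smallest possible "REST front end": one GET endpoint. In a real version,
// Restlet Resource classes would replace all of this plumbing.
class MiniRestFrontEnd {
    private HttpServer server;

    // Starts the server on an ephemeral port and returns that port.
    int start() throws Exception {
        server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/version", exchange -> {
            byte[] body = "mahout-sketch-0.1".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server.getAddress().getPort();
    }

    void stop() {
        server.stop(0);
    }
}
```

The recommender and classifier operations discussed below would each become another context/resource hanging off the same server.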

<aside>
> This makes a great example because it shows all the kinds of plumbing needed
> for a classifier farm.  It isn't easily
> generalized because the actual classification operation needed is not very
> general.  In real life, people need to send
> 1000 classification requests at once.  Or they need to run a single data
> point against 1000 models.  Or ... stuff.

Ted, would it make sense to start with the first case, i.e. a high-volume server 
for classifying?  We don't have to solve the world's problems all at once.  
That one seems a bit more realistic and easier to do: you have a model, 
it's loaded up, requests come in, you classify them, lather, rinse, repeat.
</aside>
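That "model loaded, requests come in" case could start as little more than a thread pool around one shared model. A rough sketch in plain Java -- the Model interface and every name here are stand-ins, not Mahout's API:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stand-in for a trained classifier; in Mahout this would be a real model.
interface Model {
    String classify(double[] features);
}

// Load the model once at startup, then serve classification requests from a
// pool of workers: requests come in, we classify them, lather, rinse, repeat.
class ClassifierServer {
    private final Model model;            // loaded once, shared by all workers
    private final ExecutorService pool;   // handles concurrent requests

    ClassifierServer(Model model, int threads) {
        this.model = model;
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // Each request is classified against the single shared model.
    Future<String> submit(double[] features) {
        return pool.submit(() -> model.classify(features));
    }

    // Convenience wrapper that blocks for the result.
    String classifyBlocking(double[] features) {
        try {
            return submit(features).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

The "1000 requests at once" case is then just many submit() calls in flight; the "one point against 1000 models" case is the part this sketch doesn't cover.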

Back to restlets, I started with the recommenders, but I think it can be 
extended to clustering and classification and that it can account for the 
different approaches Ted described in the earlier email, but it will take a bit 
to get there.

For the recommenders, the start of what I have can (or will be able to), amongst 
other things, input ratings, get recommendations, delete ratings, etc.   It still 
needs a configuration framework for storing settings (such as which recommender 
to use, how to persist the model, etc.).  It would also be cool if it could 
replicate/distribute the model, perhaps using a distributed data store like 
Cassandra.  I can see this also doing other things like spawning the offline 
Hadoop jobs and then loading up the model when done (possibly even on Amazon 
using the EC2/EMR SDK).
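The three rating operations above are easy to pin down even before the REST layer exists. Here's an in-memory sketch of the store a recommender resource would wrap -- the "recommender" is just highest-average-rating-you-haven't-seen, a toy stand-in for a real Mahout recommender, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// In-memory stand-in for the rating store a REST front end would expose:
// setRating / deleteRating / recommend mirror the operations listed above.
class RatingStore {
    // userId -> (itemId -> rating)
    private final Map<Long, Map<Long, Double>> ratings = new HashMap<>();

    void setRating(long user, long item, double value) {
        ratings.computeIfAbsent(user, u -> new HashMap<>()).put(item, value);
    }

    void deleteRating(long user, long item) {
        Map<Long, Double> m = ratings.get(user);
        if (m != null) m.remove(item);
    }

    // Recommend up to n items the user hasn't rated, ranked by the average
    // rating other users gave them (a toy scoring rule, not a real recommender).
    List<Long> recommend(long user, int n) {
        Set<Long> seen = ratings.getOrDefault(user, Map.of()).keySet();
        Map<Long, List<Double>> byItem = new HashMap<>();
        for (Map.Entry<Long, Map<Long, Double>> e : ratings.entrySet()) {
            if (e.getKey() == user) continue;
            e.getValue().forEach((item, r) ->
                byItem.computeIfAbsent(item, i -> new ArrayList<>()).add(r));
        }
        return byItem.entrySet().stream()
            .filter(e -> !seen.contains(e.getKey()))
            .sorted(Comparator.comparingDouble((Map.Entry<Long, List<Double>> e) ->
                e.getValue().stream().mapToDouble(Double::doubleValue)
                    .average().orElse(0)).reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

Mapping these onto resources (PUT a rating, GET recommendations, DELETE a rating) is then purely Restlet plumbing; swapping in a distributed store like Cassandra would only touch the Map internals.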

For classifiers/clustering, it seems like one should be able to start simple:

1. Train
2. Test (including cross-validation)
3. Classify/Cluster (for each algorithm) both sequentially and on M/R (again, 
could submit files to external resources like Amazon)
4. Add/Delete/Update examples (for training and testing)
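Steps 1-3 above could hang off one small interface; here's one possible shape, with a toy nearest-centroid classifier just to make it runnable. None of these names are Mahout's, and step 4 would simply mutate the labeled-example store before re-training:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One possible shape for the train / test / classify lifecycle above.
interface ClassifierJob {
    void train(Map<String, List<double[]>> labeledExamples);  // step 1
    double test(Map<String, List<double[]>> heldOut);         // step 2: accuracy
    String classify(double[] example);                        // step 3
}

// Toy implementation: label each point with its nearest class centroid.
class NearestCentroidJob implements ClassifierJob {
    private final Map<String, double[]> centroids = new HashMap<>();

    public void train(Map<String, List<double[]>> labeled) {
        centroids.clear();
        labeled.forEach((label, examples) -> {
            int dim = examples.get(0).length;
            double[] c = new double[dim];
            for (double[] x : examples)
                for (int i = 0; i < dim; i++) c[i] += x[i] / examples.size();
            centroids.put(label, c);
        });
    }

    public String classify(double[] x) {
        String best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : centroids.entrySet()) {
            double d = 0;
            for (int i = 0; i < x.length; i++)
                d += (x[i] - e.getValue()[i]) * (x[i] - e.getValue()[i]);
            if (d < bestDist) { bestDist = d; best = e.getKey(); }
        }
        return best;
    }

    public double test(Map<String, List<double[]>> heldOut) {
        int total = 0, correct = 0;
        for (Map.Entry<String, List<double[]>> e : heldOut.entrySet())
            for (double[] x : e.getValue()) {
                total++;
                if (e.getKey().equals(classify(x))) correct++;
            }
        return total == 0 ? 0 : (double) correct / total;
    }
}
```

The sequential-vs-M/R split would live behind train() -- the same interface could dispatch to a local loop or spawn a Hadoop job.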

I realize this is non-trivial and there are a lot of details to work out, 
particularly on the spawning of M/R jobs and the "single data point against 
multiple models" approach, but the rest isn't as bad, I don't think.

I haven't thought about the other things yet, like frequent pattern mining, 
but that should also work, especially once we have a framework in place for 
submitting jobs, etc.

-Grant
